[GE users] Strange behavior with sge_qmaster
agrajag at dragaera.net
Tue Jul 6 15:03:59 BST 2004
On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
> > In my SGE6 cluster (using classic spooling over NFS) I've started to see
> > some weird behavior over this weekend. There are a couple of things
> > I've noticed, and I'm not certain if they're related or not. I was
> > wondering if anyone else here might have seen them and might know what's
> > going on.
> > The first thing is that with parallel jobs, it seems that there are some
> > files that sge_qmaster likes to rewrite somewhat often. Under the
> > sge_qmaster spool dir's jobs/ directory, there is a file for each
> > non-parallel job and a directory for each parallel job. It seems that
> > with non-parallel jobs, the file gets written once at the start and
> > that's it. However, in the directory for parallel jobs, there's a file
> > for each processor requested, and a couple of extra 'common' files. For
> > some reason sge_qmaster wants to rewrite all of these files at least
> > once a minute. With several large jobs running at once and the spooldir
> > sitting on NFS, this can create a bit of load/slowdown for sge_qmaster.
> > It does seem odd that the parallel files have to keep getting rewritten
> > when the non-parallel ones don't. If nothing else, I'm pretty sure this
> > is responsible for the slow response time (several seconds) I get from
> > running commands like 'qstat'.
> --> Please file it as a bug in Issuezilla.
I'll try to get it in sometime today.
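For the bug report, it might help to put numbers on how often the files under jobs/ get rewritten. Below is a minimal sketch in Python that polls a directory tree and lists the files whose modification time changed between two polls. The function names (snapshot, rewritten, watch) and the 60-second interval are my own illustrative choices, not anything SGE provides. It only detects rewrites that change the mtime, so treat the counts as a rough indication.

```python
import os
import time


def snapshot(root):
    """Map each file path under root to its modification time (ns)."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except OSError:
                # The file vanished between listing and stat,
                # e.g. because the job finished. Skip it.
                continue
    return mtimes


def rewritten(before, after):
    """Return paths present in both snapshots whose mtime changed."""
    return sorted(
        path for path, mtime in after.items()
        if path in before and before[path] != mtime
    )


def watch(root, interval=60, rounds=5):
    """Print which files were rewritten during each polling interval."""
    previous = snapshot(root)
    for _ in range(rounds):
        time.sleep(interval)
        current = snapshot(root)
        changed = rewritten(previous, current)
        print(f"{len(changed)} of {len(current)} files rewritten in {interval}s")
        for path in changed:
            print("  " + path)
        previous = current
```

Calling watch() on the qmaster spool dir's jobs/ directory (the exact path depends on the installation) should show whether only the per-job directories of parallel jobs keep changing, while the single files of non-parallel jobs stay untouched.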
> > Something else I've had happen several times this weekend is that SGE
> > will stop scheduling jobs. There will be several jobs submitted to SGE,
> > for which there are resources, yet SGE will not launch the jobs. If I
> > shut down sge_qmaster, then start it up again, those jobs are launched
> > immediately. I have a feeling that the scheduling loop may be
> > stopping. I have schedd_job_info set to false. However when this
> > occurs, I change it to true, yet no matter how long I wait, scheduling
> > info for the jobs never shows up. Originally I had flush_submit_sec and
> > flush_finish_sec set to '1'. However when this started I changed them
> > back to '0', but the problem didn't go away.
> --> Ditto. Please provide more information, e.g. what does
> qconf -sss
> show? If qmaster doesn't get orders from the scheduler you will get a "no
> scheduling host defined" answer.
I'll try that next time the problem occurs.
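So that the check is ready when the problem comes back, it could be wrapped in a small script. This is a rough sketch in Python, assuming only what is said above: that qconf -sss answers with "no scheduling host defined" when qmaster has no scheduler. The helper names are hypothetical, and I have not verified what exit code qconf returns in that case, so a non-zero exit is simply treated as "not confirmed".

```python
import subprocess

# Answer text quoted in this thread for a missing scheduler.
NO_SCHEDULER_MARKER = "no scheduling host defined"


def scheduler_registered(qconf_output):
    """Return False if the output carries the 'no scheduling host' answer."""
    return NO_SCHEDULER_MARKER not in qconf_output.lower()


def check_scheduler():
    """Run `qconf -sss` and return (looks_ok, raw_output)."""
    result = subprocess.run(
        ["qconf", "-sss"], capture_output=True, text=True
    )
    output = (result.stdout + result.stderr).strip()
    # Treat a failed command as "not confirmed" rather than guessing.
    looks_ok = result.returncode == 0 and scheduler_registered(output)
    return looks_ok, output
```

Running check_scheduler() while jobs are stuck and saving the raw output alongside the time would give something concrete to attach to the report.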
> Is the scheduler busy (at least from time to time?)
I didn't actually check that before. Something else I'll check next
time I see the problem.
> > We also have this problem where a user will 'qdel' a parallel job. The
> > job will be killed on all the nodes, however it will continue to show up
> > in 'dr' state in qstat until I 'qdel -f' it.
> So is it a tightly integrated parallel job? What type of parallel job is
> it? MPICH?
> Does it mean that all child processes of the shepherd have gone away?
> What about the shepherd on the nodes itself?
These are tightly integrated parallel jobs. They use MPICH and ssh.
On the nodes, the shepherd process is no longer running. And there are
no processes owned by the person who ran the job.
Here is my parallel environment setup:
start_proc_args /usr/share/sge/mpi/startmpi.sh $pe_hostfile
Thanks for the feedback,