[GE users] Strange behavior with sge_qmaster

Sean Dilda agrajag at dragaera.net
Tue Jul 6 14:23:43 BST 2004

In my SGE6 cluster (using classic spooling over NFS) I've started to see
some weird behavior over this weekend.  There are a couple of things
I've noticed, and I'm not certain if they're related or not.  I was
wondering if anyone else here might have seen them and might know what's
going on.

The first thing is that with parallel jobs, it seems that there are some
files that sge_qmaster likes to rewrite somewhat often.  Under the
sge_qmaster spool dir's jobs/ directory, there is a file for each
non-parallel job and a directory for each parallel job.  It seems that
with non-parallel jobs, the file gets written once at the start and
that's it.  However, in the directory for parallel jobs, there's a file
for each processor requested, and a couple of extra 'common' files.  For
some reason sge_qmaster wants to rewrite all of these files at least
once a minute.  With several large jobs running at once and the spooldir
sitting on NFS, this can create a bit of load/slowdown for sge_qmaster. 
It does seem odd that the parallel files have to keep getting rewritten
when the non-parallel ones don't.  If nothing else, I'm pretty sure this
is responsible for the slow response time (several seconds) I get from
running commands like 'qstat'.
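
One way to put numbers on the rewrite frequency described above is to
snapshot file modification times under the jobs/ directory and compare
them between polls.  The following is only a minimal sketch in Python,
not anything shipped with SGE; the function names and the polling
interval are illustrative assumptions, and the spool path has to be
supplied by whoever runs it.

```python
import os
import time


def snapshot_mtimes(root):
    """Map every file under root to its last-modification time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                # The file vanished between listing and stat
                # (for example, the job finished); skip it.
                continue
    return mtimes


def rewritten_files(before, after):
    """Return paths present in both snapshots whose mtime changed."""
    return sorted(
        path for path, mtime in after.items()
        if path in before and before[path] != mtime
    )


def watch(root, interval=60, rounds=5):
    """Print how many files were rewritten during each polling interval."""
    previous = snapshot_mtimes(root)
    for _ in range(rounds):
        time.sleep(interval)
        current = snapshot_mtimes(root)
        changed = rewritten_files(previous, current)
        print(f"{len(changed)} of {len(current)} files rewritten "
              f"in the last {interval}s")
        previous = current
```

Calling watch() with the path to the qmaster spool dir's jobs/
directory would show, per interval, whether the per-processor files of
parallel jobs really are the ones being touched.  Note that this only
detects that a file was rewritten, not why.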

Something else I've had happen several times this weekend is that SGE
will stop scheduling jobs.  There will be several jobs submitted to SGE,
for which there are resources, yet SGE will not launch the jobs.  If I
shut down sge_qmaster, then start it up again, those jobs are launched
immediately.  I have a feeling that the scheduling loop may be
stopping.  I have schedd_job_info set to false.  However, when this
occurs and I change it to true, scheduling info for the jobs never shows
up, no matter how long I wait.  Originally I had flush_submit_sec and
flush_finish_sec set to '1'.  When this started I changed them
back to '0', but the problem didn't go away.

We also have this problem where a user will 'qdel' a parallel job.  The
job will be killed on all the nodes, but it will continue to show up
in 'dr' state in qstat until I 'qdel -f' it.
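
To keep track of which jobs are left behind in 'dr' state, the qstat
output can be filtered by its state column.  This is a rough Python
sketch under stated assumptions: it assumes plain 'qstat' prints one
job per line with the numeric job ID first and the state in the fifth
whitespace-separated column, which may not hold for every version or
set of options.  The function names are hypothetical, and the script
only lists jobs; it does not run 'qdel -f' itself.

```python
import subprocess


def jobs_in_state(qstat_text, state="dr", state_column=4):
    """Return job IDs whose state column equals `state`.

    state_column is a zero-based index into the whitespace-split line;
    the default of 4 is an assumption about the qstat column layout.
    """
    job_ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the separator line, and lines that are too short.
        if len(fields) <= state_column or not fields[0].isdigit():
            continue
        if fields[state_column] == state:
            job_ids.append(fields[0])
    return job_ids


def stuck_deleted_jobs():
    """Run qstat and list the jobs it still shows in 'dr' state."""
    result = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    )
    return jobs_in_state(result.stdout)
```

Running stuck_deleted_jobs() some time after a 'qdel' would give the
job IDs that are still lingering, which could then be reviewed by hand
before deciding whether 'qdel -f' is appropriate.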

Any help would be appreciated,

