[GE users] Strange behavior with sge_qmaster

Andy Schwierskott andy.schwierskott at sun.com
Tue Jul 6 14:45:05 BST 2004


Sean,

> In my SGE6 cluster (using classic spooling over NFS) I've started to see
> some weird behavior over this weekend.  There are a couple of things
> I've noticed, and I'm not certain if they're related or not.  I was
> wondering if anyone else here might have seen them and might know what's
> going on.
>
> The first thing is that with parallel jobs, it seems that there are some
> files that sge_qmaster likes to rewrite somewhat often.  Under the
> sge_qmaster spool dir's jobs/ directory, there is a file for each
> non-parallel job and a directory for each parallel job.  It seems that
> with non-parallel jobs, the file gets written once at the start and
> that's it.  However, in the directory for parallel jobs, there's a file
> for each processor requested, and a couple of extra 'common' files.  For
> some reason sge_qmaster wants to rewrite all of these files at least
> once a minute.  With several large jobs running at once and the spooldir
> sitting on NFS, this can create a bit of load/slowdown for sge_qmaster.
> It does seem odd that the parallel files have to keep getting rewritten
> when the non-parallel ones don't.  If nothing else, I'm pretty sure this
> is responsible for the slow response time (several seconds) I get from
> running commands like 'qstat'.

--> Please file it as a bug in Issuezilla.
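
For the bug report it may help to show how many job spool files get
rewritten per minute. The following is only a sketch: count_recent_writes
is a hypothetical helper (not part of SGE), it relies on the -mmin option
that GNU and BSD find provide, and the spool path in the usage note is an
assumption based on a default installation.

```shell
# Hypothetical helper: count regular files under directory $1 that were
# modified within the last $2 minutes (default: 1).
count_recent_writes() {
    dir=$1
    minutes=${2:-1}
    # -mmin is supported by GNU and BSD find, but is not in POSIX.
    find "$dir" -type f -mmin "-$minutes" | wc -l | tr -d ' '
}
```

Usage, assuming the qmaster spool directory is in the default location
(adjust the path to your installation):

    count_recent_writes "$SGE_ROOT/$SGE_CELL/spool/qmaster/jobs"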


> Something else I've had happen several times this weekend is that SGE
> will stop scheduling jobs.  There will be several jobs submitted to SGE,
> for which there are resources, yet SGE will not launch the jobs.  If I
> shut down sge_qmaster, then start it up again, those jobs are launched
> immediately.  I have a feeling that the scheduling loop may be
> stopping.  I have schedd_job_info set to false.  However when this
> occurs, I change it to true, yet no matter how long I wait, scheduling
> info for the jobs never shows up.  Originally I had flush_submit_sec and
> flush_finish_sec set to '1'.  However when this started I changed them
> back to '0', but the problem didn't go away.

--> Ditto. Please also provide more information, e.g. what does

    qconf -sss

show? If qmaster is not getting orders from the scheduler, you will get a
"no scheduling host defined" answer.

Is the scheduler busy (at least from time to time)?
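
The two checks above could be scripted roughly as follows. This is a
minimal sketch: check_scheduler_reply is a hypothetical helper, not an SGE
command, and it only looks for the "no scheduling host defined" text
mentioned above. The ps check assumes the scheduler daemon is called
sge_schedd and that the script runs on the qmaster host.

```shell
# Hypothetical helper: classify the text that `qconf -sss` printed.
check_scheduler_reply() {
    reply=$1
    case "$reply" in
        *"no scheduling host defined"*)
            # qmaster does not know about a scheduler.
            echo "no-scheduler"
            ;;
        "")
            echo "empty-reply"
            ;;
        *)
            # Anything else is assumed to be the scheduler host name.
            echo "scheduler-host: $reply"
            ;;
    esac
}

# Only query SGE when qconf is actually on the PATH.
if command -v qconf >/dev/null 2>&1; then
    check_scheduler_reply "$(qconf -sss 2>&1)"
    # Is the scheduler daemon alive? Repeat this a few times to see
    # whether it uses CPU at least from time to time.
    ps -e | grep '[s]ge_schedd' || echo "sge_schedd not found on this host"
fi
```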


> We also have this problem where a user will 'qdel' a parallel job.  The
> job will be killed on all the nodes, however it will continue to show up
> in 'dr' state in qstat until I 'qdel -f' it.

So is it a tightly integrated parallel job? What type of parallel job is
it? MPICH?

Does it mean that all child processes of the shepherd have gone away?

What about the shepherd on the nodes themselves?
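
One way to answer the last two questions on an execution node is sketched
below. Both functions are hypothetical helpers, not SGE tools. They read
the output of `ps -e -o pid,ppid,args` from stdin and assume that the
shepherd's command name contains "sge_shepherd".

```shell
# Hypothetical helper: print the PIDs of shepherd processes found in
# `ps -e -o pid,ppid,args` style input (read from stdin).
shepherd_pids() {
    # Field 3 is the first word of the command line.
    awk '$3 ~ /sge_shepherd/ { print $1 }'
}

# Hypothetical helper: print the input lines whose parent PID is $1.
children_of() {
    awk -v parent="$1" '$2 == parent'
}
```

Usage on a node where the job was running:

    ps -e -o pid,ppid,args | shepherd_pids
    ps -e -o pid,ppid,args | children_of <shepherd pid>

If the first command prints a PID but the second prints nothing, the
shepherd is still there although its children have gone away.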

Andy
