[GE users] Strange behavior with sge_qmaster

Andy Schwierskott andy.schwierskott at sun.com
Tue Jul 6 15:23:16 BST 2004


Sean,

please file an issue for the qdel problem with the MPI jobs as well. It is
definitely a bug if the job doesn't go away after the shepherd has
exited.
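
When filing it, a list of the affected job IDs may help. Below is a rough
sketch in Python that picks the jobs stuck in 'dr' state out of captured
qstat output. It assumes the default qstat column layout (job ID in the
first column, state in the fifth) and job names without spaces; the function
name and the sample rows are invented for illustration only.

```python
def jobs_in_state(qstat_text, wanted="dr"):
    """Return the job IDs whose state column equals `wanted`.

    Assumption: default qstat layout, i.e. whitespace-separated columns
    with the job ID first and the state fifth.
    """
    job_ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and anything that is not a job row.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # A job may appear on several rows; report each ID only once.
        if fields[4] == wanted and fields[0] not in job_ids:
            job_ids.append(fields[0])
    return job_ids


# Invented sample in the assumed layout, not real cluster output.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    101 0.55500 mpi_run    alice        dr    07/06/2004 09:12:44 low.q@node01                       8
    102 0.55500 serial     bob          r     07/06/2004 09:15:02 low.q@node02                       1
    103 0.00000 mpi_big    alice        qw    07/06/2004 09:20:10                                   16
"""

print(jobs_in_state(SAMPLE))  # ['101']
```

Running the same function a few minutes after the qdel would show whether
the job is still listed even though the shepherd is gone.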

Andy

On Tue, 6 Jul 2004, Sean Dilda wrote:

> On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
> > Sean,
> >
> > > In my SGE6 cluster (using classic spooling over NFS) I've started to see
> > > some weird behavior over this weekend.  There are a couple of things
> > > I've noticed, and I'm not certain if they're related or not.  I was
> > > wondering if anyone else here might have seen them and might know what's
> > > going on.
> > >
> > > The first thing is that with parallel jobs, it seems that there are some
> > > files that sge_qmaster likes to rewrite somewhat often.  Under the
> > > sge_qmaster spool dir's jobs/ directory, there is a file for each
> > > non-parallel job and a directory for each parallel job.  It seems that
> > > with non-parallel jobs, the file gets written once at the start and
> > > that's it.  However, in the directory for parallel jobs, there's a file
> > > for each processor requested, and a couple of extra 'common' files.  For
> > > some reason sge_qmaster wants to rewrite all of these files at least
> > > once a minute.  With several large jobs running at once and the spooldir
> > > sitting on NFS, this can create a bit of load/slowdown for sge_qmaster.
> > > It does seem odd that the parallel files have to keep getting rewritten
> > > when the non-parallel ones don't.  If nothing else, I'm pretty sure this
> > > is responsible for the slow response time (several seconds) I get from
> > > running commands like 'qstat'.
> >
> > --> Please file it as a bug in Issuezilla.
>
> I'll try to get it in sometime today.
>
> >
> >
> > > Something else I've had happen several times this weekend is that SGE
> > > will stop scheduling jobs.  There will be several jobs submitted to SGE,
> > > for which there are resources, yet SGE will not launch the jobs.  If I
> > > shut down sge_qmaster, then start it up again, those jobs are launched
> > > immediately.  I have a feeling that the scheduling loop may be
> > > stopping.  I have schedd_job_info set to false.  However when this
> > > occurs, I change it to true, yet no matter how long I wait, scheduling
> > > info for the jobs never shows up.  Originally I had flush_submit_sec and
> > > flush_finish_sec set to '1'.  However when this started I changed them
> > > back to '0', but the problem didn't go away.
> >
> > --> Ditto. Please provide more information, e.g. what does
> >
> >     qconf -sss
> >
> > show? If qmaster doesn't get orders from the scheduler you will get a "no
> > scheduling host defined" answer.
> >
>
> I'll try that next time the problem occurs.
>
>     Is the scheduler busy (at least from time to time)?
>
> I didn't actually check that before.  Something else I'll check next
> time I see the problem.
>
> >
> >
> > > We also have this problem where a user will 'qdel' a parallel job.  The
> > > job will be killed on all the nodes, however it will continue to show up
> > > in 'dr' state in qstat until I 'qdel -f' it.
> >
> > So is it a tightly integrated parallel job? What type of parallel job is
> > it? MPICH?
> >
> > Does it mean that all child processes of the shepherd have gone away?
> >
> > What about the shepherd itself on the nodes?
> >
>
> These are tightly integrated parallel jobs.  It's using MPICH and ssh.
> On the nodes, the shepherd process is no longer running.  And there are
> no processes owned by the person who ran the job.
>
> Here is my parallel environment setup:
>
> pe_name           low-all
> slots             1024
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/share/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args    /usr/share/sge/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
>
> Thanks for the feedback,
>
>
> Sean
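
For the issue about the spool files, it may help to attach numbers on how
many files under jobs/ actually get rewritten per minute. A minimal sketch
in Python, assuming that a rewrite shows up as a changed modification time;
the function names are made up for this example and the spool path has to be
filled in for your site.

```python
import os


def snapshot_mtimes(root):
    """Map every regular file under `root` to its mtime in nanoseconds."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                # The file vanished between listing and stat, e.g. the job ended.
                continue
    return mtimes


def rewritten_files(before, after):
    """Return the paths present in both snapshots whose mtime changed."""
    return sorted(
        path for path, mtime in after.items()
        if path in before and before[path] != mtime
    )
```

Taking one snapshot of the jobs/ directory, waiting 60 seconds, taking a
second one and passing both to rewritten_files() gives the list of files
that were rewritten in that minute. Comparing the result for a parallel
job's directory with the single file of a non-parallel job would document
the difference described above.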
