[GE users] Strange behavior with sge_qmaster
andy.schwierskott at sun.com
Tue Jul 6 15:23:16 BST 2004
please file an Issue for the qdel problem with the MPI jobs as well. This
is definitely a bug if the job doesn't go away after the shepherd has
exited.
On Tue, 6 Jul 2004, Sean Dilda wrote:
> On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
> > Sean,
> > > In my SGE6 cluster (using classic spooling over NFS) I've started to see
> > > some weird behavior over this weekend. There are a couple of things
> > > I've noticed, and I'm not certain if they're related or not. I was
> > > wondering if anyone else here might have seen them and might know what's
> > > going on.
> > >
> > > The first thing is that with parallel jobs, it seems that there are some
> > > files that sge_qmaster likes to rewrite fairly often under the
> > > sge_qmaster spool dir's jobs/ directory. There is a file for each
> > > non-parallel job and a directory for each parallel job. It seems that
> > > with non-parallel jobs, the file gets written once at the start and
> > > that's it. However, in the directory for parallel jobs, there's a file
> > > for each processor requested, and a couple of extra 'common' files. For
> > > some reason sge_qmaster wants to rewrite all of these files at least
> > > once a minute. With several large jobs running at once and the spooldir
> > > sitting on NFS, this can create a bit of load/slowdown for sge_qmaster.
> > > It does seem odd that the parallel files have to keep getting rewritten
> > > when the non-parallel ones don't. If nothing else, I'm pretty sure this
> > > is responsible for the slow response time (several seconds) I get from
> > > running commands like 'qstat'.
> > --> Please file it as a bug in Issuezilla.
> I'll try to get it in sometime today.
> > > Something else I've had happen several times this weekend is that SGE
> > > will stop scheduling jobs. There will be several jobs submitted to SGE,
> > > for which there are resources, yet SGE will not launch the jobs. If I
> > > shut down sge_qmaster, then start it up again, those jobs are launched
> > > immediately. I have a feeling that the scheduling loop may be
> > > stopping. I have schedd_job_info set to false. However when this
> > > occurs, I change it to true, yet no matter how long I wait, scheduling
> > > info for the jobs never shows up. Originally I had flush_submit_sec and
> > > flush_finish_sec set to '1'. However when this started I changed them
> > > back to '0', but the problem didn't go away.
> > --> Ditto. Please provide more information, e.g. what does
> > qconf -sss
> > show? If qmaster doesn't get orders from the scheduler, you will get a
> > "no scheduling host defined" answer.
> I'll try that next time the problem occurs.
> > Is the scheduler busy (at least from time to time)?
> I didn't actually check that before. Something else I'll check next
> time I see the problem.
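A sketch of the scheduler checks suggested above (assumes admin access on
the qmaster host; all commands are standard qconf diagnostics):

```shell
# Show which host the scheduler is registered on; if qmaster has lost
# contact with sge_schedd, this reports "no scheduling host defined".
qconf -sss

# Check whether the sge_schedd process is actually running there.
ps -ef | grep '[s]ge_schedd'

# Dump the scheduler configuration (schedd_job_info, flush_submit_sec,
# flush_finish_sec, ...) to confirm the current settings.
qconf -ssconf
```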
> > > We also have this problem where a user will 'qdel' a parallel job. The
> > > job will be killed on all the nodes, however it will continue to show up
> > > in 'dr' state in qstat until I 'qdel -f' it.
> > So is it a tightly integrated parallel job? What type of parallel job is
> > it? MPICH?
> > Does that mean that all child processes of the shepherd have gone away?
> > What about the shepherd on the nodes itself?
> These are tightly integrated parallel jobs. It's using MPICH and ssh.
> On the nodes, the shepherd process is no longer running. And there are
> no processes owned by the person who ran the job.
> Here is my parallel environment setup:
> pe_name low-all
> slots 1024
> user_lists NONE
> xuser_lists NONE
> start_proc_args /usr/share/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args /usr/share/sge/mpi/stopmpi.sh
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
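For comparison, a PE definition like the one above can be inspected and
edited with qconf (a sketch; `low-all` is the PE name from the setup shown):

```shell
# Print the current definition of the parallel environment.
qconf -sp low-all

# Open the PE definition in $EDITOR, e.g. to adjust control_slaves.
qconf -mp low-all

# List all registered PEs to confirm the name exists.
qconf -spl
```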
> Thanks for the feedback,
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net