[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Fri Jan 14 18:47:40 GMT 2005


On Fri, 2005-01-14 at 11:51, Sean Dilda wrote:
> Craig Tierney wrote:
> > I have been running SGE6.0u1 for a few months now on a new system.
> > I have noticed very long delays, or even SGE hangs, when starting
> > large jobs.  I just tried this on the latest CVS source and
> > the problem persists.
> > 
> > It appears that the hang occurs while the job is moved from 'qw' to 't'.
> > In general the system does continue to operate normally.  However
> > the delays can be large, 30-60 seconds.  By 'hang' I mean that
> > commands like qsub and qstat block until the job
> > has finished migrating to the 't' state.  Sometimes the delays
> > are long enough to get GDI failures.  Since qmaster is threaded,
> > I wonder why I get the hangs.
> 
> I've seen similar things.  It's even worse if you try to qdel a large 
> job like that.   Like you, I have an entirely Linux-based cluster and am 
> using classic spooling over NFS.   At this point, my best guess is that 
> it's slowdowns in the spooling module combined with sub-optimal use of 
> threads.  However, I haven't gotten much beyond that.

By that point, qmaster or the scheduler has already picked the nodes
to use.  My guess is that the server holds a mutex while talking
to each of the nodes during startup, or that during database writes
the data is exclusively locked, even for readers.  I couldn't figure
out how to get debugging output from qmaster to verify this.
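
To make the guess concrete, here is a toy sketch of the pattern I suspect.
It is not qmaster code.  Every name in it (run, dispatch, the state dict,
the sleep standing in for contacting a node) is made up for illustration.
It only shows that if one lock is held across the slow per-node work, a
read-only query like qstat waits for the whole dispatch, while doing the
slow work outside the lock lets the reader return right away.

```python
import threading
import time


def run(hold_lock_during_io, nodes=5, per_node_delay=0.05):
    """Return how long a qstat-like reader waited during a job dispatch.

    Hypothetical model only: 'lock' stands in for whatever protects the
    job state, and time.sleep() stands in for contacting one node.
    """
    lock = threading.Lock()
    state = {"job": "qw"}
    started = threading.Event()

    def dispatch():
        if hold_lock_during_io:
            # Suspected behaviour: lock held for the entire dispatch.
            with lock:
                started.set()
                for _ in range(nodes):
                    time.sleep(per_node_delay)
                state["job"] = "t"
        else:
            # Alternative: do the slow work first, lock only for the update.
            started.set()
            for _ in range(nodes):
                time.sleep(per_node_delay)
            with lock:
                state["job"] = "t"

    worker = threading.Thread(target=dispatch)
    worker.start()
    started.wait()

    # The "qstat" side: it only wants to read the current job state.
    t0 = time.monotonic()
    with lock:
        _ = state["job"]
    waited = time.monotonic() - t0

    worker.join()
    return waited


if __name__ == "__main__":
    print("lock held during node I/O: reader waited %.3fs" % run(True))
    print("lock held only for update: reader waited %.3fs" % run(False))
```

In the first case the reader's wait grows with the number of nodes, which
would match what I see with large jobs.  Whether qmaster actually behaves
this way is exactly what I have not been able to verify.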

I can see why a qsub or a qdel could hang during the transition
period.  However, I really would like qstat to return.

Craig

