[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Mon Feb 14 13:31:21 GMT 2005


On Mon, 2005-02-14 at 06:47, Stephan Grell - Sun Germany - SSG -
Software Engineer wrote:
> Hello Craig,
> 
> I took a look at the problem, filed a bug for it, and I think I fixed
> it. At least the startup is much faster now in my environment.
> 
> If you have the time, can you test the fix? I modified the maintrunk.
> 
> Issue: 1461
> 
> Cheers,
> Stephan
> 

Thanks.  Our downtime was delayed, so I will get something built
for Wednesday off the maintrunk and test it then.  I will let you
know.

Craig


> Craig Tierney wrote:
> 
> >I have been running SGE6.0u1 for a few months now on a new system.
> >I have noticed very long delays, or even SGE hangs, when starting
> >large jobs.  I just tried this on the latest CVS source and
> >the problem persists.
> >
> >It appears that the hang occurs while the job is moved from 'qw' to 't'.
> >In general the system does continue to operate normally.  However,
> >the delays can be large, 30-60 seconds.  By 'hang' I mean that
> >system commands like qsub and qstat block until the job
> >has finished moving to the 't' state.  Sometimes the delays
> >are long enough to cause GDI failures.  Since qmaster is threaded,
> >I wonder why I get the hangs.
> >
> >I have tried debugging the situation.  Either the hang is in qmaster,
> >or sge_schedd is not printing enough information.
> >
> >Here is some of the sge_schedd debug output for a 256-CPU job
> >using a cluster queue.
> >
> > 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0129 R=slots U=2.000000
> > 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0130 R=slots U=2.000000
> > 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0131 R=slots U=2.000000
> > 79350   7886 16384     Found NOW assignment
> > 79351   7886 16384     reresolve port timeout in 536
> > 79352   7886 16384     returning cached port value: 536
> >scheduler tries to schedule job 179999.1 twice
> > 79353   7886 16384        added 0 ticket orders for queued jobs
> > 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
> > 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
> > 79356   7886 16384     reresolve port timeout in 536
> > 79357   7886 16384     returning cached port value: 536
> > 79358   7886 16384     ec_get retrieving events - will do max 3 fetches
> >
> >The hang happens after line 79352.  In this instance the message
> >indicates the scheduler tried twice.  Other times, I get a timeout
> >at this point.  In either case, the output pauses in the same
> >manner that a call to qsub or qstat would.
> >
> >I have followed the optimization procedures listed on the website
> >and they didn't seem to help (though I might have missed some).
> >
> >I don't have any information from sge_qmaster.  I tried several
> >different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
> >stop providing information after daemonizing.
> >
> >System configuration:
> >
> >Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon)
> >Clients (execd) run on SUSE 9.1, x86_64 (3.2 GHz EM64T)
> >SGE is configured to use old style spooling over NFS
> >
> >I can provide more info; I just don't know where to go from here.
> >
> >Thanks,
> >Craig
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >  
> >
> 
> 
> 

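The 'hang' described above (client commands such as qstat blocking while a
large job moves from 'qw' to 't') can be put into numbers by timing a client
command repeatedly while the job is being started. The following is only a
minimal sketch of that idea, not something from the thread: the helper names
time_command and probe are hypothetical, and it assumes the command being
timed (for example qstat) is on PATH.

```python
import subprocess
import time


def time_command(argv, timeout=120.0):
    """Run argv once and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = completed.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code


def probe(argv, samples=5, interval=1.0):
    """Time argv `samples` times, pausing `interval` seconds between runs.

    Returns the list of elapsed times in seconds.
    """
    results = []
    for i in range(samples):
        elapsed, _code = time_command(argv)
        results.append(elapsed)
        if i + 1 < samples:
            time.sleep(interval)
    return results
```

Calling probe(["qstat"], samples=60) in one shell while the large job is
submitted from another should show the 30-60 second stalls as outliers in the
returned list, if the behaviour matches the description above.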

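The J=... records in the sge_schedd debug excerpt above are plain K=value
pairs, so they are easy to tally when checking how much of a large job the
scheduler has placed. Below is a rough sketch of such a tally. The function
names are hypothetical, and the reading of the fields (J as job.task, R as the
resource name, U as the amount used) is an assumption that the thread does not
spell out.

```python
import re
from collections import defaultdict

# One "K=value" pair; a value runs until the next " K=" or the end of the line.
FIELD = re.compile(r"([A-Z])=(.*?)(?=\s+[A-Z]=|\s*$)")


def parse_record(line):
    """Return the K=value fields of one debug line as a dict, or None."""
    start = line.find("J=")
    if start < 0:
        return None  # not a job record, e.g. "Found NOW assignment"
    fields = dict(FIELD.findall(line[start:]))
    # Some list archives render "@" as " at "; accept either form.
    if "O" in fields:
        fields["O"] = fields["O"].replace(" at ", "@")
    return fields


def slots_per_job(lines):
    """Sum the U values of all R=slots records, keyed by the J field."""
    totals = defaultdict(float)
    for line in lines:
        rec = parse_record(line)
        if rec and rec.get("R") == "slots" and "U" in rec:
            totals[rec["J"]] += float(rec["U"])
    return dict(totals)


# Lines taken from the excerpt above (one left in the archive's " at " form).
SAMPLE = """\
 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0129 R=slots U=2.000000
 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0130 R=slots U=2.000000
 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
 79350   7886 16384     Found NOW assignment
"""

print(slots_per_job(SAMPLE.splitlines()))  # {'179999.1': 6.0}
```

Run over the full debug output for the 256-CPU job, the total for job 179999.1
would be expected to reach 256 if every record was captured.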



