[GE users] Long delay when submitting large jobs

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Feb 14 13:47:48 GMT 2005


Hello Craig,

I took a look at the problem, filed a bug for it, and I think I have
fixed it. At least the startup is much faster now in my environment.

If you have the time, could you test the fix? The change is in the maintrunk.

Issue: 1461

Cheers,
Stephan

Craig Tierney wrote:

>I have been running SGE6.0u1 for a few months now on a new system.
>I have noticed very long delays, or even SGE hangs, when starting
>large jobs.  I just tried this on the latest CVS source and
>the problem persists.
>
>It appears that the hang occurs while the job is moved from 'qw' to 't'.
>In general the system does continue to operate normally.  However
>the delays can be large, 30-60 seconds.  By 'hang' I mean that
>system commands like qsub and qstat block until the job
>has finished migrating to the 't' state.  Sometimes the delays
>are long enough to get GDI failures.  Since qmaster is threaded,
>I wonder why I get the hangs.
>
>I have tried debugging the situation.  Either the hang is in qmaster,
>or sge_schedd is not printing enough information.
>
>Here is some of the text from the sge_schedd debug for a 256 cpu job
>using a cluster queue.
>
> 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0129 R=slots U=2.000000
> 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0130 R=slots U=2.000000
> 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0131 R=slots U=2.000000
> 79350   7886 16384     Found NOW assignment
> 79351   7886 16384     reresolve port timeout in 536
> 79352   7886 16384     returning cached port value: 536
>scheduler tries to schedule job 179999.1 twice
> 79353   7886 16384        added 0 ticket orders for queued jobs
> 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
> 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
> 79356   7886 16384     reresolve port timeout in 536
> 79357   7886 16384     returning cached port value: 536
> 79358   7886 16384     ec_get retrieving events - will do max 3 fetches
>
>The hang happens after line 79352.  In this instance the message
>indicates the scheduler tried twice.  Other times, I get a timeout
>at this point.  In either case, the output pauses in the same
>manner that a call to qsub or qstat would.
>
>I have followed the optimization procedures listed on the website,
>but they didn't seem to help (though I might have missed some).
>
>I don't have any information from sge_qmaster.  I tried several
>different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
>stop providing information after daemonizing.
>
>System configuration:
>
>Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon)
>clients (execd) run on SUSE 9.1, x86_64 (3.2 GHz EM64T)
>SGE is configured to use old-style spooling over NFS
>
>I can provide more info, I just don't know where to go from here.
>
>Thanks,
>Craig
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>  
>
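
The J=/T=/S=/D=/L=/O=/R=/U= records in the quoted sge_schedd output can be
tallied with a short script, for example to check how many slots the
scheduler had assigned to a job before the stall. This is only a rough
sketch: it assumes each record is a run of single-letter KEY=value pairs,
that J is the job.task id, R the resource name, and U the amount used.
The function names are made up for illustration and are not part of SGE.

```python
import re

# A key is one capital letter preceded by whitespace (or line start).
# The value runs up to the next " KEY=" or the end of the line, so a
# queue instance written as "qecomp.q at e0129" is kept in one piece.
FIELD_RE = re.compile(r"(?<!\S)([A-Z])=(.*?)(?=\s+[A-Z]=|\s*$)")


def parse_record(line):
    """Return the KEY=value pairs of one debug line as a dict.

    Lines without such pairs (e.g. "Found NOW assignment") give {}.
    """
    return dict(FIELD_RE.findall(line))


def slots_per_job(lines):
    """Sum the U field of all R=slots records, grouped by the J field."""
    totals = {}
    for line in lines:
        rec = parse_record(line)
        if rec.get("R") == "slots" and "J" in rec and "U" in rec:
            totals[rec["J"]] = totals.get(rec["J"], 0.0) + float(rec["U"])
    return totals
```

Feeding it the saved debug output, e.g. slots_per_job(open(path)), gives
one total per job; for a 256-slot job a lower total would suggest that
the excerpt was cut off before all assignments were printed.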





