[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Fri Jan 14 18:29:58 GMT 2005


I have been running SGE6.0u1 for a few months now on a new system.
I have noticed very long delays, or even SGE hangs, when starting
large jobs.  I just tried this on the latest CVS source and
the problem persists.

It appears that the hang occurs while the job is moved from 'qw' to 't'.
In general the system does continue to operate normally.  However,
the delays can be large, 30-60 seconds.  By 'hang' I mean that
client commands like qsub and qstat block until the job
has finished moving to the 't' state.  Sometimes the delays
are long enough to cause GDI failures.  Since qmaster is threaded,
I wonder why I get the hangs.
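
To put numbers on the delay, something like the following could be left
running in another terminal while the large job is submitted.  It is only
a rough sketch: it assumes qstat is on the PATH, and the polling interval,
the "slow" threshold and the function names are arbitrary choices of mine,
not anything provided by SGE.

```python
import subprocess
import time


def time_command(cmd, timeout=120.0):
    """Run cmd once and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        rc = proc.returncode
    except subprocess.TimeoutExpired:
        rc = None
    return time.monotonic() - start, rc


def poll(cmd, interval=1.0, slow_after=5.0, samples=60):
    """Call cmd repeatedly and report every call that was slow or failed.

    slow_after is a hypothetical threshold in seconds; adjust to taste.
    Returns the list of (elapsed_seconds, return_code) for reported calls.
    """
    reported = []
    for _ in range(samples):
        elapsed, rc = time_command(cmd)
        if elapsed >= slow_after or rc != 0:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {' '.join(cmd)} took {elapsed:.1f}s (rc={rc})")
            reported.append((elapsed, rc))
        time.sleep(interval)
    return reported
```

Calling poll(["qstat"]) right before the qsub should show when the client
commands start blocking and for how long, which can then be compared with
the timestamps in the sge_schedd output.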

I have tried debugging the situation.  Either the hang is in qmaster,
or sge_schedd is not printing enough information.

Here is part of the sge_schedd debug output for a 256-CPU job
using a cluster queue.

 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0129 R=slots U=2.000000
 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0130 R=slots U=2.000000
 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0131 R=slots U=2.000000
 79350   7886 16384     Found NOW assignment
 79351   7886 16384     reresolve port timeout in 536
 79352   7886 16384     returning cached port value: 536
scheduler tries to schedule job 179999.1 twice
 79353   7886 16384        added 0 ticket orders for queued jobs
 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
 79356   7886 16384     reresolve port timeout in 536
 79357   7886 16384     returning cached port value: 536
 79358   7886 16384     ec_get retrieving events - will do max 3 fetches

The hang happens after line 79352.  In this instance the message
indicates the scheduler tried to schedule the job twice.  Other times,
I get a timeout at this point.  In either case, the output pauses in
the same way that a call to qsub or qstat does.
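
To check that the scheduler really did assign all the slots before the
pause, the STARTING lines in the trace can be tallied per job.  This is a
minimal sketch that assumes the line format matches the sample output
above; I am reading J= as the job/task id, O= as the queue instance and
U= as the amount of the resource named by R=, which is inferred from the
output and not taken from any documentation.

```python
import re
from collections import defaultdict

# Matches trace lines such as
#   J=179999.1 T=STARTING S=... D=... L=Q O=qecomp.q@e0129 R=slots U=2.000000
# The O= field may contain spaces (archived mail shows "q at host"),
# so it is matched lazily up to the R= field.
LINE_RE = re.compile(
    r"J=(?P<job>\S+)\s+T=(?P<state>\S+)\s+S=\d+\s+D=\d+\s+L=\S+"
    r"\s+O=(?P<obj>.+?)\s+R=(?P<resource>\S+)\s+U=(?P<amount>[\d.]+)"
)


def tally_slots(lines):
    """Sum the U= values of 'T=STARTING ... R=slots' lines per job id."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("state") == "STARTING" and m.group("resource") == "slots":
            totals[m.group("job")] += float(m.group("amount"))
    return dict(totals)


def report(path):
    """Print one line per job found in the trace file at `path`."""
    with open(path) as fh:
        for job, slots in sorted(tally_slots(fh).items()):
            print(f"{job}: {slots:g} slots")
```

Running report() on a saved copy of the sge_schedd output should give 256
slots for the job here if the assignment completed before the pause.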

I have followed the optimization procedures listed on the website
and they didn't seem to help (might have missed some though).

I don't have any information from sge_qmaster.  I tried several
different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
stop providing information after daemonizing.

System configuration:

Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon)
Clients (execd) run on SUSE 9.1, x86_64 (3.2 GHz EM64T)
SGE is configured to use old-style (classic) spooling over NFS

I can provide more info, I just don't know where to go from here.

Thanks,
Craig

