[GE users] Long delay when submitting large jobs
ctierney at hpti.com
Fri Jan 14 18:29:58 GMT 2005
I have been running SGE6.0u1 for a few months now on a new system.
I have noticed very long delays, or even SGE hangs, when starting
large jobs. I just tried this on the latest CVS source and
the problem persists.
The hang appears to occur while the job moves from 'qw' to 't'.
In general the system does continue to operate normally. However,
the delays can be large, 30-60 seconds. By 'hang' I mean that
commands like qsub and qstat block until the job
has finished moving to the 't' state. Sometimes the delays
are long enough to cause GDI failures. Since qmaster is threaded,
I wonder why I get the hangs.
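One way to put numbers on these delays is to time a client command repeatedly while the large job leaves 'qw'. The following is a minimal sketch, not part of SGE: time_command, poll, and report are hypothetical helper names, and the 5-second "slow" threshold is an arbitrary assumption. It uses only the Python standard library and runs whatever command it is given, for example qstat.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=120.0):
    """Run argv once; return (elapsed_seconds, return_code or None on timeout)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        rc = proc.returncode
    except subprocess.TimeoutExpired:
        rc = None  # the command did not return within the timeout
    return time.monotonic() - start, rc


def poll(argv, samples=5, interval=1.0, threshold=5.0):
    """Time argv repeatedly; return a list of (elapsed, rc, slow) tuples."""
    results = []
    for _ in range(samples):
        elapsed, rc = time_command(argv)
        # 'slow' marks samples at or above the (assumed) threshold
        results.append((elapsed, rc, elapsed >= threshold))
        time.sleep(interval)
    return results


def report(results, out=sys.stdout):
    """Write one line per sample, flagging the slow ones."""
    for elapsed, rc, slow in results:
        mark = "  <-- slow" if slow else ""
        out.write(f"{elapsed:7.2f}s rc={rc}{mark}\n")
```

For example, report(poll(["qstat"], samples=60)) run on a submit host during the submission prints one timing line per sample; samples in the 30-60 second range would correspond to the hangs described above.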
I have tried debugging the situation. Either the hang is in qmaster,
or sge_schedd is not printing enough information.
Here is part of the sge_schedd debug output for a 256-CPU job
using a cluster queue.
79347 7886 16384 J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0129 R=slots U=2.000000
79348 7886 16384 J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0130 R=slots U=2.000000
79349 7886 16384 J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q@e0131 R=slots U=2.000000
79350 7886 16384 Found NOW assignment
79351 7886 16384 reresolve port timeout in 536
79352 7886 16384 returning cached port value: 536
scheduler tries to schedule job 179999.1 twice
79353 7886 16384 added 0 ticket orders for queued jobs
79354 7886 16384 SENDING 10 ORDERS TO QMASTER
79355 7886 16384 RESETTING BUSY STATE OF EVENT CLIENT
79356 7886 16384 reresolve port timeout in 536
79357 7886 16384 returning cached port value: 536
79358 7886 16384 ec_get retrieving events - will do max 3 fetches
The hang happens after line 79352. In this instance the message
indicates the scheduler tried twice. Other times, I get a timeout
at this point. In either case, the output pauses in the same
manner that a call to qsub or qstat would.
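To check how much of the job the scheduler had already assigned before the pause, the STARTING lines in the debug output can be tallied per job. This is a minimal sketch under assumptions: it treats each KEY=value pair as a single-letter key, reads R=slots with U as the slot amount (an interpretation of the log, not something confirmed here), and parse_line and slots_per_job are hypothetical helper names. It accepts queue instances written either as "queue@host" or as "queue at host".

```python
import re
from collections import defaultdict

# A single-letter key, '=', then a value that runs up to the next " KEY=" or end of line.
FIELD = re.compile(r"\b([A-Z])=(.*?)(?=\s+[A-Z]=|$)")


def parse_line(line):
    """Return a dict of the KEY=value fields, or None for non-assignment lines."""
    fields = {key: value.strip() for key, value in FIELD.findall(line)}
    if "J" not in fields or "U" not in fields:
        return None
    # Normalise the archive's "queue at host" form to "queue@host".
    fields["O"] = fields.get("O", "").replace(" at ", "@")
    return fields


def slots_per_job(lines):
    """Sum the U values of R=slots lines, keyed by the J field."""
    totals = defaultdict(float)
    for line in lines:
        rec = parse_line(line)
        if rec and rec.get("R") == "slots":
            totals[rec["J"]] += float(rec["U"])
    return dict(totals)
```

Feeding the full debug output through slots_per_job and comparing the total for job 179999.1 against the 256 requested CPUs would show whether the assignment was complete when the output paused.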
I have followed the optimization procedures listed on the website
and they didn't seem to help (might have missed some though).
I don't have any information from sge_qmaster. I tried several
different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
stop providing information after daemonizing.
Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon).
Clients (execd) run on SUSE 9.1, x86_64 (3.2 GHz EM64T).
SGE is configured to use old-style spooling over NFS.
I can provide more info; I just don't know where to go from here.