[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Mon Jan 17 17:49:46 GMT 2005


On Mon, 2005-01-17 at 05:34, Stephan Grell - Sun Germany - SSG -
Software Engineer wrote:
> Hi Craig,
> 
> the qmaster is processing the scheduler start orders between lines 79352
> and 79353 and pending ticket orders between 79353 and 79355. It looks, as
> if the scheduler and clients hang, while the master is busy processing the
> start orders.
>  
> What is a large job for you?

Right now 256 processors, but we would like to launch jobs
on the whole system, ~600 processors.



>  How big is your grid? 

600 processors, (300 nodes).

> What spooling method are you using?

NFS, BDB doesn't support standard failover methods because you
can't put a BDB on NFS yet (or correct me on this).

> Can you send us the scheduler profiling output for a such a scenario?
> You can enable the profiling with "qconf -msconf" add "params profile=1".
> The scheduler message file will contain the profiling output.

Here is the profile for a 256 processor run.

01/17/2005 17:58:52|schedd|g0805|I|PROF: schedd run took: 0.560 s (init: 0.050 s, copy: 0.240 s, run:0.170, free: 0.100 s, jobs: 1, categories: 1)
01/17/2005 17:59:01|schedd|g0805|I|PROF: sge_mirror processed 9 events in 0.000 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: static urgency took 0.000 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: normalizing job tickets took 0.000 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: create active job orders: 0.010 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: job-order calculation took 0.040 s
01/17/2005 17:59:01|schedd|g0805|I|PROF: job sorting took 0.000 s
01/17/2005 17:59:02|schedd|g0805|I|PROF: job dispatching took 0.710 s
01/17/2005 17:59:34|schedd|g0805|I|PROF: create pending job orders: 0.000 s
01/17/2005 17:59:34|schedd|g0805|I|PROF: scheduled in 32.890 (u 0.820 + s 0.010 = 0.830): 0 sequential, 1 parallel, 4 orders, 383 H, 1108 Q, 1236 QA, 0 J(qw), 2 J(r), 0 J(s), 0 J(h), 0 J(e), 0 J(x), 2 J(all), 117 C, 1 ACL, 16 PE, 2 U, 1 D, 2 PRJ, 0 ST, 0 CKPT, 0 RU, 128 gMes, 0 jMes
01/17/2005 17:59:50|schedd|g0805|I|PROF: send orders and cleanup took: 15.130 (u 0.000,s 0.000) s
01/17/2005 17:59:51|schedd|g0805|I|PROF: schedd run took: 49.470 s (init: 0.000 s, copy: 0.220 s, run:49.160, free: 0.090 s, jobs: 2, categories: 2)
01/17/2005 17:59:51|schedd|g0805|I|PROF: profiling summary:
01/17/2005 17:59:51|schedd|g0805|I|PROF: other          : wc =      0.670s, utime =      0.000s, stime =      0.010s, utilization =   1%
01/17/2005 17:59:51|schedd|g0805|I|PROF: packing        : wc =      0.020s, utime =      0.020s, stime =      0.000s, utilization = 100%
01/17/2005 17:59:51|schedd|g0805|I|PROF: eventclient    : wc =      0.010s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: mirror         : wc =      0.000s, utime =      0.010s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: gdi            : wc =     58.650s, utime =      0.040s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: ht-resize      : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: scheduler      : wc =      0.240s, utime =      0.200s, stime =      0.000s, utilization =  83%
01/17/2005 17:59:51|schedd|g0805|I|PROF: pending ticket : wc =      1.470s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: job sorting    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: job dispatching: wc =      0.710s, utime =      0.750s, stime =      0.010s, utilization = 107%
01/17/2005 17:59:51|schedd|g0805|I|PROF: send orders    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: scheduler event: wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
01/17/2005 17:59:51|schedd|g0805|I|PROF: copy lists     : wc =      2.150s, utime =      0.950s, stime =      0.010s, utilization =  45%
01/17/2005 17:59:51|schedd|g0805|I|PROF: total          : wc =     64.000s, utime =      1.990s, stime =      0.030s, utilization =   3%
01/17/2005 17:59:53|schedd|g0805|I|PROF: sge_mirror processed 764 events in 0.100 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: static urgency took 0.000 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: normalizing job tickets took 0.000 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: create active job orders: 0.020 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: job-order calculation took 0.040 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: create pending job orders: 0.000 s
01/17/2005 17:59:53|schedd|g0805|I|PROF: scheduled in 0.250 (u 0.230 + s 0.000 = 0.230): 0 sequential, 0 parallel, 5 orders, 383 H, 1108 Q, 1236 QA, 0 J(qw), 2 J(r), 0 J(s), 0 J(h), 0 J(e), 0 J(x), 2 J(all), 117 C, 1 ACL, 16 PE, 2 U, 1 D, 2 PRJ, 0 ST, 0 CKPT, 0 RU, 128 gMes, 0 jMes



> 
> It looks, as if the order processing is too slow. And you are right, none
> of the performance tuning hints addresses that problem. When we have more
> information, we will know, if and how we can help you.

Qmaster has to talk to 128+ nodes.  On systems like this, it is probably
done linearly and just takes a buch of time.  The issue is, why does the
multi-threaded qmaster block when doing this?  Why can't another thread
respond to requests?  If not for all operations, at least for actions
that read the database.

Thanks,
Craig


> 
> Cheers,
> Stephan
> 
> Craig Tierney wrote:
> 
> >I have been running SGE6.0u1 for a few months now on a new system.
> >I have noticed very long delays, or even SGE hangs, when starting
> >large jobs.  I just tried this on the latest CVS source and
> >the problem persists.
> >
> >It appears that the hang while the job is moved from 'qw' to t.
> >In general the system does continue to operate normally.  However
> >the delays can be large, 30-60 seconds.  'Hang' is defined as
> >system commands like qsub and qstat will delay until the job
> >has finished migrating to the 't' status.  Sometimes the delays
> >are long enough to get GDI failures.  Since qmaster is threaded,
> >I wonder why I get the hangs.
> >
> >I have tried debugging the situation.  Either the hang is in qmaster,
> >or sge_schedd is not printing enough information.
> >
> >Here is some of the text from the sge_schedd debug for a 256 cpu job
> >using a cluster queue.
> >
> > 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129 R=slots U=2.000000
> > 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130 R=slots U=2.000000
> > 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
> > 79350   7886 16384     Found NOW assignment
> > 79351   7886 16384     reresolve port timeout in 536
> > 79352   7886 16384     returning cached port value: 536
> >scheduler tries to schedule job 179999.1 twice
> > 79353   7886 16384        added 0 ticket orders for queued jobs
> > 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
> > 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
> > 79356   7886 16384     reresolve port timeout in 536
> > 79357   7886 16384     returning cached port value: 536
> > 79358   7886 16384     ec_get retrieving events - will do max 3 fetches
> >
> >The hang happens after line 79352.  In this instance the message
> >indicates the scheduler tried twice.  Other times, I get a timeout
> >at this point.  In either case, the output pauses in the same
> >manner that a call to qsub or qstat would.
> >
> >I have followed the optimization procedures listed on the website
> >and they didn't seem to help (might have missed some though).
> >
> >I don't have any information from sge_qmaster.  I tried several
> >different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
> >stop providing information after daemonizing.
> >
> >System configuration:
> >
> >Qmaster runs on Fedora Core 2, x86, (2.2 Ghz Xeon)
> >clients (execd) run on Suse 9.1 x86_64, (3.2 Ghz EM64T)
> >SGE is configured to use old style spooling over NFS
> >
> >I can provide more info, I just don't know where to go from here.
> >
> >Thanks,
> >Craig
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >  
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list