[GE users] scheduler bottleneck?

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Wed Mar 31 15:09:30 BST 2004


Version 5.3p4.  After my changes, the time for a scheduling run dropped to about 15 s:

Tue Mar 30 18:16:52 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.280 s, pass 0: 0.170 s, pass 1: 0.000, pass2: 0.000, calc: 0.310 s
Tue Mar 30 18:16:52 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE update orders: job orders: 0.770 s, update orders: 0.030 s
Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.590 s
Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.030 s
Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.140 s
Tue Mar 30 18:16:58 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 5.130 s
Tue Mar 30 18:16:58 2004|schedd|hai7|I|PROF: scheduled in 7.360 (u 5.640 + s 0.000 = 5.640): 8 fast, 0 complex, 2561 orders, 80 H, 268 Q, 621 QA, 0 J(qw), 54 J(r), 0 J(s), 0 J(h), 0 J(e), 8 J(x), 2556 J(all) 4 C, 1 ACL, 1 PE, 1 CONF, 116 U, 1 D, 0 PRJ, 1 ST, 0 CKPT, 0 RU
Tue Mar 30 18:17:05 2004|schedd|hai7|I|PROF: generate and send orders took: 6.150 s
Tue Mar 30 18:17:05 2004|schedd|hai7|I|PROF: schedd run took: 15.480 s (copying the lists took: 1.370 s)
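The phase timings in the two profiles can also be compared mechanically. Below is a minimal Python sketch, assuming each log entry has been re-joined onto a single line (the mail wrapped them) and that the timing lines of interest all look like `PROF: <phase> took[:] <seconds> s`. The helper names `phase_times` and `report` are hypothetical, not part of SGE; the sample lines are copied from the two profiles in this thread.

```python
import re

# Sample PROF lines from the two scheduler profiles in this thread,
# each re-joined onto a single line.
BEFORE = [
    "Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s",
    "Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s",
    "Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s",
    "Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s",
    "Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)",
]
AFTER = [
    "Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.590 s",
    "Tue Mar 30 18:16:53 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.140 s",
    "Tue Mar 30 18:16:58 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 5.130 s",
    "Tue Mar 30 18:17:05 2004|schedd|hai7|I|PROF: generate and send orders took: 6.150 s",
    "Tue Mar 30 18:17:05 2004|schedd|hai7|I|PROF: schedd run took: 15.480 s (copying the lists took: 1.370 s)",
]

# Matches "PROF: <phase> took[:] <seconds> s". search() returns only the first
# match on a line, so the trailing "(copying the lists took: ...)" is ignored.
PROF_RE = re.compile(r"\|PROF: (?P<phase>.+?) took:? (?P<secs>\d+\.\d+) s")


def phase_times(lines):
    """Map each profiled phase name to its duration in seconds."""
    times = {}
    for line in lines:
        m = PROF_RE.search(line)
        if m:
            times[m.group("phase")] = float(m.group("secs"))
    return times


def report(before, after):
    """Print before/after durations and each phase's share of the whole run."""
    b, a = phase_times(before), phase_times(after)
    for phase in b:
        if phase in a:
            share_b = 100 * b[phase] / b["schedd run"]
            share_a = 100 * a[phase] / a["schedd run"]
            print(f"{phase:38s} {b[phase]:7.3f} s ({share_b:5.1f}%) -> "
                  f"{a[phase]:7.3f} s ({share_a:5.1f}%)")


if __name__ == "__main__":
    report(BEFORE, AFTER)
```

In these two runs "generate and send orders" falls from 32.020 s, about 72% of the 44.570 s run, to 6.150 s, about 40% of the 15.480 s run.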


Andy Schwierskott wrote:
> Bryan,
> 
> Which SGE version are you using?
> 
> Do you have parallel jobs running? Are they requesting a PE range (like
> "-pe xyz 4-16")?
> 
> What is the profile output now?
> 
> Andy
> 
> 
>>Thanks for the pointer.  I did some of the things suggested by the tuning
>>guide and set the FLUSH_ params to 0, and things appear better now.
>>
>>
>>Charu Chaubal wrote:
>>
>>>Hi Bryan,
>>>
>>>One common approach when you have many jobs that run for only a few
>>>seconds is to increase the number of slots per host.  This way, you can
>>>effectively package several jobs together.  It's true that two jobs
>>>might sometimes share the same CPU, but on average the hope is that
>>>the jobs will rarely overlap for very long.
>>>
>>>I assume you've looked at the tuning guide here:
>>>
>>>http://gridengine.sunsource.net/project/gridengine/howto/tuning.html
>>>
>>>Regards,
>>>    Charu
>>>
>>>
>>>Bryan Bayerdorffer wrote:
>>>
>>>
>>>>I don't know if this is related to my earlier unresolved problem
>>>>(http://gridengine.sunsource.net/servlets/BrowseList?listName=users&by=thread&from=1703).
>>>>
>>>>
>>>>Right now we have a situation in which there are about 3000 pending
>>>>jobs being dispatched to ~50 exec hosts (1 slot each).  The majority
>>>>of these jobs have *extremely* short runtimes---just a few seconds.
>>>>The result is that many hosts are idle for a long time (a minute or
>>>>so) waiting for new jobs to be dispatched.  Users are complaining
>>>>because the total throughput for this job mix is a lot lower than it
>>>>was with LSF.  I'm wondering if the SGEEE scheduler is a bottleneck
>>>>here.  I have the schedule interval set to 10 seconds.  I enabled
>>>>profiling, and it seems that each scheduling run takes about 45
>>>>seconds.  This is on a 450MHz Ultra 60 with local /var/spool/sge, the
>>>>same host that used to run the LSF master.
>>>>
>>>>Anything I can tune to improve the performance for short jobs?  I've
>>>>thought of packaging several small jobs as one, but that would require
>>>>big changes in the way batch submission is scripted, and it's also
>>>>somewhat difficult to predict the runtime.
>>>>
>>>>What's "generate and send orders"?
>>>>
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.320 s, pass 0: 0.180 s, pass 1: 0.000, pass2: 0.000, calc: 0.350 s
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE update orders: job orders: 0.590 s, update orders: 0.030 s
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.020 s
>>>>Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s
>>>>Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s
>>>>Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: scheduled in 10.600 (u 10.400 + s 0.000 = 10.400): 8 fast, 0 complex, 2817 orders, 80 H, 267 Q, 621 QA, 0 J(qw), 53 J(r), 0 J(s), 0 J(h), 0 J(e), 8 J(x), 2812 J(all) 4 C, 1 ACL, 1 PE, 1 CONF, 116 U, 1 D, 0 PRJ, 1 ST, 0 CKPT, 0 RU
>>>>Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s
>>>>Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
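The effect of more slots per host, and of a shorter scheduling run, on this job mix can be estimated with a back-of-envelope model. This is a rough Python sketch under simplifying assumptions, not a description of how the SGE scheduler actually works: new jobs reach a host only once per scheduling cycle, each slot gets at most one new job per cycle, and all slots on a host share a single CPU. The function name `drain_time` is hypothetical, the 3-second runtime is an assumed stand-in for "a few seconds", and the two cycle lengths are the measured "schedd run" times from the profiles above.

```python
def drain_time(jobs, hosts, slots, cycle, runtime):
    """Rough seconds needed to run `jobs` short jobs (hypothetical model).

    Assumptions, not documented SGE behaviour:
    - jobs are dispatched only once per scheduling cycle of `cycle` seconds;
    - each slot receives at most one new job per cycle;
    - all slots on a host share one CPU, so a host finishes at most one
      job every `runtime` seconds.
    """
    # Per-host rate is capped either by dispatch (slots per cycle) or by the CPU.
    per_host_rate = min(slots / cycle, 1.0 / runtime)  # jobs per second
    return jobs / (hosts * per_host_rate)


if __name__ == "__main__":
    JOBS, HOSTS, RUNTIME = 3000, 50, 3.0  # RUNTIME is an assumed value
    for cycle in (44.57, 15.48):          # measured "schedd run" times
        for slots in (1, 2, 4, 8):
            t = drain_time(JOBS, HOSTS, slots, cycle, RUNTIME)
            print(f"cycle {cycle:5.2f} s, {slots} slot(s)/host: ~{t / 60:5.1f} min")
```

Under this model one slot per host caps throughput at hosts/cycle jobs per second no matter how short the jobs are (3000 jobs take about 44.6 minutes with a 44.57 s cycle), while the CPU-bound floor is 3000 * 3 / 50 = 180 s, which more slots or a shorter cycle can approach.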

-- 
  .. ..-. ..- -.-. .- -. .-. . .- -.. - .... .. ... --. . - .- .-.. .. ..-. .
Bryan Bayerdorffer        bryan at meatspace.net           bryan at spd.analog.com
                      (Wit's End Computation Center)       (Analog Devices)

"All our ideas of life after death come from people speculating who
have never bothered actually to die.  Once people do die they seem
disinclined to share their post mortem experiences."
-- Mark R. Leeper
