[GE users] scheduler bottleneck?

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Wed Mar 31 00:57:28 BST 2004


I don't know if this is related to my earlier unresolved problem 
(http://gridengine.sunsource.net/servlets/BrowseList?listName=users&by=thread&from=1703). 


Right now we have about 3000 pending jobs being dispatched to ~50 exec hosts
(1 slot each).  The majority of these jobs have *extremely* short runtimes:
just a few seconds.  As a result, many hosts sit idle for a long time (a
minute or so) waiting for new jobs to be dispatched.  Users are complaining
because the total throughput for this job mix is a lot lower than it was with
LSF.  I'm wondering if the SGEEE scheduler is the bottleneck here.  I have the
schedule interval set to 10 seconds, but with profiling enabled each
scheduling run appears to take about 45 seconds.  This is on a 450MHz Ultra 60
with local /var/spool/sge, the same host that used to run the LSF master.
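
As a rough back-of-the-envelope check on those numbers, here is a minimal
sketch.  It assumes a crude model in which each slot receives at most one new
job per scheduling run, and it picks 5 seconds as a stand-in for "a few
seconds"; both are assumptions for illustration, not measurements.

```python
def estimate(slots, job_runtime_s, cycle_s, pending_jobs):
    """Crude model: each slot gets at most one new job per scheduler cycle."""
    # A slot cannot turn over faster than the slower of the two:
    # the scheduler cycle or the job itself.
    period = max(cycle_s, job_runtime_s)
    throughput = slots / period              # jobs finished per second
    utilization = job_runtime_s / period     # fraction of time a slot is busy
    drain_time_s = pending_jobs / throughput
    return throughput, utilization, drain_time_s


# 50 slots, ~45 s per scheduling run, 3000 pending jobs (from the post);
# the 5 s job runtime is an assumed value.
throughput, utilization, drain_time_s = estimate(
    slots=50, job_runtime_s=5.0, cycle_s=45.0, pending_jobs=3000
)
print(f"throughput : {throughput:.2f} jobs/s")
print(f"utilization: {utilization:.0%}")
print(f"drain time : {drain_time_s / 60:.0f} min")
```

Under this model the hosts would be busy only a small fraction of the time,
which is at least consistent with the idle periods described above.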

Is there anything I can tune to improve the performance for short jobs?  I've
thought of packaging several small jobs as one, but that would require big
changes in the way batch submission is scripted, and the runtimes are also
somewhat difficult to predict.
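
For reference, the packaging idea could be sketched roughly as follows.  This
is only an illustration: the helper names (chunk, wrapper_script), the batch
size, and the example commands are all hypothetical, and it only generates the
wrapper script text; how that script gets submitted is left to the existing
submission scripts.

```python
import shlex


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def wrapper_script(commands):
    """Build a shell script that runs each command (an argv list) in order.

    There is no `set -e`, so a failing command does not stop the rest.
    """
    lines = ["#!/bin/sh"]
    for argv in commands:
        lines.append(shlex.join(argv))  # quote arguments safely
    return "\n".join(lines) + "\n"


# Hypothetical short jobs: 7 invocations of a made-up ./run_case program.
jobs = [["./run_case", f"case {n}"] for n in range(7)]
for batch in chunk(jobs, 3):
    print(wrapper_script(batch))
```

With a fixed batch size the scheduler sees fewer, longer jobs, but the batch
runtime is only as predictable as the individual runtimes, which is the
difficulty mentioned above.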

What does the "generate and send orders" step in the profile below do?

Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.320 s, pass 0: 0.180 s, pass 1: 0.000, pass2: 0.000, calc: 0.350 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE update orders: job orders: 0.590 s, update orders: 0.030 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.020 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s
Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s
Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: scheduled in 10.600 (u 10.400 + s 0.000 = 10.400): 8 fast, 0 complex, 2817 orders, 80 H, 267 Q, 621 QA, 0 J(qw), 53 J(r), 0 J(s), 0 J(h), 0 J(e), 8 J(x), 2812 J(all) 4 C, 1 ACL, 1 PE, 1 CONF, 116 U, 1 D, 0 PRJ, 1 ST, 0 CKPT, 0 RU
Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s
Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)
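
To see where the time goes, the "took" entries from this profile can be
tabulated against the total run time.  The following is a minimal sketch: the
regular expression is an assumption that happens to fit the lines shown here
(once each wrapped entry is joined onto one line), not a documented log
format, and parse_prof is a made-up helper name.

```python
import re

# Matches e.g. "PROF: SGEEE job sorting took 0.160 s" and
# "PROF: generate and send orders took: 32.020 s".
PROF_RE = re.compile(r"PROF: (.+?) took:? ([0-9.]+) s")


def parse_prof(lines):
    """Map each profiled phase name to its duration in seconds."""
    timings = {}
    for line in lines:
        match = PROF_RE.search(line)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


# Entries copied from the profile in this post, one per line.
SAMPLE = [
    "Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s",
    "Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.020 s",
    "Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s",
    "Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s",
    "Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s",
    "Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)",
]

timings = parse_prof(SAMPLE)
total = timings["schedd run"]
# Print phases from slowest to fastest with their share of the whole run.
for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    if name != "schedd run":
        print(f"{name:45s} {secs:7.3f} s  {100 * secs / total:5.1f}%")
```

On these numbers, "generate and send orders" accounts for roughly 72% of the
44.570 s run, with job dispatching a distant second.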

-- 
  .. ..-. ..- -.-. .- -. .-. . .- -.. - .... .. ... --. . - .- .-.. .. ..-. .
Bryan Bayerdorffer       bryan at meatspace.net              bryan at spd.analog.com
                    (Wit's End Computation Center)           (Analog Devices)

"This isn't right.  This isn't even wrong."
                                                                  -- Hans Bethe
