[GE users] scheduler bottleneck?

Andy Schwierskott andy.schwierskott at sun.com
Wed Mar 31 10:03:19 BST 2004


Bryan,

Which SGE version are you using?

Do you have parallel jobs running? Are they requesting a PE range (like
"-pe xyz 4-16")?

What is the profile output now?

Andy

> Thanks for the pointer.  I did some of the things suggested by the tuning
> guide and set the FLUSH_ params to 0, and things appear better now.
>
>
> Charu Chaubal wrote:
> > Hi Bryan,
> >
> > One common approach when you have many jobs with runtimes on the order
> > of a few seconds is to increase the number of slots per host.  This way,
> > you can effectively package several jobs together.  It's true that
> > sometimes you might have two jobs on the same CPU, but on average the
> > hope is that the jobs will run such that no two overlap for very long.
> >
> > I assume you've looked at the tuning guide here:
> >
> > http://gridengine.sunsource.net/project/gridengine/howto/tuning.html
> >
> > Regards,
> >     Charu
> >
> >
> > Bryan Bayerdorffer wrote:
> >
> >> I don't know if this is related to my earlier unresolved problem
> >> (http://gridengine.sunsource.net/servlets/BrowseList?listName=users&by=thread&from=1703).
> >>
> >>
> >> Right now we have a situation in which there are about 3000 pending
> >> jobs being dispatched to ~50 exec hosts (1 slot each).  The majority
> >> of these jobs have *extremely* short runtimes---just a few seconds.
> >> The result is that many hosts are idle for a long time (a minute or
> >> so) waiting for new jobs to be dispatched.  Users are complaining
> >> because the total throughput for this job mix is a lot lower than it
> >> was with LSF.  I'm wondering if the SGEEE scheduler is a bottleneck
> >> here.  I have the schedule interval set to 10 seconds.  I enabled
> >> profiling, and it seems that each scheduling run takes about 45
> >> seconds.  This is on a 450MHz Ultra 60 with local /var/spool/sge, the
> >> same host that used to run the LSF master.
> >>
> >> Anything I can tune to improve the performance for short jobs?  I've
> >> thought of packaging several small jobs as one, but that would require
> >> big changes in the way batch submission is scripted, and it's also
> >> somewhat difficult to predict the runtime.
> >>
> > >> What's "generate and send orders"?
> >>
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.320 s, pass 0: 0.180 s, pass 1: 0.000, pass2: 0.000, calc: 0.350 s
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job ticket calculation: init: 0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE update orders: job orders: 0.590 s, update orders: 0.030 s
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.020 s
> > >> Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s
> > >> Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s
> > >> Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: scheduled in 10.600 (u 10.400 + s 0.000 = 10.400): 8 fast, 0 complex, 2817 orders, 80 H, 267 Q, 621 QA, 0 J(qw), 53 J(r), 0 J(s), 0 J(h), 0 J(e), 8 J(x), 2812 J(all) 4 C, 1 ACL, 1 PE, 1 CONF, 116 U, 1 D, 0 PRJ, 1 ST, 0 CKPT, 0 RU
> > >> Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s
> > >> Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)

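To compare profile output before and after a tuning change, it can help to reduce the PROF lines to a per-phase breakdown. The following is only a rough sketch, not part of Grid Engine: the function names (parse_phases, report) and the regular expression are made up for illustration, and it assumes the "took" lines have already had their quote markers and line wrapping removed, as in the embedded sample copied from the log quoted above.

```python
import re

# Profile lines copied from the quoted message (quote markers and wrapping removed).
LOG = """\
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE pending job ticket calculation took 1.500 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE active job ticket calculation took 0.020 s
Tue Mar 30 17:21:19 2004|schedd|hai7|I|PROF: SGEEE job sorting took 0.160 s
Tue Mar 30 17:21:28 2004|schedd|hai7|I|PROF: SGEEE job dispatching took 8.430 s
Tue Mar 30 17:22:00 2004|schedd|hai7|I|PROF: generate and send orders took: 32.020 s
Tue Mar 30 17:22:01 2004|schedd|hai7|I|PROF: schedd run took: 44.570 s (copying the lists took: 1.400 s)
"""

# Hypothetical pattern: "<phase> took[:] <seconds> s" following the PROF: tag.
# Only the first "took" on a line is used, so the parenthesised
# "copying the lists" figure on the last line is ignored.
PROF_RE = re.compile(r"PROF: (?P<phase>.+?) took:? (?P<secs>\d+\.\d+) s")


def parse_phases(text):
    """Return {phase name: seconds} for every matching 'took' line."""
    phases = {}
    for line in text.splitlines():
        match = PROF_RE.search(line)
        if match:
            phases[match.group("phase")] = float(match.group("secs"))
    return phases


def report(phases, total_key="schedd run"):
    """Return (phase, seconds, share of the total run), largest first."""
    total = phases[total_key]
    rows = [
        (name, secs, secs / total)
        for name, secs in phases.items()
        if name != total_key
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, secs, share in report(parse_phases(LOG)):
        print(f"{name:45s} {secs:7.3f} s  {share:6.1%}")
```

On the numbers quoted above, "generate and send orders" accounts for roughly 72% of the 44.570 s schedd run, so that is the phase to watch when comparing a new profile against the old one.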