[GE users] Qsub strange behaviours

spow_ miomax_ at hotmail.com
Thu Jul 29 08:50:46 BST 2010


Hi Reuti,

Sorry for the late answer; I had other matters to deal with yesterday.

> >> <snip>
> > However, if a job is dispatched on node 2, it keeps running. So it looks like subordination only happens on one node (I can't test with more nodes, my test server only has 2 of them)
> >>> - Eventually, I ran parallel jobs, with $round_robin allocation. If I submit a limited number of jobs, they get correctly dispatched.
> >>> But if a few jobs are already running in the parallel queues,
> >>>
> >>
> >> Why do you have many parallel queues? The idea behind SGE is to specify resource requests, and SGE will select an appropriate queue for you. It's not like Torque, where you submit into a queue.
> >>
> > Well, I have many parallel queues because parallel jobs vary in size and number.
> > Therefore, a parallel queue spans from node 1 to x. This interval is further subdivided into smaller parallel queues (e.g. 1*12, 2*6, 4*3 ...)
>
> You mean PEs with fixed allocation rules?

I'm not 100% sure what a fixed allocation rule is; my best guess is that it means giving an integer in the PE's allocation rule, representing the maximum number of allocated slots. In that case, no, I don't use a fixed allocation rule; I use $round_robin (more on that below).
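
For illustration, this is roughly what one of my PEs looks like (the name mpi_rr and the slot count are placeholders for this sketch, not my real setup):

   pe_name            mpi_rr
   slots              24
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $round_robin
   control_slaves     FALSE
   job_is_first_task  TRUE
   urgency_slots      min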

What I meant was that I have one big parallel queue spanning all hosts, which is rarely used.
Two parallel queues each represent half of the big queue and are used more often.
Finally, four other queues each represent a quarter of the big queue.
It 'looks like' this (where '=' is a node):

==============  P1
=======  ======  P2 & P3
=== ===  === ===   P4 & P5 & P6 & P7

There are also sequential (batch) subordinated queues running on these nodes, symmetrical to the ones above.
Users will mostly use P4-P7, but queues P1-P3 can be used when bigger jobs need a faster turnaround.
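
To make that concrete, one of the quarter-size queues together with its subordinated batch queue could look roughly like this (queue, hostgroup and PE names are made up for the sketch; only the relevant fields are shown):

   qname                 P4.q
   hostlist              @quarter1
   seq_no                40
   slots                 12
   pe_list               mpi_rr
   subordinate_list      B4.q=1

Here @quarter1 would be a hostgroup holding a quarter of the nodes, and B4.q the matching batch queue; the '=1' threshold just means B4.q is suspended on a host as soon as one slot of P4.q is in use there.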


My allocation rule is $round_robin. I think it's the best choice here because users will only rely on plain MPI for a while (until they adapt their code for OpenMP integration, at which point someone else will tweak what I did here to let users run OpenMP + MPI at the same time), and I base this assumption on this table:
http://www.hpccommunity.org/f55/multi-core-strategies-mpi-openmp-702/
They state that MPI runs much better when spread across many nodes rather than packed onto a single node, which is really OpenMP's job. So $round_robin seems to be the way to go.
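
In practice a user would then submit something like this (mpi_rr again being the example PE name from the sketch above):

   qsub -pe mpi_rr 8 -cwd run_mpi.sh

and with $round_robin I would expect those 8 slots to be handed out one host at a time across the hosts of the queue.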

My actual problem with $round_robin is that when several parallel jobs are already running on the 2 hosts (my test farm only has 2 nodes), the next job (and any job after it) gets dispatched on only one host, whereas it could potentially be dispatched on 2 hosts as far as free slots are concerned (because job N goes to node 1 and job N+1 goes to node 2).

I tried changing the scheduler configuration so that it does not use load_avg when dispatching jobs, but the behaviour stays the same.
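
By that I mean the load-related entries in the scheduler configuration (qconf -msconf); roughly this (a sketch, not my exact configuration):

   queue_sort_method            seqno
   load_formula                 slots
   job_load_adjustments         NONE

i.e. sort queues by their seq_no instead of by load, use the slots count rather than np_load_avg in the load formula, and don't add an artificial load for freshly started jobs.
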
Do I have to use complex(es) to make sure an MPI-parallel job always gets dispatched on 2+ hosts?

Thanks,
GQ



