[GE users] Qsub strange behaviours

spow_ miomax_ at hotmail.com
Thu Jul 29 12:33:19 BST 2010



Hi,

reuti wrote:

2 parallel queues each represent half of the big queue and are used more often.
Additionally, 4 other queues each represent a quarter of the big queue.
It 'looks like' this (where '=' is a node) :

==============  P1
=======  ======  P2 & P3
=== ===  === ===   P4 & P5 & P6 & P7

Why do you have so many queues? You could even stay with just one parallel queue.

My scheme relies on subordinates and sequence numbers. It also relies on the fact that the user knows (or at least should know) how many nodes he wants to use (max, average, min).
P1 is very rarely used. It allows a user to submit an exceptionally big parallel computation which will stop anything running (it actually covers more hosts to the right, which are used for sequential jobs on a regular basis).
P2 & P3 are used for heavy parallel jobs (P2's sequence number > P3's sequence number).
P4-7 are used on a daily basis for regular-sized parallel jobs (same ordering of sequence numbers as above).

This allows SGE to fill up nodes from the left with parallel jobs and to fill up nodes from the right with sequential jobs. (Any suspended sequential job is killed, and I cannot introduce checkpointing because the users cannot modify their code, even with Condor.)


There are also sequential (batch) queues subordinated to these on the same nodes, symmetrical to those above.
Users will mostly use P4-7, but queues P1-P3 can be used when bigger jobs need faster compute time.
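
To make the layout concrete, here is a rough sketch of how one parallel queue and its subordinated sequential queue might be defined; the queue names (p4.q, s4.q), the host group and the numbers are placeholders for illustration, not my exact configuration:

    # excerpt of "qconf -sq p4.q" (parallel queue, placeholder values)
    qname             p4.q
    hostlist          @left_nodes
    seq_no            40
    pe_list           mpi
    slots             8
    subordinate_list  s4.q=1

    # excerpt of "qconf -sq s4.q" (sequential queue on the same hosts,
    # suspended as soon as one parallel slot is in use on the host)
    qname             s4.q
    hostlist          @left_nodes
    seq_no            140
    slots             8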


My allocation rule is $round_robin. I think it's the best choice here because the users will only use MPI for a while (until they change their code to allow OpenMP integration; someone else will then tweak what I did here so that users can use OpenMP + MPI at the same time). I base this assumption on this table:
http://www.hpccommunity.org/f55/multi-core-strategies-mpi-openmp-702/
It states that MPI runs much better when dispatched across many nodes rather than within a single node, which is really OpenMP's job.

I read it more in the way that it depends on the kind of application. One difference is that all OpenMP threads share the same memory, while MPI processes each use their own. Then it's a matter of communication: do you start the MPI tasks and collect just the results after hours of (local) computing, or is there heavy communication involved (where you also have to start thinking about using InfiniBand instead of Ethernet)? This is also stated on the page you mentioned, two paragraphs before the table: "The second assumption is that MPI programs must be spread across multiple nodes in order to run effectively. As Table One demonstrates, neither of the assumptions hold true 100% of the time."

It does indeed state that the MPI vs. OpenMP comparison is really a matter of which algorithm you use. But in most cases MPI computes faster when spread across multiple hosts than when it fills up one host.
Unfortunately I am unable to answer your question about communication yet, as the user concerned is on vacation.


So $round_robin seems to be the way to go.
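
For reference, the parallel environment itself is set up roughly like this (a sketch only; the PE name "mpi" and the slot count are placeholders):

    # excerpt of "qconf -sp mpi" (placeholder values)
    pe_name            mpi
    slots              999
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE

and jobs are submitted with something like: qsub -pe mpi 8 job.sh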

My actual problem with $round_robin is that if there are already several parallel jobs running on the 2 hosts (my test farm only has 2 nodes), the next submitted job (and any follower) gets dispatched onto only one host, whereas in terms of free slots it could potentially have been dispatched across the 2 hosts (because job N goes to node 1 and job N+1 goes to node 2).
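
In case someone wants to reproduce this, the slot placement can be checked roughly like this (the job id is just a placeholder):

    # per-task view: shows on which queue instances the MASTER and SLAVE
    # tasks of a parallel job were placed
    qstat -g t

    # scheduler's reasons for its decision (needs schedd_job_info true
    # in the scheduler configuration)
    qstat -j <jobid>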

I tried to change the scheduler configuration to not use load_avg to dispatch jobs, but it still has the same behaviour.

np_load_avg is the default. Is there already something running on these nodes - do you request any resource like memory?

I don't request any resources yet (mem_free, ram_free, ...), except for slots of course. The problem appears when there are already parallel jobs running on both hosts: if there is only one or two free slots per host (or some variation of that), SGE will choose to dispatch the newly submitted parallel job onto only one host. This can hardly be called a parallel job IMHO ^^
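
For completeness, this is roughly the kind of scheduler configuration change I tried (via qconf -msconf); the values below are only meant to illustrate it, not my exact settings:

    # excerpt of "qconf -ssconf" (illustrative values)
    queue_sort_method            seqno
    load_formula                 np_load_avg
    job_load_adjustments         NONE
    load_adjustment_decay_time   0:0:0
    schedd_job_info              true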



