[GE users] Qsub strange behaviours

reuti reuti at staff.uni-marburg.de
Thu Jul 29 19:24:15 BST 2010


On 29.07.2010, at 13:33, spow_ wrote:

> Hi,
> reuti wrote:
>>> 2 parallel queues represent half of the big queue, used more often.
>>> Eventually, 4 other queues represent 1/4th of the big queue.
>>> It 'looks like' this (where '=' is a node) :
>>> ==============  P1
>>> =======  ======  P2 & P3
>>> === ===  === ===   P4 & P5 & P6 & P7

Thanks, but: ...

>> Why do you have so many queues? You could just stay even with one parallel queue.
> My scheme relies on subordinates and sequence numbers. It also relies on the fact that the user knows (or at least should know) how many nodes he wants to use (max, average, min)
> P1 is very rarely used. It allows a user to submit an exceptionally big parallel computation which will stop anything running (it actually covers more hosts onto the right, which are used for sequential jobs on a regular basis)
> P2 & P3 are used for heavy parallel jobs. (P2 seq n° > P3 seq n°)
> P4-7 are used on a daily basis for regular-sized parallel jobs. (same as above for seq n°)

I still don't see the need for so many queues. Do you also request the desired queue in your qsub command? This might lead to a situation where P2 has jobs waiting while P3 is idle. What you need to keep jobs inside a pool of machines is multiple PEs, which you request with a wildcard, each bound to only one hostgroup (still in one and the same queue). Once a hostgroup is selected by SGE, it will use slots only from the machines in this hostgroup.

pe_list NONE,[@allhosts=mpi_all],[@part_a=mpi_high_a],[@part_b=mpi_high_b],[@sub_a=mpi_low_a],[@sub_b=mpi_low_b],[@sub_c=mpi_low_c],[@sub_d=mpi_low_d]
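As a sketch of how such a setup could be created (the hostgroup, PE, and queue names here mirror the pe_list line above but are otherwise assumptions, not taken from an existing configuration):

```shell
# Sketch only: define a hostgroup and a matching PE, then attach the PE
# to the one queue. Names (@part_a, mpi_high_a, all.q) are placeholders.
qconf -ahgrp @part_a       # opens an editor; list the member hosts there
qconf -ap mpi_high_a       # opens an editor; set slots, allocation_rule, ...
qconf -aattr queue pe_list mpi_high_a all.q   # append the PE to the queue's pe_list
```

The same pattern repeats for each hostgroup/PE pair; the wildcard request in the qsub command then lets SGE pick whichever pool has free slots.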

and submit:

$ qsub -pe mpi_low* 4 job.sh

This way you will of course replace the many queues with many PEs, but it could ease usage, as you don't have to decide which machines to use. Maybe it's also just my personal taste that I want to have as few queues as possible in a cluster. If you have a working setup which satisfies your needs, you can keep it.

> This allows SGE to fill up nodes from the left with parallel jobs, and to fill up nodes from the right with sequential jobs. (any suspended sequential job is killed and I cannot introduce checkpointing because the users cannot modify their code, even with Condor)

But you can use a checkpointing interface anyway. This way it can be set up to reschedule a job when it gets suspended. I can't find the original thread of the discussion which led to my suggestion to use a cron job for this purpose. Of course, a cron job doesn't require adding a checkpoint request, but one could be added unconditionally by a JSV.
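A minimal sketch of the cron approach (qstat's -s option and qmod's -rj option are standard SGE; the awk field handling assumes the usual two header lines in qstat's output, and xargs -r is a GNU extension):

```shell
# crontab entry (sketch): every 10 minutes, reschedule all suspended jobs
# so they restart elsewhere instead of sitting suspended.
# "qstat -s s" lists suspended jobs; "qmod -rj" forces a reschedule.
*/10 * * * * qstat -s s | awk 'NR>2 {print $1}' | xargs -r qmod -rj
```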

>>> <snip>
>>> So $round_robin seems to be the way to go.
>>> My actual problem with $round_robin is that if there are several parallel jobs running on the 2 hosts (my test farm only has 2 nodes), the latter one (and any follower) will get dispatched to only one host, whereas it could potentially be dispatched to 2 hosts in terms of free slots (because job N goes to node 1 and job N+1 goes to node 2)
>>> I tried to change the scheduler configuration to not use load_avg to dispatch jobs, but it still has the same behaviour.
>> np_load_avg is the default. Is there already something running on these nodes - do you request any resource like memory?
> I don't request any resources yet (mem_free, ram_free ...), except for slots ofc. The problem appears when there already are parallel jobs running on both the hosts, and if there only are one/two free slot per host (or some variation)

This is strange. For me $round_robin always distributes the tasks to different nodes, independent of the load and of running jobs.
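For comparison, a PE definition using $round_robin might look like this (a sketch; the name and slot count are placeholders, the remaining fields are the usual defaults):

```
pe_name            mpi_rr
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
```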

> , SGE will chose to dispatch the newly submitted parallel job on only one host. This can hardly be called a parallel job IMHO ^^

Of course, a job can be parallel on one and the same node, and it's common to speak of a parallel job in such cases too. It just refers to the fact that more than one process or thread is involved. Whether you oversubscribe a core, and/or use only one node or multiple nodes, doesn't matter.

-- Reuti


To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
