[GE users] Qsub strange behaviours

spow_ miomax_ at hotmail.com
Tue Jul 27 13:13:49 BST 2010


Hi Reuti,

Reuti wrote:

the syntax needs to be revised:

qsub [ options ] [ command [ command_args ]]

i.e. the qsub options come first, then the command, and the options to the command/script last.
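
To make the ordering concrete, here is a minimal shell sketch. The queue name H2 comes from this thread; the script name job.sh and its arguments are hypothetical. The command line is only printed, not submitted, so the snippet runs without a cluster.

```shell
#!/bin/sh
# Sketch only: job.sh and its arguments are made-up names.
# Order: qsub options first, then the script, then the script's own arguments.
queue="H2"                        # queue name taken from this thread
script="job.sh"                   # hypothetical job script
script_args="--input data.txt"    # hypothetical arguments handed to job.sh

cmd="qsub -q $queue $script $script_args"
echo "$cmd"                       # printed only; run the line on a submit host to submit
```

Anything placed after job.sh is passed to the script and is not interpreted by qsub, which is why a `-q` written after the script name has no effect on queue selection.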


Yep, my bad.


Which doesn't send the job where it is supposed to. It picks a queue randomly.
While reading the manpages, I found the -hard option, but it doesn't work either.


- Also, I use subordinate queues. The problem is that only part of the queue gets suspended! (E.g. my queue sub1 runs across 2 hosts, and only sub1 at host1 gets suspended, whereas it seems clear to me that the whole queue should be suspended.)


How is the subordination defined in the queue setup (`qconf -sq ...`)?


H2 is a parallel queue to which L2 is subordinate. I truncated parts (the none/infinity entries) of the following configuration:

qconf -sq H2
qname                 H2
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 NONE
ckpt_list             NONE
pe_list               make mpi
rerun                 FALSE
slots                 8
epilog                NONE
shell_start_mode      posix_compliant
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      L2=5

I have the exact same H1 and L1 queues (clones) and they have the same problem.
The L queues span 2 nodes. If a job is running on node 1 and L gets suspended, the job on node 1 is suspended. However, if a job is dispatched to node 2, it keeps running. So it looks like subordination only happens on one node (I can't test with more nodes; my test server only has 2 of them).
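
A sketch of how the threshold in `subordinate_list L2=5` would be evaluated, assuming the rule described in queue_conf(5) applies here: subordination acts between queue instances on the same host, so L2 on a host is suspended once 5 or more H2 slots are in use on that host. The node names and slot counts below are made up for illustration.

```shell
#!/bin/sh
# Sketch of the per-host threshold rule, assuming queue_conf(5) semantics:
# L2 on a host is suspended when H2 on that same host uses >= threshold slots.
threshold=5    # from "subordinate_list L2=5"

l2_state() {
    # $1 = number of H2 slots in use on this host
    if [ "$1" -ge "$threshold" ]; then
        echo "suspended"
    else
        echo "running"
    fi
}

echo "node1: L2 $(l2_state 6)"   # hypothetical: 6 H2 slots busy on node1
echo "node2: L2 $(l2_state 2)"   # hypothetical: 2 H2 slots busy on node2
```

Under that assumption each host is decided independently, which would match L2 being suspended on one node and still running on the other.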



- Finally, I ran parallel jobs with $round_robin allocation. If I submit a limited number of jobs, they get dispatched correctly.
But if a few jobs are already running in the parallel queues,


Why do you have many parallel queues? The idea behind SGE is to specify resource requests, and SGE will select an appropriate queue for you. It's not like Torque, where you submit into a queue.
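
As a rough illustration of the difference, the sketch below prints a queue-based submission next to a request-based one. The PE name mpi comes from the pe_list shown earlier in this thread; job.sh and the slot count are hypothetical. Nothing is submitted, the two command lines are only echoed.

```shell
#!/bin/sh
# Sketch only: job.sh and the slot count are made-up values.
slots=8                                   # hypothetical number of slots wanted

by_queue="qsub -q H2 job.sh"              # naming a queue explicitly
by_request="qsub -pe mpi $slots job.sh"   # requesting a PE and slots, queue left to SGE

echo "$by_queue"
echo "$by_request"
```

With the second form the scheduler is free to pick any queue whose pe_list contains the requested parallel environment.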


Well, I have many parallel queues because parallel jobs vary in size and number.
Therefore, a parallel queue spans nodes 1 to x. This range is further subdivided into parallel queues (e.g. 1*12, 2*6, 4*3 ...), with sequence numbers decreasing from node 1.

