[GE users] Qsub strange behaviours

reuti reuti at staff.uni-marburg.de
Tue Jul 27 16:13:18 BST 2010


On 27.07.2010 at 14:13, spow_ wrote:

>> <snip>
>> How is the subordination defined in the queue setup (`qconf -sq ...`)?
>>   
> H2 is a parallel queue to which L2 is subordinate. I truncated parts (none/infinity) of the following configuration:
> 
> qconf -sq H2
> qname                 H2
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 NONE
> ckpt_list             NONE
> pe_list               make mpi
> rerun                 FALSE
> slots                 8
> epilog                NONE
> shell_start_mode      posix_compliant
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      L2=5
> 
> I have the exact same H1 and L1 queues (clones) and they have the same problem.
> The L queues span 2 nodes. If a job is running on node 1 and L gets suspended, node 1 is suspended.

Yep, this is the way it works.
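
To make the per-host behaviour concrete: as far as I read queue_conf(5), an entry such as `subordinate_list L2=5` is evaluated per queue instance, i.e. separately on every host. The L2 instance on a host is suspended once 5 or more slots of H2 are filled on that same host, and L2 instances on other hosts are not touched. A minimal Python sketch of that rule follows; the function name and the host names are made up for illustration, and it only models the rule, it is not SGE code.

```python
def suspended_instances(h2_slots_used, threshold=5):
    """Return the hosts whose L2 queue instance would be suspended.

    h2_slots_used maps host name -> slots currently filled in H2 on that host.
    threshold is the value taken from `subordinate_list L2=5`.
    """
    # The check is done host by host; usage on one host never affects another.
    return sorted(host for host, used in h2_slots_used.items() if used >= threshold)


# Hypothetical two-node cluster: H2 is busy on node1 only.
usage = {"node1": 6, "node2": 2}
print(suspended_instances(usage))
```

Under this reading, a job in L2 on node 2 keeps running as long as fewer than 5 slots of H2 are filled on node 2, whatever happens on node 1.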


> However, if a job is dispatched on node 2, it keeps running. So it looks like subordination only happens on one node (I can't test with more nodes, my test server only has 2 of them)
>>> - Eventually, I ran parallel jobs, with $round_robin allocation. If I submit a limited number of jobs, they get correctly dispatched.
>>> But if a few jobs are already running in the parallel queues,
>>>     
>> 
>> Why do you have many parallel queues? The idea behind SGE is to specify resource requests, and SGE will select an appropriate queue for you. It's not like Torque, where you submit into a queue.
>>   
> Well, I have many parallel queues because parallel jobs are unequal in size and number.
> Therefore, a parallel queue spans from node 1 to x. This interval is further subdivided into parallel queues (e.g. 1*12, 2*6, 4*3 ...)

You mean PEs with fixed allocation rules?


> , with sequence numbers decreasing from node 1.
> Starting from the node on the far right, I have sequential queues whose sequence numbers also decrease, in the opposite direction. This way parallel and sequential jobs won't run on the same nodes until the cluster is somewhat full. If parallel jobs get dispatched to nodes where sequential jobs are running, the sequential queue gets suspended. The cron co-scheduler we spoke about 2 weeks ago then kills the sequential jobs to re-run them somewhere else.
> 
> Which means for the user that he can either trust SGE to dispatch the job, or specify the queue to have a chance never to get suspended.
> 
> Does it sound like a strange/bad configuration to you?

Yes, filling the cluster from both sides is a good approach.
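
The effect of the two opposing sequence-number orderings can be sketched as below. This assumes the scheduler sorts queue instances by sequence number (`queue_sort_method seqno` in the scheduler configuration) and picks the lowest number first; the host names, the numbering and the function `fill_order` are hypothetical and only illustrate the idea.

```python
def fill_order(hosts, job_type):
    """Return the hosts in the order they would be tried, lowest seq_no first."""
    n = len(hosts)
    if job_type == "parallel":
        # Parallel queues: the first node gets the lowest sequence number.
        seq = {h: i for i, h in enumerate(hosts)}
    else:
        # Sequential queues: numbered from the other end of the cluster.
        seq = {h: n - 1 - i for i, h in enumerate(hosts)}
    return sorted(hosts, key=seq.get)


hosts = ["node1", "node2", "node3", "node4"]
print(fill_order(hosts, "parallel"))    # starts at node1
print(fill_order(hosts, "sequential"))  # starts at node4
```

The two job types only meet in the middle once the cluster is fairly full, which is the point at which subordination starts to matter.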


-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=270676
