[GE users] Qsub strange behaviours
reuti at staff.uni-marburg.de
Tue Jul 27 16:13:18 BST 2010
On 27.07.2010 at 14:13, spow_ wrote:
>> How is the subordination defined in the queue setup (`qconf -sq ...`)?
> H2 is a parallel queue to which L2 is subordinate. I truncated parts (the none/infinity entries) of the following configuration:
> qconf -sq H2
> qname H2
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype NONE
> ckpt_list NONE
> pe_list make mpi
> rerun FALSE
> slots 8
> epilog NONE
> shell_start_mode posix_compliant
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list L2=5
> I have the exact same H1 and L1 queues (clones) and they have the same problem.
> The L queues span 2 nodes. If a job is running on node 1 and L gets suspended, node 1 is suspended.
Yep, this is the way it works.
> However, if a job is dispatched on node 2, it keeps running. So it looks like subordination only happens on one node (I can't test with more nodes; my test server only has 2 of them).
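To illustrate why node 2 can keep running: as far as I understand the subordination mechanism, the threshold in `subordinate_list L2=5` is checked per host, against the slots of H2 in use on that host only. The following is a minimal sketch of that rule, not SGE code. The function name, the node names and the slot counts are hypothetical, and the "used slots >= threshold" comparison is my assumption about the semantics.

```python
# Hypothetical model, not SGE code: subordination is evaluated per host,
# i.e. per queue instance, never across the whole cluster queue.

def suspended_hosts(used_slots_by_host, threshold):
    """Return the hosts where the subordinate queue instance would be suspended.

    used_slots_by_host: slots of the superordinate queue (H2) in use per host.
    threshold: the number after '=' in subordinate_list (5 for "L2=5").
    Assumption: suspension triggers once used slots reach the threshold.
    """
    return sorted(
        host for host, used in used_slots_by_host.items() if used >= threshold
    )


# Made-up usage figures for a two-node test cluster.
h2_usage = {"node1": 6, "node2": 2}
print(suspended_hosts(h2_usage, threshold=5))
```

Under these assumptions only the L2 instance on node1 is suspended, while the one on node2 keeps running until H2 also uses 5 slots there.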
>>> - Eventually, I ran parallel jobs, with $round_robin allocation. If I submit a limited number of jobs, they get correctly dispatched.
>>> But if a few jobs are already running in the parallel queues,
>> Why do you have many parallel queues? The idea behind SGE is to specify resource requests, and SGE will select an appropriate queue for you. It's not like Torque, where you submit into a queue.
> Well, I have many parallel queues because parallel jobs are unequal in size and number.
> Therefore, a parallel queue spans nodes 1 to x. This interval is further subdivided into parallel queues (e.g. 1*12, 2*6, 4*3 ...)
You mean PEs with fixed allocation rules?
> , with sequence numbers decreasing from node 1.
> Starting from the node on the far right, I have sequential queues whose sequence numbers also decrease, in the opposite direction. This way parallel and sequential jobs won't run on the same nodes until the cluster is somewhat full. If parallel jobs get dispatched to nodes where sequential jobs are running, the sequential queue gets suspended. The cron co-scheduler we spoke about 2 weeks ago then kills the sequential jobs to re-run them somewhere else.
> Which means that the user can either trust SGE to dispatch the job, or specify the queue to have a chance of never getting suspended.
> Does it sound like a strange/bad configuration to you?
Yes, filling the cluster from both sides is a good approach.
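The two-sided numbering can be sketched as follows. This is only an illustration of the idea, not a tool that touches the queue configuration. The function names and node names are hypothetical, and the sketch assumes that the lower sequence number is filled first. Which direction the numbers run matters less than the fact that the two job types use opposite orders. Sequence numbers only influence dispatching when the scheduler is set to sort queues by sequence number rather than by load.

```python
# Hypothetical sketch: number the nodes in opposite directions for
# parallel and sequential queues, so the two job types meet in the middle.

def assign_seq_numbers(nodes):
    """Give each node one parallel and one sequential sequence number.

    Assumption: lower numbers are filled first, so parallel jobs start at
    the first node and sequential jobs start at the last node.
    """
    n = len(nodes)
    return {
        node: {"parallel": i, "sequential": n - 1 - i}
        for i, node in enumerate(nodes)
    }


def fill_order(seq, kind):
    """Return the nodes in the order in which jobs of this kind fill them."""
    return sorted(seq, key=lambda node: seq[node][kind])


seq = assign_seq_numbers(["node1", "node2", "node3", "node4"])
print(fill_order(seq, "parallel"))
print(fill_order(seq, "sequential"))
```

With four nodes, parallel jobs would start on node1 and sequential jobs on node4, so they only share a node (and trigger a suspension) once the cluster is largely full.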