[GE users] Problem filling up cores on a node in a PE using fill_up allocation

reuti reuti at staff.uni-marburg.de
Tue Feb 17 15:57:56 GMT 2009


On 17.02.2009 at 16:48, leonardz wrote:

> SGE 6.2u1
>
> The qmaster is installed on a Solaris 10 5/08 s10x_u5wos_10 x86
> Opteron system; execd is installed on SUSE Linux Enterprise Server 10
> SP2 (x86_64) dual-core Opteron nodes (4 cores per node).
>
>
> I am trying to have 2 parallel environments, as we have only a GigE
> network. So the goal for PE ompitest is:
>
> only allow parallel jobs which can fit on a single node. Using
> $pe_slots for allocation works - all tasks are scheduled on one node,
> and if more tasks are requested than there are cores on a node, the
> job does not get scheduled in this PE.
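
A submission against this PE would look something like the following
(the script name is just a placeholder):

   qsub -pe ompitest 4 -cwd myjob.sh

With $pe_slots all granted slots must come from a single host, so a
request for more than 4 slots can never be satisfied in this PE.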
>
> The goal for PE ompilargetest is to allocate all cores on a node  
> before allocating cores on the next node using
> fill_up for allocation : this will allow multi-node parallel jobs
>
> this does not work. It schedules all tasks to only one node and
> oversubscribes that node as long as $fill_up is used.
>
> If I want more tasks than cores on a node without oversubscription,
> I need to use $round_robin, which guarantees more stress on the
> network. I really want all cores on a node filled before tasks are
> scheduled on a different node.
>
> Is it possible with SGE 6.2u1 to pack nodes with tasks before
> allocating cores on the next node, without oversubscription?
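
With $fill_up one would indeed expect a request like

   qsub -pe ompilargetest 8 -cwd myjob.sh

(the script name is again just a placeholder) to be granted 4 slots on
one host and 4 on a second host, rather than 8 slots on a single
4-core node.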

There is an issue with $fill_up in 6.2u1:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2901

If it's not fixed in u2, I would even suggest raising the priority of
this issue.
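
To see what the scheduler actually granted, a small test job along the
following lines might help (the script content and sleep time are just
placeholders); the file named by $PE_HOSTFILE lists each granted host
together with the number of slots on it:

   #!/bin/sh
   #$ -pe ompilargetest 8
   #$ -cwd
   cat $PE_HOSTFILE
   sleep 60

It can be submitted with a plain qsub, since the PE request is taken
from the #$ lines; while it is running, qstat -g t shows on which
queue instances the MASTER and SLAVE tasks ended up.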

-- Reuti


> Details
>
> For most users we want to insist that each job can only be scheduled
> on a single node with 4 cores (out of the 16 slots in this test
> case) and does not run over the network:
> qconf -sp ompitest
> pe_name            ompitest
> slots              16
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
>
> The only queue this PE can run in is ompitest.q, which limits the
> number of slots to 4 per host (and hence, with $pe_slots, to 4 per
> job).
>
> qconf -sq ompitest.q
> qname                 ompitest.q
> hostlist                BLAH
> seq_no                1,[cn-r3-4=1],[cn-r3-5=2],[cn-r3-6=3],[cn-r3-7=4]
> load_thresholds       np_load_avg=4.5
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               ompitest
> rerun                 FALSE
> slots                 4
>
>
> This appears to work.
>
> For users who need more cores and whose processes do not communicate
> heavily, I want all cores on a node to be used before slots are
> allocated on another node:
>
> qconf -sp ompilargetest
> pe_name            ompilargetest
> slots              16
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
>
>
> and the only queue to use this PE:
> qconf -sq ompilargetest.q
> qname                 ompilargetest.q
> hostlist              BLAH
> seq_no                STUFF
> load_thresholds       np_load_avg=4.5
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               ompilargetest
> rerun                 FALSE
> slots                 8
>
