[GE users] Problem filling up cores on a node in a PE using fill_up allocation

leonardz leonardz at sickkids.ca
Tue Feb 17 15:48:54 GMT 2009


SGE 6.2u1

qmaster is installed on a Solaris 10 5/08 (s10x_u5wos_10, x86) Opteron system. execd is installed on SUSE Linux
Enterprise Server 10 SP2 (x86_64) dual-core Opteron nodes (4 cores per node).


I am trying to set up 2 parallel environments (PEs), as we have only a GigE network. The goal for PE ompitest is:

to allow only parallel jobs that fit on a single node. Using $pe_slots for allocation works: all tasks are scheduled
on one node, and if more tasks are requested than there are cores on a node, the job does not get scheduled in this PE.

The goal for PE ompilargetest is to allocate all cores on a node before allocating cores on the next node, using
$fill_up for allocation. This would allow multi-node parallel jobs.

This does not work. It schedules all tasks on only one node, and oversubscribes that node, as long as $fill_up is used.

If I want more tasks than cores on a node without oversubscription, I need to use $round_robin, which guarantees more
stress on the network. I really want all cores on a node filled before tasks are scheduled on a different node.

Is it possible with SGE 6.2u1 to pack a node with tasks before allocating cores on the next node, without
oversubscription?
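
To make the three allocation rules concrete, here is a minimal Python sketch of how I understand them from the sge_pe documentation. It is a toy model, not the scheduler's actual code: the function name allocate is made up for illustration, the host names are borrowed from the seq_no line of ompitest.q, and it assumes each host offers exactly 4 free slots and that hosts are considered in the order given.

```python
from typing import Dict


def allocate(rule: str, hosts: Dict[str, int], tasks: int) -> Dict[str, int]:
    """Toy model of a PE allocation_rule (hypothetical helper, not SGE code).

    hosts maps a host name to its free slots, in the order the scheduler
    is assumed to consider them. Returns {host: tasks placed}, or {} if
    the job cannot be scheduled under the given rule.
    """
    if tasks > sum(hosts.values()):
        return {}  # not enough free slots anywhere

    placed = {host: 0 for host in hosts}

    if rule == "$pe_slots":
        # All tasks must land on one host.
        for host, free in hosts.items():
            if free >= tasks:
                return {host: tasks}
        return {}

    if rule == "$fill_up":
        # Take every free slot on a host before moving to the next host.
        remaining = tasks
        for host, free in hosts.items():
            take = min(free, remaining)
            placed[host] = take
            remaining -= take
            if remaining == 0:
                break
    elif rule == "$round_robin":
        # Take one slot per host per pass until all tasks are placed.
        remaining = tasks
        while remaining:
            for host, free in hosts.items():
                if remaining and placed[host] < free:
                    placed[host] += 1
                    remaining -= 1
    else:
        raise ValueError(f"unknown allocation rule: {rule}")

    return {host: n for host, n in placed.items() if n}


# Assumed test cluster: 4 nodes with 4 free slots each.
nodes = {"cn-r3-4": 4, "cn-r3-5": 4, "cn-r3-6": 4, "cn-r3-7": 4}

for rule in ("$pe_slots", "$fill_up", "$round_robin"):
    print(rule, allocate(rule, nodes, 6))
```

For a 6-task job, the $fill_up placement (4 tasks on the first node, 2 on the second) is the behaviour I am after; $round_robin spreads the same job over all four nodes, and $pe_slots refuses it.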

Details

For most users we want to insist that, of all the slots in the PE (16 in the test case), each job is scheduled on a
single node with 4 cores and does not run over the network:
qconf -sp ompitest
pe_name            ompitest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The only queue this can run on is ompitest.q and it limits the number of slots to 4 per job.

qconf -sq ompitest.q
qname                 ompitest.q
hostlist              BLAH
seq_no                1,[cn-r3-4=1],[cn-r3-5=2],[cn-r3-6=3],[cn-r3-7=4]
load_thresholds       np_load_avg=4.5
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               ompitest
rerun                 FALSE
slots                 4


This appears to work.

For users who need more cores, and whose processes do not communicate heavily, I want all cores on a node to be used
before cores on another node are allocated:

qconf -sp ompilargetest
pe_name            ompilargetest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE


The only queue that uses this PE:
qconf -sq ompilargetest.q
qname                 ompilargetest.q
hostlist              BLAH
seq_no                STUFF
load_thresholds       np_load_avg=4.5
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               ompilargetest
rerun                 FALSE
slots                 8
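
One detail in the listing above may be worth checking: ompilargetest.q offers 8 slots, while the nodes have 4 cores each. The sketch below is only a toy model of $fill_up, under the assumption that the queue's slots value applies per host; the helper name fill_up is hypothetical and the host names are borrowed from ompitest.q for illustration. It compares the placement of an 8-task job with 8 and with 4 slots per host, and is a reading of the posted configuration rather than a confirmed diagnosis.

```python
from typing import Dict, List

CORES_PER_NODE = 4  # from the hardware description at the top of this post


def fill_up(slots_per_host: int, hosts: List[str], tasks: int) -> Dict[str, int]:
    """Toy $fill_up: use every slot on a host before moving to the next.

    Hypothetical helper, not SGE code. Returns {host: tasks placed},
    or {} if the hosts do not offer enough slots in total.
    """
    if tasks > slots_per_host * len(hosts):
        return {}
    placed: Dict[str, int] = {}
    remaining = tasks
    for host in hosts:
        if remaining == 0:
            break
        take = min(slots_per_host, remaining)
        placed[host] = take
        remaining -= take
    return placed


# Assumed host names; the real hostlist is elided in the listing above.
hosts = ["cn-r3-4", "cn-r3-5", "cn-r3-6", "cn-r3-7"]

for slots in (8, 4):
    placement = fill_up(slots, hosts, tasks=8)
    oversubscribed = [h for h, n in placement.items() if n > CORES_PER_NODE]
    print(f"slots={slots}: {placement} oversubscribed={oversubscribed}")
```

In this model, 8 slots per host puts all 8 tasks on the first node, which matches the oversubscription I see, while 4 slots per host spreads the job over two full nodes.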
