[GE users] Fixed allocation rule limit?

jcd jcducom at gmail.com
Fri Dec 11 17:39:41 GMT 2009


All-
I'm running SGE 6.2u1 on a RHEL 5.4 cluster. Our cluster nodes are 
dual-socket quad-core Nehalem machines, i.e. each machine has 8 slots from 
an SGE point of view.
I run into a problem when I submit a job that uses more than 6 cores per node.

Here is the queue I use:
qname                 wang
hostlist              @wang
seq_no                0
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               mpich1 mvapich2 ompi smp ompi-8way
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            crc wang
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

The mpich1 PE is defined as follows:
pe_name            mpich1
slots              800
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    8
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
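
For reference, start_proc_args hands $pe_hostfile to startmpi.sh, and the job script later reads $TMPDIR/machines. A minimal sketch of that kind of expansion is below. It assumes each pe_hostfile line has the form "host slots queue processor-range"; the function name and the sample lines are hypothetical and are not taken from startmpi.sh itself.

```python
from typing import List


def expand_pe_hostfile(text: str) -> List[str]:
    """Repeat each host once per granted slot (hypothetical helper).

    Assumed input format, one line per host:
        <host> <slots> <queue instance> <processor range>
    """
    machines: List[str] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        machines.extend([host] * slots)
    return machines


# Illustrative input only: two hosts with 8 slots each.
sample = """wang003 8 wang@wang003 UNDEFINED
wang006 8 wang@wang006 UNDEFINED
"""
machines = expand_pe_hostfile(sample)
print("\n".join(machines))
```

With two 8-slot hosts this yields 16 entries, one per MPI process.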


The submission script is:
#!/bin/csh
#$ -pe mpich1 16
module load mpich1/1.2.7p1-intel
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./cpi
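
The slot arithmetic behind this request can be sketched as follows. This is only an illustration, assuming that with an integer allocation_rule the requested slot count has to be a multiple of the rule; the function name is hypothetical.

```python
def hosts_for_fixed_rule(requested_slots: int, slots_per_host: int) -> int:
    """Number of hosts implied by a fixed integer allocation_rule.

    Assumption: the request must divide evenly by the rule, otherwise
    no valid placement exists.
    """
    if slots_per_host <= 0:
        raise ValueError("slots_per_host must be positive")
    hosts, remainder = divmod(requested_slots, slots_per_host)
    if remainder:
        raise ValueError(
            f"{requested_slots} slots is not a multiple of {slots_per_host}"
        )
    return hosts


# "-pe mpich1 16" with "allocation_rule 8"
print(hosts_for_fixed_rule(16, 8))
```

So a 16-slot request under allocation_rule 8 fills every slot on each of the two nodes it lands on.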


That job therefore requires 2 nodes. Here is the error I get:
wang003
wang003
wang003
wang003
wang003
wang003
wang003
wang003
wang006
wang006
wang006
wang006
wang006
wang006
wang006
wang006
rm_15125:  p4_error: interrupt SIGx: 13
rm_15125: (0.980992) net_send: could not write to fd=4, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=6, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=7, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=8, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=9, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=10, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=11, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=12, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=13, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=14, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=15, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=16, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=17, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=18, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=19, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=20, errno = 32
rm_15125: (0.980992) net_send: could not write to fd=5, errno = 32
p0_14797: (23.029760) net_send: could not write to fd=4, errno = 32
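
The numeric codes in this output can be decoded with a few lines of Python. This is a small sketch that assumes Linux numbering, as on the RHEL nodes. It only names the codes and does not explain why the connections were closed.

```python
import errno
import os
import signal

# Codes copied from the log above: "errno = 32" and "interrupt SIGx: 13".
errno_name = errno.errorcode.get(32)
errno_text = os.strerror(32)
signal_name = signal.Signals(13).name

print(errno_name, "-", errno_text)
print(signal_name)
```

On Linux these are EPIPE and SIGPIPE, i.e. the process wrote to a socket whose other end had already been closed.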

If I reduce the number of slots in the allocation_rule to 6 or fewer 
(the job then uses 12 processors), everything works fine.
Needless to say, the fill_up rule doesn't work either, as it tries to use 
all 8 cores.

So my question is: is 6 a magic number for a fixed allocation rule?

JC

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232822
