[GE users] Missing slots on one node

j_polasek j-polasek at tamu.edu
Mon Apr 12 23:32:22 BST 2010

Howdy all,

I am sure I am missing something simple, so if someone can point me in the right direction I would be very thankful.

The issue I am having is that node200 will only allocate two of its eight slots. For example, when an 8-slot job is submitted to the scheduler and starts on node200, it starts 2 processes on node200 and the remaining 6 processes on one of the other nodes. If the exact same job starts on any other node in the queue, all 8 processes are allocated to that single node.

I am running a heterogeneous cluster with SGE 6.1u4, and one of my queues has 16 nodes with 8 cores each (128 slots).

The parallel environments (fluent_pe, ib-openmpi, ib-mvapich) are set to 128 slots and use the $fill_up allocation rule.
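
To make the symptom concrete, here is a rough sketch of the fill-up idea: take every free slot on the first suitable host, then move on to the next host until the request is satisfied. This is a simplified model and not SGE's scheduler code. The function name fill_up and the free-slot counts are assumptions chosen only to mirror the 2 + 6 split described above.

```python
def fill_up(free_slots, requested):
    """Simplified fill-up allocation (illustrative model, not SGE code).

    free_slots: ordered list of (host, free_slot_count) pairs.
    requested:  number of slots the parallel job asks for.
    """
    allocation = {}
    remaining = requested
    for host, free in free_slots:
        if remaining == 0:
            break
        # Take as many slots as this host offers, up to what is still needed.
        take = min(free, remaining)
        if take > 0:
            allocation[host] = take
            remaining -= take
    if remaining > 0:
        raise ValueError("not enough free slots for the request")
    return allocation


# Hypothetical case: the scheduler believes node200 has only 2 usable slots.
print(fill_up([("node200", 2), ("node201", 8)], 8))

# Same request when the first host offers all 8 slots.
print(fill_up([("node201", 8), ("node202", 8)], 8))
```

In this model the job spills onto a second host only when the first host reports fewer free slots than requested, which is why the question comes down to what limits node200 to two slots.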

The output of qconf -sq I1.q shows:

qname                 I1.q
hostlist              node200.cluster.private node201.cluster.private \
                      node202.cluster.private node203.cluster.private \
                      node204.cluster.private node205.cluster.private \
                      node206.cluster.private node207.cluster.private \
                      node208.cluster.private node209.cluster.private \
                      node210.cluster.private node211.cluster.private \
                      node212.cluster.private node213.cluster.private \
                      node214.cluster.private node215.cluster.private
seq_no                0
load_thresholds       np_load_avg=1.25
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make fluent_pe ib-openmpi ib-mvapich
rerun                 FALSE
slots                 1,[node200.cluster.private=8], \
                      [node201.cluster.private=8],[node202.cluster.private=8], \
                      [node203.cluster.private=8],[node204.cluster.private=8], \
                      [node205.cluster.private=8],[node206.cluster.private=8], \
                      [node209.cluster.private=8],[node210.cluster.private=8], \
                      [node211.cluster.private=8],[node212.cluster.private=8], \
                      [node213.cluster.private=8],[node214.cluster.private=8], \
                      [node215.cluster.private=8],[node207.cluster.private=8], \
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            a0d-me
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
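
In a queue's slots attribute, the leading number is the default for every host in the queue and each bracketed [host=N] entry overrides it for that host. The sketch below parses such a value and computes the effective slot count per host, which is a quick way to spot a host that silently falls back to the default. The helper names parse_slots and effective_slots are hypothetical, and the sample string is a shortened stand-in (nodeX.cluster.private is made up), not the full listing above.

```python
import re


def parse_slots(slots_value):
    """Return (default, {host: slots}) from a queue 'slots' attribute value."""
    # Join continuation lines by dropping backslashes and all whitespace.
    text = re.sub(r"[\\\s]+", "", slots_value)
    # Bracketed entries such as [node200.cluster.private=8] are per-host overrides.
    overrides = {
        host: int(count)
        for host, count in re.findall(r"\[([^=\]]+)=(\d+)\]", text)
    }
    # The leading number is the default for hosts without an override.
    default = int(re.match(r"\d+", text).group())
    return default, overrides


def effective_slots(hostlist, default, overrides):
    """Map every host in the queue to the slot count that applies to it."""
    return {host: overrides.get(host, default) for host in hostlist}


# Shortened, partly hypothetical sample in the same format as qconf -sq output.
sample = "1,[node200.cluster.private=8], \\\n    [node201.cluster.private=8]"
hosts = [
    "node200.cluster.private",
    "node201.cluster.private",
    "nodeX.cluster.private",  # hypothetical host with no override
]

default, overrides = parse_slots(sample)
for host, slots in effective_slots(hosts, default, overrides).items():
    print(host, slots)
```

A host that appears in hostlist but has no bracketed entry ends up with the default value, so comparing the two lists is a cheap first check.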

The output of qstat -g c shows:

I1.q                              0.60     72     56    128      0      0 

Node200 is the only node acting this way.  I have not been able to find any setting that would cause this. Any ideas?



Jeff Polasek
Computer Systems Manager
Artie McFerrin Chemical Engineering Department
Texas A&M University
j-polasek at tamu.edu

