[GE users] Missing slots on one node

reuti reuti at staff.uni-marburg.de
Mon Apr 12 23:35:24 BST 2010


Hi,

On 13.04.2010, at 00:32, j_polasek wrote:

> Howdy all,
>
> I am sure I am missing something simple, so if someone can point me in
> the right direction I would be very thankful.
>
> The issue I am having is that node200 will only allocate two of its
> eight slots. For example, when an 8-slot job is submitted to the
> scheduler and it starts on node200, it starts 2 processes on node200
> and the remaining 6 processes on one of the other nodes. If the exact
> same job starts on any other node in the queue, all 8 processes are
> allocated to a single node.
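>
> A typical submission looks roughly like this (the script name is just
> a placeholder):
>
>    qsub -q I1.q -pe fluent_pe 8 run_case.sh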
>
> I am running a heterogeneous cluster with SGE 6.1u4, and one of my
> queues has 16 nodes with 8 cores each (128 slots).

Any load_adjustments set, or anything unusual in the exechost definition (qconf -se node200)?
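
Just to illustrate what I mean (made-up output; the slots=2 is only an example), a host-level limit in complex_values would look like this and would cap node200 regardless of the queue's slots entry:

   $ qconf -se node200
   hostname              node200.cluster.private
   load_scaling          NONE
   complex_values        slots=2
   (remaining fields omitted)

The scheduler-wide job_load_adjustments (qconf -ssconf) together with your np_load_avg=1.25 load threshold could have a similar effect once the first tasks land on the node.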

-- Reuti


> The parallel environments (fluent_pe, ib-openmpi, ib-mvapich) are set
> to 128 slots and use the $fill_up allocation rule.
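>
> A $fill_up PE definition looks roughly like this (illustrative values,
> not my exact settings):
>
>    $ qconf -sp fluent_pe
>    pe_name           fluent_pe
>    slots             128
>    user_lists        NONE
>    xuser_lists       NONE
>    start_proc_args   /bin/true
>    stop_proc_args    /bin/true
>    allocation_rule   $fill_up
>    control_slaves    TRUE
>    job_is_first_task FALSE
>    urgency_slots     min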
>
> The qconf -sq I1.q shows
>
> qname                 I1.q
> hostlist              node200.cluster.private node201.cluster.private \
>                      node202.cluster.private node203.cluster.private \
>                      node204.cluster.private node205.cluster.private \
>                      node206.cluster.private node207.cluster.private \
>                      node208.cluster.private node209.cluster.private \
>                      node210.cluster.private node211.cluster.private \
>                      node212.cluster.private node213.cluster.private \
>                      node214.cluster.private node215.cluster.private
> seq_no                0
> load_thresholds       np_load_avg=1.25
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make fluent_pe ib-openmpi ib-mvapich
> rerun                 FALSE
> slots                 1,[node200.cluster.private=8], \
>                       [node201.cluster.private=8],[node202.cluster.private=8], \
>                       [node203.cluster.private=8],[node204.cluster.private=8], \
>                       [node205.cluster.private=8],[node206.cluster.private=8], \
>                       [node209.cluster.private=8],[node210.cluster.private=8], \
>                       [node211.cluster.private=8],[node212.cluster.private=8], \
>                       [node213.cluster.private=8],[node214.cluster.private=8], \
>                       [node215.cluster.private=8],[node207.cluster.private=8], \
>                       [node208.cluster.private=8]
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            a0d-me
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
>
>
> qstat -g c shows
>
> I1.q                              0.60     72     56    128      0      0
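>
> To see how the slots are spread per host I also check the full queue
> listing (command only here, I can post the real output if it helps):
>
>    qstat -f -q I1.q
>
> which shows every queue instance with its used/total slot count, so
> node200 appears there as x/8.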
>
>
> Node200 is the only node acting this way.  I have not been able to  
> find any setting that would cause this. Any ideas?
>
> Thanks
>
> Jeff
>
>
>
> Jeff Polasek
> Computer Systems Manager
> Artie McFerrin Chemical Engineering Department
> Texas A&M University
> 979-845-3398
> j-polasek at tamu.edu
>
