[GE users] GE6.2 num_proc consumable strange behavior for mpi $fillup jobs

jlopez jlopez at cesga.es
Mon Feb 2 09:43:44 GMT 2009


Hi Reuti,

>> With this configuration I am able to run 16 mpi jobs of num_proc=16 in
>> the same node. Did I miss something?
>
> How do you submit the parallel job: qsub -pe mpi 16 ...? Then the
> slot count should be done automatically.
>
> -- Reuti

I have just run the tests with all the settings you suggested on another
clean test system (just in case there is something "strange" in the other
testing environment), and the slots are not taken into account correctly:

jlopez@dn001:~> qconf -srqs
{
   name         limit_slots_using_num_proc
   description  Limit the slots per node using the value of num_proc
   enabled      TRUE
   limit        name nodelimit hosts {*} to slots=$num_proc
}
jlopez@dn001:~> qconf -sc|grep num_proc
num_proc            p          INT         <=    YES         YES        0        0
jlopez@dn001:~> qconf -se dn001
hostname              dn001.null
load_scaling          NONE
complex_values        NONE
load_values           arch=lx24-ia64,num_proc=4,mem_total=3895.093750M, \
                      swap_total=1027.562500M,virtual_total=4922.656250M, \
                      load_avg=0.020000,load_short=0.020000, \
                      load_medium=0.020000,load_long=0.000000, \
                      mem_free=3047.515625M,swap_free=1023.875000M, \
                      virtual_free=4071.390625M,mem_used=847.578125M, \
                      swap_used=3.687500M,virtual_used=851.265625M, \
                      cpu=0.700000,np_load_avg=0.005000, \
                      np_load_short=0.005000,np_load_medium=0.005000, \
                      np_load_long=0.000000
processors            4
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
jlopez@dn001:~> qconf -sp mpi
pe_name            mpi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/cesga/sge62/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/cesga/sge62/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
jlopez@dn001:~> qconf -sq sistemas
qname                 sistemas
hostlist              dn001.null dn002.null
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpi mpi_rr
rerun                 FALSE
slots                 4
tmpdir                /scratch
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            sistemas
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY

And here you can see that the total number of processors in a node is
not taken into account correctly:

jlopez@dn001:~> qsub -w v -l num_proc=4,s_vmem=512M,h_fsize=1G,s_rt=300 \
    -pe mpi 4 -q 'sistemas@dn001' test.sh
verification: found suitable queue(s)

But node dn001 has only num_proc=4.
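
To make the expected accounting explicit, here is a minimal sketch of the
arithmetic. It assumes that, because num_proc is flagged as consumable, the
-l num_proc=4 request is charged once per granted slot. The variable names
are only illustrative, and the load_values string is shortened from the
qconf -se output above:

```shell
#!/bin/sh
# Sample text copied (shortened) from the "qconf -se dn001" output above;
# on a live system it would come from the command itself.
load_values='arch=lx24-ia64,num_proc=4,mem_total=3895.093750M'

# Extract the num_proc value reported by the host.
host_num_proc=$(printf '%s\n' "$load_values" | tr ',' '\n' | sed -n 's/^num_proc=//p')

# Values from the qsub line: -l num_proc=4 -pe mpi 4
requested_per_slot=4
pe_slots=4

# Assumption: a consumable request is multiplied by the number of slots.
total_request=$((requested_per_slot * pe_slots))

if [ "$total_request" -le "$host_num_proc" ]; then
    fits=yes
else
    fits=no
fi

echo "host num_proc=$host_num_proc, job needs $total_request, fits=$fits"
```

With these numbers the job would need 16 and the host offers 4, so I would
expect the -w v verification not to find a suitable queue on dn001.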

This problem only happens in GE6.2; our configuration works in previous
versions of GE.

Cheers,
Javier

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101302