[GE users] GE6.2 num_proc consumable strange behavior for mpi $fillup jobs

reuti reuti at staff.uni-marburg.de
Mon Feb 2 11:05:26 GMT 2009


On 02.02.2009, at 10:43, jlopez wrote:

> Hi Reuti,
>>>
>>> With this configuration I am able to run 16 mpi jobs of num_proc=16
>>> in the same node. Did I miss something?
>>>
>>
>> How do you submit the parallel job: qsub -pe mpi 16 ...? Then the
>> slot count should be done automatically.
>>
>> -- Reuti
>>
>>
>>
> I have just done the tests with all the settings you suggested in another
> clean test system (just in case there is something "strange" in the other
> testing environment), and the slots are not correctly taken into account:
>
> jlopez at dn001:~> qconf -srqs
> {
>    name         limit_slots_using_num_proc
>    description  Limit the slots per node using the num_proc value
>    enabled      TRUE
>    limit        name nodelimit hosts {*} to slots=$num_proc
> }
> jlopez at dn001:~> qconf -sc|grep num_proc
> num_proc            p          INT         <=    YES         YES         0        0

This is an oversight on my part, sorry. The consumable column should be NO,
i.e. "YES NO" for requestable/consumable: num_proc is a fixed feature of a
node, so there is nothing to consume.
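
For reference, the entry would then look like this (just a sketch, keeping
the default and urgency columns from your output; it can be edited with
qconf -mc):

num_proc            p          INT         <=    YES         NO          0        0

The RQS itself should be able to stay as it is: a dynamic limit like
slots=$num_proc only reads the value the host reports, so the complex does
not have to be consumable for it.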

> jlopez at dn001:~> qconf -se dn001
> hostname              dn001.null
> load_scaling          NONE
> complex_values        NONE
> load_values           arch=lx24-ia64,num_proc=4,mem_total=3895.093750M, \
>                       swap_total=1027.562500M,virtual_total=4922.656250M, \
>                       load_avg=0.020000,load_short=0.020000, \
>                       load_medium=0.020000,load_long=0.000000, \
>                       mem_free=3047.515625M,swap_free=1023.875000M, \
>                       virtual_free=4071.390625M,mem_used=847.578125M, \
>                       swap_used=3.687500M,virtual_used=851.265625M, \
>                       cpu=0.700000,np_load_avg=0.005000, \
>                       np_load_short=0.005000,np_load_medium=0.005000, \
>                       np_load_long=0.000000
> processors            4
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
> jlopez at dn001:~> qconf -sp mpi
> pe_name            mpi
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/cesga/sge62/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/cesga/sge62/mpi/stopmpi.sh
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> jlopez at dn001:~> qconf -sq sistemas
> qname                 sistemas
> hostlist              dn001.null dn002.null
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi mpi_rr
> rerun                 FALSE
> slots                 4
> tmpdir                /scratch
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            sistemas
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
>
> And here you can see that it doesn't correctly take into account the total
> number of processors in one node:
>
> jlopez at dn001:~> qsub -w v -l num_proc=4,s_vmem=512M,h_fsize=1G,s_rt=300 -pe mpi 4 -q 'sistemas@dn001' test.sh
> verification: found suitable queue(s)

Don't request num_proc here:

$ qsub -w v -l num_proc=4,s_vmem=512M,h_fsize=1G,s_rt=300 -pe mpi 4 -q 'sistemas@dn001' test.sh
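
That is, something like this should already be enough (a sketch; only the
num_proc request is dropped, everything else is kept from your command):

$ qsub -w v -l s_vmem=512M,h_fsize=1G,s_rt=300 -pe mpi 4 -q 'sistemas@dn001' test.sh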

What behavior do you expect? You request 4 slots and there are 4 slots.
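
Just to illustrate what I would expect in your clean setup (a sketch,
assuming dn001 and dn002 both report num_proc=4 and are otherwise empty):

$ qsub -l s_vmem=512M,h_fsize=1G,s_rt=300 -pe mpi 8 -q sistemas test.sh
$ qquota -u '*'

With $fill_up the 8-slot job should get 4 slots on dn001 and 4 on dn002,
and qquota should then report the nodelimit rule at slots=4/4 for each of
the two hosts.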

> But node dn001 has only num_proc=4.

You mean you expect num_proc to be multiplied by 4?

-- Reuti


> This problem only happens in GE 6.2; the same configuration works in
> previous versions of GE.
>
> Cheers,
> Javier
>
