[GE users] Fixed allocation rule limit?

reuti reuti at staff.uni-marburg.de
Tue Dec 22 10:58:12 GMT 2009


Hi,

On 21.12.2009 at 20:31, jcd wrote:

> Reuti-
> As the tight integration didn't work with our Kerberos/AFS setup,
> I came up with the following solution:
> PeHostfile2MachineFileproc()
> {
>     cat $1 | while read line; do
>        # echo $line
>        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>        nslots=`echo $line|cut -f2 -d" "`
>        echo $host":"$nslots
>     done
> }
> Then mpi will start only one server via xinetd. It seems to work fine.
> Comments?

yes, this is ok. As you are using MPICH1, I assume you compiled it
with -comm=shared, and your applications as well?
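For illustration only (not part of the original mails): the stand-alone test
below shows what the modified function produces. The hostnames, slot counts
and the trailing pe_hostfile columns are invented; a real $pe_hostfile line
looks like "host slots queue processor-range", and the cut calls simply
ignore everything after the second field.

# sketch of the modified PeHostfile2MachineFileproc, run outside SGE
PeHostfile2MachineFileproc()
{
   cat $1 | while read line; do
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      echo $host":"$nslots
   done
}

# hypothetical pe_hostfile as SGE would hand it to start_proc_args
cat > /tmp/pe_hostfile.test <<EOF
wang003.example.com 8 wang@wang003 <NULL>
wang006.example.com 8 wang@wang006 <NULL>
EOF

PeHostfile2MachineFileproc /tmp/pe_hostfile.test
# prints a ch_p4 style machinefile with one entry per host:
# wang003:8
# wang006:8

With a single host:count entry per node, mpirun contacts each node only
once, which matches the "only one server via xinetd" behaviour described
above.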

For Kerberos+AFS there are also special entries in SGE's configuration
to get tokens. But as I have never used them, I can't make any statement
about it. I just noticed that these entries are even deprecated - will
there be any replacement?

-- Reuti


> JC
>
>
>
>
> Jean-Christophe Ducom wrote:
>> Thanks Reuti.
>> We are using Kerberos+AFS (I used the old AFS structure in SGE to pass
>> the Kerberos ticket on the master node; I could digress for a long
>> time). I didn't get any success with tight integration last time I
>> tried... I'll give it another shot.
>>
>> JC
>>
>>
>> On Fri, Dec 11, 2009 at 5:52 PM, reuti <reuti at staff.uni-marburg.de> wrote:
>>
>>     Hi,
>>
>>     On 11.12.2009 at 18:39, jcd wrote:
>>
>>> All-
>>> I'm running SGE 6.2u1 on a RHEL 5.4 cluster. Our cluster nodes are 2
>>> dual quad-core Nehalem machines, i.e. each machine has 8 slots from
>>> an SGE point of view.
>>> I'm having an issue when I submit a job that uses more than 6 cores
>>> per node.
>>>
>>> Here is the queue I use:
>>> qname                 wang
>>> hostlist              @wang
>>> seq_no                0
>>> load_thresholds       NONE
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              0
>>> min_cpu_interval      00:05:00
>>> processors            UNDEFINED
>>> qtype                 BATCH
>>> ckpt_list             NONE
>>> pe_list               mpich1 mvapich2 ompi smp ompi-8way
>>> rerun                 FALSE
>>> slots                 8
>>> tmpdir                /tmp
>>> shell                 /bin/csh
>>> prolog                NONE
>>> epilog                NONE
>>> shell_start_mode      unix_behavior
>>> starter_method        NONE
>>> suspend_method        NONE
>>> resume_method         NONE
>>> terminate_method      NONE
>>> notify                00:00:60
>>> owner_list            NONE
>>> user_lists            crc wang
>>> xuser_lists           NONE
>>> subordinate_list      NONE
>>> complex_values        NONE
>>> projects              NONE
>>> xprojects             NONE
>>> calendar              NONE
>>> initial_state         default
>>> s_rt                  INFINITY
>>> h_rt                  INFINITY
>>> s_cpu                 INFINITY
>>> h_cpu                 INFINITY
>>> s_fsize               INFINITY
>>> h_fsize               INFINITY
>>> s_data                INFINITY
>>> h_data                INFINITY
>>> s_stack               INFINITY
>>> h_stack               INFINITY
>>> s_core                INFINITY
>>> h_core                INFINITY
>>> s_rss                 INFINITY
>>> h_rss                 INFINITY
>>> s_vmem                INFINITY
>>> h_vmem                INFINITY
>>>
>>> and the mpich1 PE is defined as follows:
>>> pe_name            mpich1
>>> slots              800
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /opt/sge/mpi/startmpi.sh $pe_hostfile
>>> stop_proc_args     /opt/sge/mpi/stopmpi.sh
>>> allocation_rule    8
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>>
>>>
>>> The submission script is:
>>> #!/bin/csh
>>> #$ -pe mpich1 16
>>> module load mpich1/1.2.7p1-intel
>>> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./cpi
>>>
>>>
>>> That job then requires 2 nodes. Here is the error I get:
>>> wang003
>>> wang003
>>> wang003
>>> wang003
>>> wang003
>>> wang003
>>> wang003
>>> wang003
>>> wang006
>>> wang006
>>> wang006
>>> wang006
>>> wang006
>>> wang006
>>> wang006
>>> wang006
>>> rm_15125:  p4_error: interrupt SIGx: 13
>>> rm_15125: (0.980992) net_send: could not write to fd=4, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=6, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=7, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=8, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=9, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=10, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=11, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=12, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=13, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=14, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=15, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=16, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=17, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=18, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=19, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=20, errno = 32
>>> rm_15125: (0.980992) net_send: could not write to fd=5, errno = 32
>>> p0_14797: (23.029760) net_send: could not write to fd=4, errno = 32
>>>
>>> If I reduce the allocation_rule to something <= 6 (the job then
>>> uses 12 processors), everything works fine.
>>> Needless to say, the fill_up rule doesn't work either, as it tries
>>> to use all 8 cores.
>>>
>>> So my question is: is 6 a magic number for a fixed allocation rule?
>>
>>     no, there is no such limit. What I see in the PE definition is
>>     that you use only a Loose Integration of your job into SGE. Hence
>>     all rsh or ssh requests will arrive at once (more or less) at the
>>     daemon running there. With a Tight Integration each and every rsh
>>     call will get a daemon of its own (or will just use the -builtin-
>>     method).
>>
>>     Can you try to set up a Tight Integration of your job? There are
>>     hints in $SGE_ROOT/mpi and a Howto:
>>     http://gridengine.sunsource.net/howto/mpich-integration.html
>>
>>     -- Reuti
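For reference, a tightly integrated variant of the mpich1 PE along the
lines of the $SGE_ROOT/mpi examples and the howto quoted above might look
roughly like this. It is only a sketch: the paths, the PE name and the
-catch_rsh handling have to match the local startmpi.sh/stopmpi.sh scripts.

pe_name            mpich1
slots              800
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    8
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

The essential change is control_slaves TRUE together with -catch_rsh, so
that each rsh started by mpirun goes through SGE and gets its own shepherd
instead of all connections hitting the node's rsh/xinetd daemon at once,
which is what Reuti describes above. The existing PE can be inspected with
qconf -sp mpich1 and edited with qconf -mp mpich1.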
>>
>>
>>> JC
>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234585

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list