[GE users] Rmpi under SGE

reuti reuti at staff.uni-marburg.de
Tue Dec 21 12:56:44 GMT 2010


On 17.12.2010 at 18:12, arnuschky wrote:

> On Fri, 2010-12-17 at 15:02 +0100, reuti wrote:
>> On 17.12.2010 at 12:58, arnuschky wrote:
>> 
>>> Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
>>> still fail (even with Reuti's fixes):
>>> 
>>>       $ cat test-mpi-17942.e3480568
>>>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>> 
>> You are now using the plain -builtin- startup method? Does it happen on all hosts for such a job?
> 
> Yes, here's my current config:
> 
> <snip>
>        qlogin_command               builtin
>        qlogin_daemon                builtin
>        rlogin_command               builtin
>        rlogin_daemon                builtin
>        rsh_command                  builtin
>        rsh_daemon                   builtin

Fine.
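As a quick sanity check of the builtin method itself (node01 is just a placeholder for one of your execution hosts), a plain interactive qrsh should return immediately:

    $ qrsh -l hostname=node01 hostname
    node01

If that already hangs or times out on some host, the problem is in the qrsh communication layer and not in Rmpi.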

>        reprioritize                 0
>        jsv_url                      none
>        jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> 
>        $ qconf -sp mpich_fu
>        pe_name            mpich_fu
>        slots              128
>        user_lists         NONE
>        xuser_lists        NONE
>        start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>        stop_proc_args     /opt/gridengine/mpi/stopmpi.sh

Aren't you using Open MPI? Then these two entries can be set to NONE.
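A sketch of the changed lines (via `qconf -mp mpich_fu`; everything else in the PE can stay as it is):

    start_proc_args    NONE
    stop_proc_args     NONE

The startmpi.sh -catch_rsh machinery is only needed for the old MPICH-style rsh startup; with tight integration Open MPI contacts the slave nodes itself via `qrsh -inherit`.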

Open MPI on its own is working fine with more than 20 nodes? How is Open MPI called by R - just a plain `mpiexec`, or with any special arguments?
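To take Rmpi out of the picture, you could first submit a trivial Open MPI job under the same PE with more than 20 slots, something like this sketch (24 is just an example count above your failure threshold):

    $ cat test-ompi.sh
    #!/bin/sh
    #$ -S /bin/sh
    #$ -cwd
    #$ -pe mpich_fu 24
    mpiexec -n $NSLOTS hostname

    $ qsub test-ompi.sh

With tight integration Open MPI gets the granted slots from SGE on its own, so a plain `mpiexec hostname` without -n would do as well. If this job already shows the timeouts, Rmpi is not the culprit.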

-- Reuti


>        allocation_rule    $fill_up
>        control_slaves     TRUE
>        job_is_first_task  FALSE
>        urgency_slots      min
>        accounting_summary FALSE
> 
> 
>> Maybe it's something special on some nodes, and for some hosts it would happen with fewer slots too.
> 
> I don't think that the nodes are different. I reinstalled all of them
> yesterday. I tested on 2 different generations of nodes separately (2x2
> cores and 2x4 cores per node). The problem just seems to become more
> likely the more slots (and thus nodes) I use. But within one generation
> the nodes are identical, all connected to a single switch.
> 
> Cheers,
> Arne



