[GE users] OpenMPI job stays on one node [Solved]

reuti reuti at staff.uni-marburg.de
Mon Sep 7 15:57:27 BST 2009



On 07.09.2009 at 16:19, sgexav wrote:

> reuti wrote:
>> On 07.09.2009 at 16:11, sgexav wrote:
>>
>>
>>> So, to summarize:
>>>
>>> Compile Open MPI with the --with-sge option.
>>> Then enable qrsh via ssh:
>>>
>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>
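For reference, the Open MPI build step from the summary above would
look roughly like this (the prefix is only a placeholder, adjust it
to your installation):

$ ./configure --prefix=/opt/openmpi --with-sge
$ make
$ make install
$ ompi_info | grep gridengine   # should list the gridengine components
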
>>
>> Great!
>>
>> But be aware that using the default SSH you will get wrong
>> accounting. If you want (or must) use SSH instead of any builtin/rsh
>> method, you would also need to recompile SGE with the option
>> "-tight-ssh" to get a custom SSH setup which will honor the
>> accounting.
>>
>> -- Reuti
>>
> Ah,
> I am not sure I want to recompile SGE, since it comes with Rocks, and
> Rocks is far from very stable.
> Do you advise me to enable rsh in Rocks?

No, not as a first approach.

If the builtin method is not working, maybe there are firewalls on
all machines blocking the communication on the random port SGE
selects for each started daemon? Using the new builtin method would
be preferable.

Do you have a private network to the nodes, and can you adjust the
firewall to be active only towards the external world?
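
If the nodes hang off a dedicated private interface, one common
approach is to accept everything arriving on that interface and to
keep the filtering only on the public side. A rough iptables sketch,
run as root on each node (eth0 = external and eth1 = private are
only example names, adjust to your setup):

# trust loopback and the private cluster network
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -i eth1 -j ACCEPT
# on the external interface allow replies and SSH, drop the rest
iptables -A INPUT -i eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -i eth0 -j DROP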

-- Reuti


> Xavier
>>> It works!!!!
>>> Thanks
>>> Xavier
>>>
>>> reuti wrote:
>>>
>>>> On 07.09.2009 at 15:18, sgexav wrote:
>>>>
>>>>
>>>>
>>>>> reuti wrote:
>>>>>
>>>>>
>>>>>> On 07.09.2009 at 13:32, sgexav wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> <snip>
>>>>>>>> as Lydia wrote: you don't need this argument, just leave the
>>>>>>>> option -machinefile ... out. Open MPI will detect the granted
>>>>>>>> nodes on its own from the original pe_hostfile. The
>>>>>>>> $TMPDIR/machines file would be created by the start_proc_args
>>>>>>>> for other MPI libraries, but it can be left out here, as the
>>>>>>>> file won't be created.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
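For illustration, a job script for this setup can then be as small as
the following sketch (job name, slot count and program name are only
placeholders):

#!/bin/sh
#$ -N mpi_test
#$ -cwd
#$ -pe orte 8
# no -machinefile/-hostfile: an Open MPI built --with-sge picks up
# the granted slots from the pe_hostfile on its own
mpirun -np $NSLOTS ./my_mpi_program
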
>>>>>>> OK, doing it that way with "pe orte" and without -machinefile
>>>>>>> in the mpirun command, I see my run starting on the nodes, but
>>>>>>> I get this error:
>>>>>>> error: error: ending connection before all data received
>>>>>>> error:
>>>>>>> <snip>
>>>>>>> What does it mean?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Did you redefine these settings (here from the 6.2u3 setup with
>>>>>> the builtin method; in former versions it was different):
>>>>>>
>>>>>> $ qconf -sconf
>>>>>> #global:
>>>>>> ...
>>>>>> qlogin_command               builtin
>>>>>> qlogin_daemon                builtin
>>>>>> rlogin_command               builtin
>>>>>> rlogin_daemon                builtin
>>>>>> rsh_command                  builtin
>>>>>> rsh_daemon                   builtin
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I am using 6.2u2 delivered with Rocks 5.2.
>>>>> qconf -sconf gave:
>>>>>
>>>>> qlogin_command               builtin
>>>>> qlogin_daemon                builtin
>>>>> rlogin_command               builtin
>>>>> rlogin_daemon                builtin
>>>>> rsh_command                  builtin
>>>>> rsh_daemon                   builtin
>>>>> but also:
>>>>> qrsh_command                 /usr/bin/ssh
>>>>>
>>>>>
>>>> AFAIK there are no "qrsh_..." entries at all.
>>>>
>>>>
>>>>
>>>>> rsh_command                  /usr/bin/ssh
>>>>> rlogin_command               /usr/bin/ssh
>>>>>
>>>>>
>>>> Having only these last three set is not sufficient for an SSH
>>>> integration. And unless SGE is compiled with a special flag, it's
>>>> not a Tight Integration anyway. I don't know why ROCKS includes
>>>> these settings. If you want to go for SSH, you would need:
>>>>
>>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>>
>>>> Did you use "qconf -mconf", and the last lines are always added
>>>> again? Is there any local configuration for each node, i.e. does
>>>> "qconf -sconfl" have entries? If you have a uniform cluster, you
>>>> can delete them all.
>>>>
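The check and clean-up mentioned above would look roughly like this
(node01 is only an example hostname):

$ qconf -sconfl        # list the hosts that carry a local configuration
$ qconf -sconf node01  # inspect the local configuration of one host
$ qconf -dconf node01  # delete it, so only the global configuration applies
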
>>>> ===
>>>>
>>>> To your second email: "builtin" is a new mechanism, which doesn't
>>>> need any rsh or ssh.
>>>>
>>>> ===
>>>>
>>>> You can have a cluster without active rsh and ssh, but still run
>>>> parallel apps by SGE's "builtin" method or the former
>>>> rsh-replacement. Even for (interactive) qlogin and rlogin, the
>>>> telnetd and rshd only have to be installed; they don't need to be
>>>> activated in /etc/xinetd.d/rsh or .../telnet. It is still a Tight
>>>> Integration without the option to be bypassed by the user, as for
>>>> each command a dedicated daemon for the login will be launched.
>>>>
>>>> -- Reuti
>>>>
