[GE users] qrsh connection fails

reuti reuti at staff.uni-marburg.de
Thu Nov 27 23:13:24 GMT 2008


Hi,

On 26.11.2008 at 21:49, nw.gridengine.7901 at family-wilson.me.uk wrote:

> I have a problem when running OpenMPI jobs on lots of nodes which I  
> hope you can help with.
>
> It is using qrsh to connect to the remote nodes and I see this  
> error message:
>
> ssh_exchange_identification: Connection closed by remote host
> [compute-0-16.local:17387] ERROR: A daemon on node compute-1-3 failed to start as expected.
> [compute-0-16.local:17387] ERROR: There may be more information available from
> [compute-0-16.local:17387] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [compute-0-16.local:17387] ERROR: If the problem persists, please restart the
> [compute-0-16.local:17387] ERROR: Grid Engine PE job

To me this does not look like an SGE problem, but rather like an issue
with the SSH daemon on the node. You are using SSH; was the default rsh
not sufficient for you?

Do you find anything in /var/log/warn, /var/log/messages and the like?

Is it happening on random hosts or always on the same one? The latter
would point to a setup problem on that particular node.
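
To answer that quickly, you could tally the node names from the error
lines of a few failed runs. A minimal sketch, assuming each error is
written on one line in the format quoted above; the function name and the
file name in the usage comment are hypothetical:

```shell
#!/bin/bash
# Count how often each node shows up in
# "A daemon on node <name> failed to start" lines read from stdin.
count_failed_nodes() {
    sed -n 's/.*A daemon on node \([^ ]*\) failed to start.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (hypothetical file name for the job's error output):
#   count_failed_nodes < myjob.e12345
```

If one node dominates the list, I would look at that node's setup first;
an even spread hints more at a load or connection limit.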

One option could be to increase "MaxStartups" in sshd_config on the  
machines.
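
MaxStartups limits the number of concurrent unauthenticated connections
to sshd; connections beyond the limit are dropped. A small sketch to check
what a node currently has set, assuming the configuration lives in
/etc/ssh/sshd_config (the function name is made up, and the value in the
comment is only an illustration, not a tuned recommendation):

```shell
#!/bin/bash
# Print the active (not commented-out) MaxStartups lines of an
# sshd_config-style file. No output means the built-in default applies.
show_max_startups() {
    local cfg="${1:-/etc/ssh/sshd_config}"   # assumed default location
    grep -iE '^[[:space:]]*MaxStartups[[:space:]]' "$cfg"
}

# Usage:
#   show_max_startups
# To raise the limit, a line like the following could be added
# (illustrative value only):
#   MaxStartups 100
# sshd has to be restarted or reloaded afterwards for it to take effect.
```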


> It works OK for a small number of nodes but fails when I try on 32  
> nodes.
>
> I got a similar error message when I did this test:
>
> #!/bin/bash
> #$ -S /bin/bash
> #$ -cwd
> #$ -pe mpich 44
> for h in $(<$TMPDIR/machines) ; do
> /opt/gridengine/bin/lx26-amd64/qrsh -inherit -noshell -nostdin -V $h hostname &
> done
> wait
>
> so it appears to be a qrsh issue rather than an OpenMPI issue. Are  
> there any settings I can change to fix this?
>
> The version of GridEngine is sge-V60u8-1 (installed as part of  
> Rocks 4.2.1).

You could upgrade to 6.2 and try the new builtin startup method.
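
To see which startup method a cluster currently uses, the remote startup
entries of the global configuration can be listed. A sketch, assuming the
usual `qconf -sconf` output with one "name value" pair per line; as far as
I recall, 6.2 shows the keyword `builtin` for these entries when the new
method is active. The function name is made up:

```shell
#!/bin/bash
# Keep only the rsh/rlogin/qlogin command and daemon entries
# from configuration output read on stdin.
startup_settings() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Usage:
#   qconf -sconf | startup_settings
```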

-- Reuti


> Thanks,
> Nick Wilson
> ----
> Environment and Health Research Division
> Fujitsu Laboratories of Europe

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90161