[GE users] qrsh connection fails

Nick Wilson nw.gridengine.7901 at family-wilson.me.uk
Fri Nov 28 17:27:04 GMT 2008


Hi,

Thanks for your help.

>On 26.11.2008 at 21:49, nw.gridengine.7901 at family-wilson.me.uk wrote:
>
>> I have a problem when running OpenMPI jobs on lots of nodes which I
>> hope you can help with.
>>
>> It is using qrsh to connect to the remote nodes and I see this
>> error message:
>>
>> ssh_exchange_identification: Connection closed by remote host
>> [compute-0-16.local:17387] ERROR: A daemon on node compute-1-3 failed to start as expected.
>> [compute-0-16.local:17387] ERROR: There may be more information available from
>> [compute-0-16.local:17387] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [compute-0-16.local:17387] ERROR: If the problem persists, please restart the
>> [compute-0-16.local:17387] ERROR: Grid Engine PE job
>
>To me this doesn't look like an SGE problem, but rather something with
>the SSH daemon on the node. You are using SSH; wasn't the default rsh
>sufficient for you?
>

Rocks disables rsh by default, so the cluster is set up to use ssh. I'm trying to enable rsh to see whether that works, but have hit a "poll: protocol failure in circuit setup" error.

>Do you find anything in /var/log/warn, /var/log/messages and the like?

No, there isn't anything in the logs.

>
>Is it happening on random hosts or always on the same ones, which would
>point to a setup problem on a particular node?

It's different hosts each time.
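
To record which hosts fail on a given run, each launch's exit status can be collected separately. The following is only a rough sketch: the function names probe_hosts and launch_one are made up for illustration, and the qrsh flags in the usage comment are the ones from the test script quoted further down.

```shell
#!/bin/bash
# Run a launcher once per host, in parallel, and report which hosts failed.
# Usage: probe_hosts MACHINES_FILE LAUNCHER
# LAUNCHER is any command or shell function that takes the host name as its
# only argument and returns non-zero on failure (hypothetical helper).
probe_hosts() {
    local machines=$1 launcher=$2
    local -a pids hosts
    local h i n=0 failed=0

    # Start one background launch per line of the machines file.
    while read -r h; do
        [ -n "$h" ] || continue
        "$launcher" "$h" </dev/null >/dev/null 2>&1 &
        pids[n]=$!
        hosts[n]=$h
        n=$((n + 1))
    done < "$machines"

    # Wait for each launch individually so failures can be tied to a host.
    for ((i = 0; i < n; i++)); do
        if ! wait "${pids[$i]}"; then
            echo "FAILED: ${hosts[$i]}"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $n launches failed"
}

# Example use inside an SGE job script (not executed here):
#   launch_one() {
#       /opt/gridengine/bin/lx26-amd64/qrsh -inherit -noshell -nostdin -V "$1" hostname
#   }
#   probe_hosts "$TMPDIR/machines" launch_one
```

Comparing the FAILED lines across several runs would show whether the failures really move between hosts or cluster on a few of them.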


>
>One option could be to increase "MaxStartups" in sshd_config on the
>machines.

I tried increasing MaxStartups in sshd_config but that didn't help.

I also tried adding "ConnectionAttempts 4" to ssh_config but that didn't help either.
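
Since sshd only reads sshd_config when it starts or is reloaded, it may be worth confirming that the changed values are actually present (and uncommented) on every node. A minimal sketch, assuming the usual OpenSSH file locations; show_setting is a made-up helper name, not part of OpenSSH or Grid Engine:

```shell
#!/bin/bash
# Print the uncommented lines of a config file that set a given keyword.
# Usage: show_setting FILE KEYWORD
# Keywords are matched case-insensitively, as OpenSSH itself does.
show_setting() {
    local file=$1 keyword=$2
    grep -Ei "^[[:space:]]*${keyword}([[:space:]]|=)" "$file" \
        || echo "no uncommented ${keyword} line found in ${file}"
}

# Example use on a compute node (paths are assumptions, not executed here):
#   show_setting /etc/ssh/sshd_config MaxStartups
#   show_setting /etc/ssh/ssh_config  ConnectionAttempts
```

If the MaxStartups line shows up as expected, the remaining thing to check is that sshd was restarted or reloaded on that node after the edit.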

>
>
>> It works OK for a small number of nodes but fails when I try on 32
>> nodes.
>>
>> I got a similar error message when I did this test:
>>
>> #!/bin/bash
>> #$ -S /bin/bash
>> #$ -cwd
>> #$ -pe mpich 44
>> for h in $(<$TMPDIR/machines) ; do
>>     /opt/gridengine/bin/lx26-amd64/qrsh -inherit -noshell -nostdin -V $h hostname &
>> done
>> wait
>>
>> so it appears to be a qrsh issue rather than an OpenMPI issue. Are
>> there any settings I can change to fix this?
>>
>> The version of GridEngine is sge-V60u8-1 (installed as part of
>> Rocks 4.2.1).
>
>You could upgrade to 6.2 and try the new builtin startup method.

I would prefer not to but it may be necessary.

Thanks,
Nick Wilson
----
Environment and Health Research Division
Fujitsu Laboratories of Europe

>
>-- Reuti
>
>
>> Thanks,
>> Nick Wilson
>> ----
>> Environment and Health Research Division
>> Fujitsu Laboratories of Europe
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90024
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
>
>------------------------------------------------------
>http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90161
>
>To unsubscribe from this discussion, e-mail:
>[users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90281

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


