[GE users] qrsh connection fails

nw.gridengine.7901 at family-wilson.me.uk
Wed Nov 26 20:49:52 GMT 2008


I have a problem running OpenMPI jobs on a large number of nodes, which I hope you can help with.
OpenMPI uses qrsh to connect to the remote nodes, and I see this error message:
ssh_exchange_identification: Connection closed by remote host
[compute-0-16.local:17387] ERROR: A daemon on node compute-1-3 failed to start as expected.
[compute-0-16.local:17387] ERROR: There may be more information available from
[compute-0-16.local:17387] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-0-16.local:17387] ERROR: If the problem persists, please restart the
[compute-0-16.local:17387] ERROR: Grid Engine PE job
It works OK for a small number of nodes but fails when I try on 32 nodes.
I got a similar error message when I did this test:

#$ -S /bin/bash
#$ -cwd
#$ -pe mpich 44
for h in $(<$TMPDIR/machines) ; do
    /opt/gridengine/bin/lx26-amd64/qrsh -inherit -noshell -nostdin -V $h hostname &
done
# Wait for all background qrsh calls to finish before the job script exits
wait

so it appears to be a qrsh issue rather than an OpenMPI issue. Are there any settings I can change to fix this?
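
A serial variant of the test above might help narrow this down. The sketch below runs the same qrsh command on one host at a time and reports which hosts fail. If the serial run succeeds while the backgrounded run fails, that would suggest (but not prove) that the failures depend on how many connections are opened at once. The function name check_hosts and the LAUNCHER variable are hypothetical names introduced here for illustration; only the qrsh command line itself comes from the test above.

```shell
# Sketch only: run "hostname" on each host listed in a machines file,
# one host at a time, and print a per-host result.
# LAUNCHER is a hypothetical override; it defaults to the qrsh command
# line used in the original test.
LAUNCHER=${LAUNCHER:-"/opt/gridengine/bin/lx26-amd64/qrsh -inherit -noshell -nostdin -V"}

check_hosts() {
    machines_file=$1
    failed=0
    for h in $(cat "$machines_file"); do
        # Run in the foreground so only one connection is open at a time
        if $LAUNCHER "$h" hostname >/dev/null 2>&1; then
            echo "OK   $h"
        else
            echo "FAIL $h"
            failed=$((failed + 1))
        fi
    done
    echo "failed: $failed"
}
```

Inside the job script this would be called as check_hosts $TMPDIR/machines, after the same #$ directives as in the test above.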

The version of GridEngine is sge-V60u8-1 (installed as part of Rocks 4.2.1).

Nick Wilson
Environment and Health Research Division
Fujitsu Laboratories of Europe

