[GE users] MPI woes
reuti at staff.uni-marburg.de
Thu Nov 4 14:39:02 GMT 2010
On 04.11.2010 at 14:07, heine wrote:
> Whenever I try to submit a job that involves more than one node (for example, using allocation_rule $round_robin in a PE), I get the following error:
> error: commlib error: got read error (closing "comp009/execd/1")
> error: commlib error: got read error (closing "comp010/execd/1")
> error: executing task of job 62162 failed: failed sending task to execd at comp009: can't find connection
> error: executing task of job 62162 failed: failed sending task to execd at comp010: can't find connection
> I have tried Open MPI and MPICH2, even in different versions, and the results remain the same. If I use a submit script that just copies $TMPDIR/machines, submit the job, and then use that machines file with the following example, it works 100%: 'mpiexec -np 3 --machinefile machines hostname'. But adding the exact same command back into the submit script generates the above error once more.
You mean that in your test you reuse the machinefile generated by the last run, but then use it on the head node of the cluster? In that case the communication will be different, since in your jobscript the `mpiexec ...` will also be executed on one of your slaves.
Anyway: what is the complete definition of your PE (qconf -sp ...) and SGE's configuration (qconf -sconf)? Are you trying to achieve a tight integration, and did you compile Open MPI with --with-sge?
With a tight integration, the parallel library won't use rsh/ssh directly, but will route the call through `qrsh -inherit ...`, and that seems to be what fails here.
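As a rough way to check both preconditions for a tight integration, something like the following sketch might help. The function names are made up for illustration. It assumes that `ompi_info` lists a gridengine component when Open MPI was built with --with-sge, and that the PE needs control_slaves set to TRUE before `qrsh -inherit` is allowed to start tasks on the slaves. Both functions only read text on stdin, so nothing is changed on the cluster.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of SGE or Open MPI.

# Reads `ompi_info` output on stdin.
# Succeeds if a gridengine component is listed (assumed to indicate --with-sge).
has_gridengine_support() {
    grep -qi 'gridengine'
}

# Reads `qconf -sp <pe_name>` output on stdin.
# Succeeds only if the PE has control_slaves set to TRUE,
# which `qrsh -inherit` needs to start tasks on the slave nodes.
pe_allows_qrsh_inherit() {
    awk '$1 == "control_slaves" { v = toupper($2) }
         END { if (v == "TRUE") exit 0; exit 1 }'
}
```

You would pipe the real output into them, e.g. `ompi_info | has_gridengine_support` and `qconf -sp <pe_name> | pe_allows_qrsh_inherit`, and look at the exit status of each.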
> Any help would be appreciated.