[GE users] MPI woes
heine at sun.ac.za
Thu Nov 4 13:07:04 GMT 2010
Whenever I try to submit a job that spans more than one node, for example using allocation_rule $round_robin in a PE, I get the following error:
error: commlib error: got read error (closing "comp009/execd/1")
error: commlib error: got read error (closing "comp010/execd/1")
error: executing task of job 62162 failed: failed sending task to execd@comp009: can't find connection
error: executing task of job 62162 failed: failed sending task to execd@comp010: can't find connection
I have tried both Open MPI and MPICH2, in several versions, and the result is always the same. If I use a submit script that only copies $TMPDIR/machines, submit the job, and then run the following command with that machines file, it works 100%: 'mpiexec -np 3 --machinefile machines hostname'. But putting the exact same command back into the submit script produces the above error once more.
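For reference, a submit script of the kind described might look roughly like the sketch below. The PE name mpi_rr, the slot count in the -pe line, and the mpi_command helper are assumptions made for illustration and are not taken from the actual setup. It also assumes the machines file holds one line per slot.

```sh
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe mpi_rr 3    # hypothetical PE name; assumed to use allocation_rule $round_robin

# Print the mpiexec command line for a given machines file.
# Assumption: the file has one line per slot, so the number of
# non-empty lines is used as the process count.
mpi_command() {
    machinefile=$1
    nprocs=$(grep -c . "$machinefile")
    echo "mpiexec -np $nprocs --machinefile $machinefile hostname"
}

# Only act inside a Grid Engine job, where JOB_ID and TMPDIR are set
# and the PE start procedure has written $TMPDIR/machines.
if [ -n "${JOB_ID:-}" ] && [ -f "${TMPDIR:-}/machines" ]; then
    # Keep a copy of the machines file so it can be inspected afterwards.
    cp "$TMPDIR/machines" ./machines
    cmd=$(mpi_command ./machines)
    echo "running: $cmd"
    $cmd
fi
```

Removing the final `$cmd` line gives the variant that only copies the machines file, so the same mpiexec command can be run separately for comparison.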
Any help would be appreciated.