[GE users] MPI woes

heine heine at sun.ac.za
Thu Nov 4 13:07:04 GMT 2010



Whenever I submit a job that spans more than one node (for example, through a PE with allocation_rule $round_robin), I get the following errors:

error: commlib error: got read error (closing "comp009/execd/1")
error: commlib error: got read error (closing "comp010/execd/1")
error: executing task of job 62162 failed: failed sending task to execd@comp009: can't find connection
error: executing task of job 62162 failed: failed sending task to execd@comp010: can't find connection

I have tried both Open MPI and MPICH2, in several versions, and the result is always the same. If I submit a job whose script only copies $TMPDIR/machines, and then run the following by hand with that machines file, it works 100%: 'mpiexec -np 3 --machinefile machines hostname'. But putting the exact same command back into the submit script produces the above errors again.
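
For reference, a minimal sketch of the kind of submit script described above. The PE name "mpi", the job name, the slot count in the "-pe" line, the run_job wrapper and the MPIEXEC override are illustrative assumptions, not the exact setup on this cluster; only $TMPDIR/machines and the mpiexec command line come from the description above.

```shell
#!/bin/sh
# Sketch of a Grid Engine submit script. The PE name "mpi", the job name
# and the slot count are placeholders (assumptions), not real settings.
#$ -N mpi_hostname_test
#$ -cwd
#$ -pe mpi 3

# Hypothetical override so the script can be dry-run without MPI installed.
MPIEXEC="${MPIEXEC:-mpiexec}"

run_job() {
    # Keep a copy of the host list found in the job's $TMPDIR.
    cp "$TMPDIR/machines" ./machines || return 1
    # The same command that works when run by hand.
    "$MPIEXEC" -np 3 --machinefile ./machines hostname
}

# JOB_ID is set by Grid Engine inside a running job, so the command
# only runs when the script is actually executed as a job.
if [ -n "${JOB_ID:-}" ]; then
    run_job
fi
```

Submitted with qsub, this runs the mpiexec line inside the job, which is the case that fails. Copying the machines file and running the same line by hand is the case that works.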

Any help would be appreciated.

Heine


