[GE users] MPICH job - 1 failed host

Botka, Christopher Christopher.Botka at joslin.harvard.edu
Thu Dec 4 02:38:45 GMT 2008

Does anyone know what happens to an MPI job when a host dies?  I have a somewhat long (for me) MPICH job using 14 slots.  One slave node holding 2 slots failed, and I received an immediate notification of the job failure on that host for both slots.  However, the rest of the processes are still running, and the application is still sending progress data to stdout.  I can't see results until the job completes, but the remaining hosts have continued to run the job for more than 24 hours since the failure.

It should be relatively easy to determine whether the output is generally correct once the job completes.  I'm just wondering whether anyone has personal experience with this sort of thing.  I'm not eager to start over, but I don't want to waste the cycles if I should.

SGE 6.1u4 on fedora 7/8
MEME v3.5.7, mpich2 v1.0.7

Thanks much,


