[GE users] MPICH job - 1 failed host
reuti at staff.uni-marburg.de
Thu Dec 4 11:20:23 GMT 2008
On 04.12.2008 at 03:38, Botka, Christopher wrote:
> Does anyone know what happens to an MPI job when a host dies? I
> have a somewhat long (for me) MPICH job using 14 slots. I had one
> slave node with 2 slots fail. Rec'd an immediate notification of
> the job failure on the host for both slots. However, the rest are
> still running - and the application is still sending progress data
> to the stdout. I can't see results until the job completes,
Why? You can peek at the output files at any time.
> but the remainder of the hosts have continued to run the job for
> >24 hours post failure.
Is it really doing something (i.e. on the nodes) or just showing up
> It should be relatively easy to determine if the output is
> generally correct once the job completes. Just wondering if anyone
> has any personal experience with this sort of thing. I am not
> anxious to start over, but don't want to waste the cycles if I
> SGE 6.1u4 on fedora 7/8
> MEME v3.5.7, mpich2 v1.0.7
So the job should abort, but since SGE assumes a network failure, it
waits for the node to return. You can have a look at "max_unheard"
and "reschedule_unknown" in SGE's configuration (qconf -mconf).
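
To check those two settings without opening the editor, one option is to read the output of "qconf -sconf" (the show-only counterpart of -mconf) and pick out the values. The following is only a rough sketch: the helper names and the sample text are made up for illustration, the sample values are not from the poster's cluster, and the reading of 00:00:00 as "no automatic rescheduling" is my understanding of sge_conf(5), so please verify it against your version.

```python
def parse_sge_conf(text):
    """Parse 'name value' lines, as printed by `qconf -sconf`, into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def hms_to_seconds(value):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(x) for x in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Illustrative sample only; on a real cluster, paste in the output of
# `qconf -sconf` instead.
SAMPLE = """\
max_unheard                  00:05:00
reschedule_unknown           00:00:00
"""

conf = parse_sge_conf(SAMPLE)
unheard = hms_to_seconds(conf["max_unheard"])
resched = hms_to_seconds(conf["reschedule_unknown"])

print("host marked unknown after", unheard, "s without contact")
if resched == 0:
    # Assumption: zero means jobs on unknown hosts are not rescheduled.
    print("reschedule_unknown is 00:00:00: jobs stay put until the node returns")
else:
    print("jobs on unknown hosts may be rescheduled after", resched, "s")
```

With reschedule_unknown left at zero, the behaviour would match what was reported: the job stays in place while SGE waits for the node.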