[GE users] MPICH job - 1 failed host

reuti reuti at staff.uni-marburg.de
Thu Dec 4 11:20:23 GMT 2008


Am 04.12.2008 um 03:38 schrieb Botka, Christopher:

> Does anyone know what happens to an MPI job when a host dies?  I  
> have a somewhat long (for me) MPICH job using 14 slots.  I had one  
> slave node with 2 slots fail.  Rec'd an immediate notification of  
> the job failure on the host for both slots.  However, the rest are  
> still running - and the application is still sending progress data  
> to the stdout.  I can't see results until the job completes,

Why wait? You can peek at the output files at any time while the job runs.

> but the remainder of the hosts have continued to run the job for  
> >24 hours post failure.

Is it really doing something on the nodes, or just still showing up  
in qstat?

> It should be relatively easy to determine if the output is  
> generally correct once the job completes.  Just wondering if anyone  
> has any personal experience with this sort of thing.  I am not  
> anxious to start over, but don't want to waste the cycles if I  
> shouldn't.
> SGE 6.1u4 on fedora 7/8
> MEME v3.5.7, mpich2 v1.0.7


So the job should abort, but since SGE assumes a network failure, it  
waits for the node to come back. Have a look at "max_unheard" and  
"reschedule_unknown" in SGE's global configuration (qconf -mconf).
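
A minimal sketch of how to inspect those two settings, assuming a standard SGE installation with qconf on the PATH (the egrep pattern is just a convenience for filtering the output):

```shell
# Show the global cluster configuration and filter the relevant lines.
# "max_unheard" sets how long qmaster goes without contact from an
# execd before flagging the host as unknown; "reschedule_unknown"
# controls whether (and after how long) jobs on such a host are
# rescheduled. The default of 00:00:00 means "never reschedule".
qconf -sconf | egrep 'max_unheard|reschedule_unknown'

# To change either value, edit the global configuration interactively:
qconf -mconf
```

With reschedule_unknown left at 00:00:00, SGE will keep the job around indefinitely waiting for the lost node, which matches the behaviour you are seeing.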

-- Reuti

