[GE users] Intermittent daemon on node failed to start

reuti reuti at staff.uni-marburg.de
Fri Jan 22 13:46:31 GMT 2010


Hi,

On 22.01.2010 at 11:22, cgull wrote:

> We are seeing jobs crash intermittently with an error like this:
> [nuvo:25943] ERROR: A daemon on node nuvo2 failed to start as expected.
> [nuvo:25943] ERROR: There may be more information available from
> [nuvo:25943] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [nuvo:25943] ERROR: If the problem persists, please restart the
> [nuvo:25943] ERROR: Grid Engine PE job
> [nuvo:25943] ERROR: The daemon exited unexpectedly with status 1.

I assume you are using Open MPI.

A good place to start is the messages files in the spool directories.
By default these are $SGE_ROOT/default/spool/<nodename>/messages for
each execution host and $SGE_ROOT/default/spool/qmaster/messages for
the qmaster.
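
As a rough sketch of that first check, the helper below prints only the
most recent error and warning entries from one messages file. The name
show_sge_errors is made up for illustration, and the filter assumes the
usual pipe-separated layout in which one field holds a severity letter
(E for error, W for warning); adjust the pattern if your files differ.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Assumption: entries look like "date time|component|host|E|text",
# i.e. the severity letter sits between two pipe characters.
show_sge_errors() {
    file=$1            # path to a messages file
    count=${2:-20}     # how many matching lines to show (default 20)
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # Keep error and warning entries only, newest last.
    grep -E '\|[EW]\|' "$file" | tail -n "$count"
}
```

With the default paths above, this could be run on the failing node as
show_sge_errors "$SGE_ROOT/default/spool/nuvo2/messages" and on the
master as show_sge_errors "$SGE_ROOT/default/spool/qmaster/messages".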

Does this happen only from time to time?

Is a filesystem full on any of the nodes, in this case nuvo2?

Is the tmpdir configured in the queue setup missing on nuvo2, or does
it have the wrong permissions?
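
The last two questions can be checked together with something like the
sketch below, run on nuvo2 itself as the job's user. The function name
check_tmpdir and the 1024 KB default threshold are arbitrary choices
for illustration, not anything Grid Engine defines.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine.
# Reports whether a directory exists, is writable by the current user,
# and has at least a minimum amount of free space.
check_tmpdir() {
    dir=$1               # directory to check, e.g. the queue's tmpdir
    min_kb=${2:-1024}    # minimum free space in KB (assumed default)
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    # df -Pk gives POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "low space: $dir (${free_kb} KB free)"
        return 1
    fi
    echo "ok: $dir (${free_kb} KB free)"
}
```

The directory to pass in is the tmpdir value shown by
qconf -sq <queue>, where <queue> stands for your queue name.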

-- Reuti


> The job then hangs.
> If we then rerun the job (same script) it runs fine.
>
> Doing a "qstat -t" gives this output:
>
>                                                                   MASTER         r 00:41:01 0.73703 0.00000
>                                                       nuvo@nuvo  SLAVE  8.nuvo  r 00:00:00 0.00034 0.00000
>                                                       nuvo@nuvo  SLAVE
> 864 0.55500 UK47-RUN93 mlayton r 01/21/2010 05:52:28  nuvo@nuvo2 SLAVE
>                                                       nuvo@nuvo2 SLAVE
> 864 0.55500 UK47-RUN93 mlayton r 01/21/2010 05:52:28  nuvo@nuvo3 SLAVE
>                                                       nuvo@nuvo3 SLAVE  8.nuvo3 r 00:00:00 0.00034 0.00000
> 864 0.55500 UK47-RUN93 mlayton r 01/21/2010 05:52:28  nuvo@nuvo4 SLAVE
>                                                       nuvo@nuvo4 SLAVE  8.nuvo4 r 00:00:00 0.00037 0.00000
> 864 0.55500 UK47-RUN93 mlayton r 01/21/2010 05:52:28  nuvo@nuvo5 SLAVE
>                                                       nuvo@nuvo5 SLAVE  8.nuvo5 r 00:00:00 0.00068 0.00000
> 864 0.55500 UK47-RUN93 mlayton r 01/21/2010 05:52:28  nuvo@nuvo6 SLAVE
>                                                       nuvo@nuvo6 SLAVE  8.nuvo6 r 00:00:00 0.00034 0.00000
>
> So it looks like the task ID has disappeared from this job for nuvo2
> (or was never launched, as the error hints)?
>
> How can we find out why this daemon did not launch and what
> happened, so that we can solve this issue?
>
> Thanks for your time in advance.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240347
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240379
