[GE users] Intermittent daemon on node failed to start

cgull matt.mcnally at virgin.net
Fri Jan 22 10:22:46 GMT 2010


We are seeing jobs crash intermittently with an error like the following:
[nuvo:25943] ERROR: A daemon on node nuvo2 failed to start as expected.
[nuvo:25943] ERROR: There may be more information available from
[nuvo:25943] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[nuvo:25943] ERROR: If the problem persists, please restart the
[nuvo:25943] ERROR: Grid Engine PE job
[nuvo:25943] ERROR: The daemon exited unexpectedly with status 1.
The job then hangs.
If we rerun the job with the same script, it runs fine.

Running "qstat -t" gives this output:

                 MASTER                        r     00:41:01 0.73703 0.00000
                                                                  nuvo@nuvo
                 SLAVE            8.nuvo       r     00:00:00 0.00034 0.00000
                                                                  nuvo@nuvo
                 SLAVE
    864 0.55500 UK47-RUN93 mlayton      r     01/21/2010 05:52:28 nuvo@nuvo2
                 SLAVE
                                                                  nuvo@nuvo2
                 SLAVE
    864 0.55500 UK47-RUN93 mlayton      r     01/21/2010 05:52:28 nuvo@nuvo3
                 SLAVE
                                                                  nuvo@nuvo3
                 SLAVE            8.nuvo3      r     00:00:00 0.00034 0.00000
    864 0.55500 UK47-RUN93 mlayton      r     01/21/2010 05:52:28 nuvo@nuvo4
                 SLAVE
                                                                  nuvo@nuvo4
                 SLAVE            8.nuvo4      r     00:00:00 0.00037 0.00000
    864 0.55500 UK47-RUN93 mlayton      r     01/21/2010 05:52:28 nuvo@nuvo5
                 SLAVE
                                                                  nuvo@nuvo5
                 SLAVE            8.nuvo5      r     00:00:00 0.00068 0.00000
    864 0.55500 UK47-RUN93 mlayton      r     01/21/2010 05:52:28 nuvo@nuvo6
                 SLAVE
                                                                  nuvo@nuvo6
                 SLAVE            8.nuvo6      r     00:00:00 0.00034 0.00000

So it looks like the task ID for nuvo2 has disappeared from this job (or the task was never launched, as the error message hints)?
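
The by-eye check above (which host has SLAVE rows but no task ID) could be automated with something like the following. This is only a minimal sketch, assuming the wrapped layout shown in this message, where each SLAVE row sits on the line after the line ending in its queue instance. The function name hosts_without_slave_task and the regular expression are our own illustration and not part of Grid Engine.

```python
import re
from collections import defaultdict

# Matches a trailing queue instance such as "nuvo@nuvo2". The archived
# spelling "nuvo at nuvo2" is accepted as well (assumption: the list
# archive rewrote "@" as " at ").
QUEUE_INSTANCE = re.compile(r"(\S+)(?:@| at )(\S+)\s*$")


def hosts_without_slave_task(qstat_text):
    """Return hosts that list SLAVE rows but never show a task ID."""
    task_seen = defaultdict(bool)  # host -> True once a SLAVE row has a task ID
    current_host = None
    for line in qstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SLAVE":
            if current_host is not None:
                # Assumption: any field after SLAVE starts with the task ID.
                task_seen[current_host] |= len(fields) > 1
            continue
        match = QUEUE_INSTANCE.search(line)
        if match:
            # Remember the host part of the most recent queue instance.
            current_host = match.group(2)
    return sorted(host for host, seen in task_seen.items() if not seen)


if __name__ == "__main__":
    import sys

    # Usage: qstat -t | python this_script.py
    print(hosts_without_slave_task(sys.stdin.read()))
```

Run against the listing in this message, a check like this should single out nuvo2, since every other host has at least one SLAVE row carrying a task ID.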

How can we find out more about why this daemon did not launch, and what happened to it, so that we can try to resolve the issue?

Thanks for your time in advance.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240347
