[GE users] SGE Master Daemon died

Richard Ems Richard.Ems at cape-horn-eng.com
Mon Apr 14 14:28:18 BST 2008

Hi all!

This is SGE-6.1u2, running on openSUSE-10.3-64bit.

Last Saturday I noticed that the sge_qmaster process was no longer 
running (well, NAGIOS checked it and mailed me! 8) ).
I then restarted SGE and saw both sge_qmaster and sge_schedd start 
up, but sge_qmaster died again.
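
For anyone who wants a similar check: below is a minimal sketch of a Nagios-style process check, not the actual check I run. It assumes `pgrep` is available on the qmaster host; the function name `check_process` and the `runner` parameter are made up for illustration. Exit codes 0 and 2 are the usual Nagios plugin codes for OK and CRITICAL.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_process(name, runner=subprocess.run):
    """Return (exit_code, message) depending on whether `name` is running.

    `runner` is a hypothetical hook so the check can be exercised without
    a real process table; by default it calls pgrep via subprocess.run.
    """
    # pgrep -x matches the process name exactly and exits non-zero
    # when nothing matches.
    result = runner(
        ["pgrep", "-x", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    pids = result.stdout.decode().split()
    if result.returncode == 0 and pids:
        return OK, "OK - %s running (pid %s)" % (name, ", ".join(pids))
    return CRITICAL, "CRITICAL - %s not running" % name
```

A wrapper script would call `check_process("sge_qmaster")`, print the message and exit with the returned code so that Nagios can pick it up.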

In spool/qmaster/messages I found:

04/12/2008 15:49:33|qmaster|c3m|I|starting up GE 6.1u2 (lx24-amd64)
cqueue == NULL("(null)", "(null)", 1, 0
04/12/2008 15:49:53|qmaster|c3m|E|writing job finish information: can't locate queue "(null)@(null)"
04/12/2008 15:49:53|qmaster|c3m|W|job 35026.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19
04/12/2008 15:49:53|qmaster|c3m|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
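
To pull the serious entries out of a long messages file, something like the following could help. It is only a sketch based on the lines above: it assumes every regular entry has the form `timestamp|component|host|level|message` on a single line (unwrapped, as in the file itself rather than in this mail), and it treats the level letters E and C as the ones worth looking at, which is my reading of them. The names `Entry`, `parse_messages` and `severe` are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_messages(text: str) -> Tuple[List[Entry], List[str]]:
    """Split messages-file text into structured entries and stray lines."""
    entries: List[Entry] = []
    stray: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # At most 4 splits, so a '|' inside the message text is preserved.
        parts = line.split("|", 4)
        if len(parts) == 5 and len(parts[3]) == 1:
            entries.append(Entry(*parts))
        else:
            # Lines such as the 'cqueue == NULL(...' output do not follow
            # the pipe-separated layout; keep them separately.
            stray.append(line)
    return entries, stray


def severe(entries: List[Entry]) -> List[Entry]:
    """Keep only entries whose level is E or C (assumed to be the severe ones)."""
    return [e for e in entries if e.level in ("E", "C")]
```

Reading the file and calling `severe(parse_messages(text)[0])` would then list only the E and C lines, for example the QU_rerun entry above.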

So I checked job 35026 and I realized that the node this job was running 
on was not reachable anymore, at least a login was not possible, ping 

But why did sge_qmaster die with this error? In the past I have already 
had many nodes die (mostly hard discs hanging), but SGE always handled 
it nicely, continuing to do its job and not using the dead node.
Why did sge_qmaster die this time?
What can I do to avoid this?

The only change I made last Friday was setting "schedd_job_info" to 
"false", because of the "memory leak / immense memory consumption" 
issue, see 
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 .

Thanks for any help, Richard

Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
