[GE users] SGE Master Daemon died (2) / can't locate queue "(null)@(null)"

Richard Ems Richard.Ems at cape-horn-eng.com
Tue Apr 22 10:34:24 BST 2008


Hi all, hi Andreas!

This is SGE-6.1u2, running on openSUSE-10.3-64bit.

On Saturday, 12 April, I noticed that the sge_qmaster process was no
longer running (well, Nagios checked it and mailed me! 8) ).
I then restarted SGE and saw both sge_qmaster and sge_schedd start
again, but sge_qmaster died again shortly afterwards.

In spool/qmaster/messages I found:

04/12/2008 15:49:33|qmaster|c3m|I|starting up GE 6.1u2 (lx24-amd64)
04/12/2008 15:49:53|qmaster|c3m|E|cqueue_list_locate_qinstance("(null)@(null)"): cqueue == NULL("(null)", "(null)", 1, 0
04/12/2008 15:49:53|qmaster|c3m|E|writing job finish information: can't locate queue "(null)@(null)"
04/12/2008 15:49:53|qmaster|c3m|W|job 35026.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19
04/12/2008 15:49:53|qmaster|c3m|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
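
For anyone who wants to pull only the error and critical entries out of a
messages file like the one above, here is a minimal sketch in Python. It
assumes the pipe-separated layout visible in the excerpt (timestamp, daemon,
host, severity letter, message) and that every entry starts with an
MM/DD/YYYY date. The function names parse_messages and severe are
hypothetical helpers, not part of SGE.

```python
import re

# An entry is assumed to start with a date such as 04/12/2008.
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4}")


def parse_messages(text):
    """Return (timestamp, daemon, host, severity, message) tuples.

    Lines that do not start with a date are treated as wrapped
    continuations (e.g. from a mail client) and are joined back onto
    the previous entry with a single space.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line

    parsed = []
    for entry in entries:
        # Split on the first four pipes only; the message may contain more.
        parts = entry.split("|", 4)
        if len(parts) == 5:
            parsed.append(tuple(p.strip() for p in parts))
    return parsed


def severe(parsed, levels=("E", "C")):
    """Keep only entries whose severity letter is in `levels`."""
    return [entry for entry in parsed if entry[3] in levels]
```

Calling severe(parse_messages(open(path).read())) on the messages file would
list the E and C entries only, where path is wherever your qmaster spool
directory lives.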


So I checked job 35026 and found that the node this job was running on
was no longer reachable: login was not possible, although ping still
worked.

But why did sge_qmaster die with this error? In the past I have already
had many nodes die (mostly because of hanging hard discs), but SGE always
handled it gracefully, continuing to do its job and no longer using the
dead node.
Why did sge_qmaster die this time?
What can I do to avoid this?

The only change I made last Friday was setting "schedd_job_info" to
"false" because of the "memory leak / immense memory consumption"
issue; see
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 .

Andreas, any ideas?

Thanks for any help, Richard


-- 
Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
