[GE users] SGE Master Daemon died

Reuti reuti at staff.uni-marburg.de
Mon Apr 14 14:45:15 BST 2008


Hi,

On 14.04.2008 at 15:28, Richard Ems wrote:
> This is SGE-6.1u2, running on openSUSE-10.3-64bit.
>
> Last Saturday I realized that the sge_qmaster process was not  
> running anymore (well, NAGIOS checked it and mailed me! 8) ).
> I then restarted SGE to see both sge_qmaster and sge_schedd  
> starting again, but sge_qmaster dying again.
>
> On spool/qmaster/messages I found
>
> 04/12/2008 15:49:33|qmaster|c3m|I|starting up GE 6.1u2 (lx24-amd64)
> 04/12/2008 15:49:53|qmaster|c3m|E|cqueue_list_locate_qinstance("(null)@(null)"): cqueue == NULL("(null)", "(null)", 1, 0
> 04/12/2008 15:49:53|qmaster|c3m|E|writing job finish information:  
> can't locate queue "(null)@(null)"
> 04/12/2008 15:49:53|qmaster|c3m|W|job 35026.1 failed on host  
> <unknown host> before writing exit_status because: shepherd exited  
> with exit status 19
> 04/12/2008 15:49:53|qmaster|c3m|C|!!!!!!!!!! got NULL element for  
> QU_rerun !!!!!!!!!!

Is there any file in /tmp which might give more information?
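
In case it helps with checking /tmp, here is a minimal sketch that lists files modified recently, newest first. The function name `recent_files` and the 72-hour window are only assumptions for illustration; it does not know which file names the daemons might use, so it simply shows everything recent and leaves the judgment to you.

```python
import time
from pathlib import Path


def recent_files(directory="/tmp", max_age_hours=72, now=None):
    """Return (path, mtime) pairs for regular files in `directory`
    modified within the last `max_age_hours`, newest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    hits = []
    for entry in Path(directory).iterdir():
        try:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                if mtime >= cutoff:
                    hits.append((str(entry), mtime))
        except OSError:
            # File vanished or is unreadable; skip it.
            continue
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, mtime in recent_files():
        stamp = time.strftime("%m/%d/%Y %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Run on the qmaster host, the timestamps are printed in the same date format as the messages file, so they can be compared against the 04/12/2008 15:49:53 entries directly.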

-- Reuti


> So I checked job 35026 and realized that the node this job was
> running on was not reachable anymore: a login was not
> possible, although ping still worked.
>
> But why did sge_qmaster die with this error? I have already had
> many nodes die in the past (mostly hard discs hanging), but SGE always
> reacted gracefully, continuing to do its job and no longer using the
> dead node.
> Why did sge_qmaster die this time?
> What can I do to avoid this?
>
> The only change I made last Friday was setting
> "schedd_job_info" to "false", because of the "memory
> leak / immense memory consumption", see
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 .
>
>
> Thanks for any help, Richard
>
>
> -- 
> Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com
>
> Cape Horn Engineering S.L.
> C/ Dr. J.J. Dómine 1, 5º piso
> 46011 Valencia
> Tel : +34 96 3242923 / Fax 924
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

