[GE users] SGE Master Daemon died (2) / can't locate queue "(null)@(null)"

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Tue Apr 22 11:02:08 BST 2008



Hi Richard,

I have seen it. A priori I can't tell why you get these crashes, since
I don't know exactly where it happens. As in so many other cases, the 
problem is that no core file gets dumped, so you can't answer my 
question either.

To improve this unpleasant situation, I'd ask you to preload the 
libcore.so (via LD_PRELOAD) from

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2552

in your sgemaster script so that you get a core dump the next time it 
happens. I think it would be appropriate to have the

    prctl(PR_SET_DUMPABLE,1,42,42,42)

call already available in the regular binaries, so that nobody has to 
fiddle around anymore, but as of now it isn't.
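
For illustration, here is a minimal sketch of what such a preload library 
could look like. It is not the libcore.so attached to the issue above; the 
file name dumpable.c and the names enable_core_dumps and dumpable_init are 
made up for this example. It assumes Linux with gcc or clang.

```c
/* dumpable.c: hypothetical preload shim, NOT the libcore.so from issue 2552 */
#include <stdio.h>
#include <sys/prctl.h>

/* Mark the calling process as dumpable. Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    /* PR_SET_DUMPABLE only evaluates the second argument (1 = dumpable). */
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) == -1) {
        perror("prctl(PR_SET_DUMPABLE)");
        return -1;
    }
    return 0;
}

/* Runs when the dynamic loader maps the library, before main() starts. */
__attribute__((constructor))
static void dumpable_init(void)
{
    (void)enable_core_dumps();
}
```

Something like "gcc -shared -fPIC -o libdumpable.so dumpable.c" should build 
it, and setting LD_PRELOAD to the resulting file in the start script would 
load it into the daemon. Two caveats: the kernel resets the dumpable flag 
whenever the process changes its effective user or group ID, so a call made 
only at load time may not survive a later privilege switch, and the core 
file size limit (ulimit -c) must also allow a dump.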

Kind regards,
Andreas


On Tue, 22 Apr 2008, Richard Ems wrote:

> Hi all, hi Andreas!
>
> This is SGE-6.1u2, running on openSUSE-10.3-64bit.
>
> On Saturday, 12 April, I noticed that the sge_qmaster process was no 
> longer running (well, NAGIOS checked it and mailed me! 8) ).
> I then restarted SGE and saw both sge_qmaster and sge_schedd start
> again, but sge_qmaster died again.
>
> In spool/qmaster/messages I found
>
> 04/12/2008 15:49:33|qmaster|c3m|I|starting up GE 6.1u2 (lx24-amd64)
> 04/12/2008 15:49:53|qmaster|c3m|E|cqueue_list_locate_qinstance("(null)@(null)"): cqueue == NULL("(null)", "(null)", 1, 0
> 04/12/2008 15:49:53|qmaster|c3m|E|writing job finish information: can't locate queue "(null)@(null)"
> 04/12/2008 15:49:53|qmaster|c3m|W|job 35026.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19
> 04/12/2008 15:49:53|qmaster|c3m|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
>
>
> So I checked job 35026 and found that the node this job was running
> on was no longer reachable: a login was not possible, although ping
> still worked.
>
> But why did sge_qmaster die with this error? In the past I have already
> had many nodes die (mostly hard discs hanging), but SGE always handled
> it nicely, continuing to do its job and no longer using the dead
> node.
> Why did sge_qmaster die this time?
> What can I do to avoid this?
>
> The only change I made last Friday was setting "schedd_job_info"
> to "false", because of the "memory leak / immense memory
> consumption"; see
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 .
>
> Andreas, any ideas?
>
> Thanks for any help, Richard
>
>
> -- 
> Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com
>
> Cape Horn Engineering S.L.
> C/ Dr. J.J. Dómine 1, 5º piso
> 46011 Valencia
> Tel : +34 96 3242923 / Fax 924
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>

http://gridengine.info/

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering


