[GE users] Scheduler died unexpectedly

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Fri Jan 25 15:13:46 GMT 2008



Hi Nikhil,

On Thu, 24 Jan 2008, Mulley, Nikhil wrote:

> [Probably the question is wrong, or is it entirely wrong to expect schedd to dump a core? I see that many people have asked about core files in the past, but sadly there has been no response to them. Is it going to be the same here? Please, no.]

The reason is that core file creation is highly OS-dependent. As an example, see
Ron's postings on the Solaris utility coreadm(1):

    http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=2453
    http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=2938

Below you ask for Solaris binaries, so I assume your scheduler runs under Solaris,
where you have coreadm(1).
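
Independent of coreadm(1), one generic POSIX prerequisite for any core file is a non-zero core size limit in the process that starts the daemon. The following is only a minimal Python sketch of checking and raising that limit; it is not Grid Engine code, and whether this limit is what prevents a core in your case is an assumption. The function name allow_core_dumps is made up for illustration.

```python
import resource


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit.

    Returns the (soft, hard) pair that is in effect afterwards.
    A soft limit of 0 means the kernel will not write a core file.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = allow_core_dumps()
    if soft == 0:
        print("core dumps are still disabled (hard limit is 0)")
    else:
        print("core size limit now:", soft)
```

The limit is inherited by child processes, so it only helps if it is applied in the shell or script that launches schedd.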

> [The subject of the discussion could well be turned into a plea: please make the v6.1u4 binaries available, at least for Solaris.]
>
> So it turned out to be a problem with a memory leak in the scheduler, as I see the scheduler dying at least once a day, is it not? I am using v6.1u3, BTW. This problem is documented in
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187 and seems to be fixed, or at least the fix is available in V61_BRANCH for v6.1u4.

I see no correlation between your scheduler dying and the memory leak. If the
memory leak were the cause, you would surely have observed the master host
swapping before the schedd crash.
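
To check this, one could log the size of the schedd process over time and see whether it keeps growing before a crash. The following is a rough Python sketch under several assumptions: the helper names and the growth threshold are made up, and it relies on ps accepting "-o rss= -o vsz=" and printing kilobytes, which should be verified on your platform.

```python
import subprocess


def parse_ps_sizes(line):
    """Parse one 'rss vsz' line into a pair of integers (assumed kilobytes)."""
    rss, vsz = line.split()
    return int(rss), int(vsz)


def sample_sizes(pid):
    """Ask ps for the resident and virtual size of one process."""
    # Separate -o options are used because the text after '=' is a header
    # and extends to the end of the option argument.
    out = subprocess.check_output(
        ["ps", "-o", "rss=", "-o", "vsz=", "-p", str(pid)]
    )
    return parse_ps_sizes(out.decode())


def grows_steadily(samples, min_growth_kb):
    """True if the samples never shrink and grow by at least min_growth_kb."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] - samples[0] >= min_growth_kb
```

A series that grows steadily up to the moment of the crash would point towards the leak; a flat series would point elsewhere.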

> Thanks Andreas for the changes.
>
> I see that v6.1u4 binaries are not available to the public yet, are there any plans to make them public anytime soon?

Joachim or Andy may be able to tell you that. To my knowledge 6.1u4 is not yet
complete, and my contribution for #2187 was just a minor part of u4.

> Andreas, may I please ask you to provide the v6.1u4 binaries, at least for the solaris-amd64 and solaris-x86 architectures?

I would do so if I were convinced this would help you.

> I shall be very happy to test it in my environment, and I desperately want to avoid the nightmare of restarting the scheduler in v6.1u3 every time it dies.

Then please let us try to understand the schedd crash in your particular case.
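
The earlier suggestion quoted below (run schedd under a debugger, with SGE_ND set so that it does not daemonize) could be scripted roughly as follows. This is only a sketch: it assumes gdb rather than dbx, the value "1" for SGE_ND is an assumption (the advice only says the variable must be present), and the path to sge_schedd has to be supplied by you.

```python
import os
import subprocess
import sys


def debugger_invocation(schedd_path, base_env=None):
    """Build the gdb command line and environment for a foreground schedd."""
    env = dict(os.environ if base_env is None else base_env)
    # SGE_ND keeps schedd from daemonizing; the value "1" is an assumption.
    env["SGE_ND"] = "1"
    # --batch runs the -ex commands in order and then exits: start the
    # program, and once it stops (for example on a crash) print the stack.
    argv = ["gdb", "--batch", "-ex", "run", "-ex", "bt", "--args", schedd_path]
    return argv, env


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python run_schedd_gdb.py /path/to/sge_schedd
    argv, env = debugger_invocation(sys.argv[1])
    sys.exit(subprocess.call(argv, env=env))
```

If schedd exits normally, gdb has no stack to print; the backtrace is only useful when the process stops on a signal.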

Thanks,
Andreas

>
> Thanks,
> Nikhil
>
> -----Original Message-----
> From: Mulley, Nikhil
> Sent: Sunday, January 13, 2008 4:27 PM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Scheduler died unexpectedly
>
> Can I make schedd dump core the next time it dies? That would perhaps allow me to generate a post-mortem report.
>
> -----Original Message-----
> From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM]
> Sent: Friday, January 11, 2008 6:10 PM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Scheduler died unexpectedly
>
> Hi Nikhil,
>
> The best way is to run schedd under the control of dbx/gdb. That way you
> don't need a core dump to get the stack trace.
>
> Note that you must have SGE_ND set in the environment to prevent schedd from daemonizing.
>
> Regards,
> Andreas
>
> On Thu, 10 Jan 2008, Mulley, Nikhil wrote:
>
>> Is there a means of enabling scheduler debugging?
>>
>> -----Original Message-----
>> From: Mulley, Nikhil
>> Sent: Thursday, January 10, 2008 1:46 PM
>> To: users at gridengine.sunsource.net
>> Subject: [GE users] Scheduler died unexpectedly
>>
>> I want to find out why and how the scheduler died. I am using SGE
>> v6.0.11. Can any (forensic) report be generated that shows why the
>> scheduler died in the first place?
>>
>> The first thing I noticed was that the scheduler had died: schedd.pid
>> referred to a non-existent PID on my qmaster host (taken from the
>> act_qmaster file). I was wondering why shadowd did not notice this and
>> did not start schedd/qmaster on one of the shadow masters. Can this
>> mechanism be expected from the host running shadowd?
>>
>> Thanks,
>> Nikhil
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
> <°)))><
>
> http://gridengine.info/
>
> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>






