[GE users] sgemaster keeps crashing 6.2u4

mhanby mhanby at uab.edu
Mon Mar 1 15:13:03 GMT 2010


The /var/log/mcelog is empty and according to Dell OpenManage, the ECC memory is all in good running order.

My test with the replacement binaries and a newly compiled binary still resulted in crashes. Also, disabling schedd_job_info had no effect.

The qmaster host is also a Lustre client, so we are stuck at RH kernel 2.6.18-128.7.1 for the short term since Lustre 1.8.1.1 requires that kernel. We plan to update some time this spring to 1.8.2, which supports the RHEL 5.4 kernel.

On a good note, the qmaster process has been running for approximately 36 hours since the last crash.
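
For anyone wanting to tally crashes like these, here is a minimal sketch, assuming the kernel lines look like the ones quoted below from dmesg. The names parse_crashes and decode_page_fault are made up for illustration and are not part of Grid Engine. It groups the crashes by instruction pointer (rip) and decodes the x86 page-fault error code printed on the segfault lines.

```python
import re
from collections import Counter

# Matches both kernel message shapes quoted in this thread, e.g.
#   sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
#   sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
CRASH_RE = re.compile(
    r"sge_qmaster\[(?P<pid>\d+)\]:?\s+"
    r"(?P<kind>segfault|general protection)"
    r"(?:\s+at\s+(?P<addr>[0-9a-f]+))?"
    r"\s+rip[: ](?P<rip>[0-9a-f]+)"
    r"\s+rsp[: ](?P<rsp>[0-9a-f]+)"
    r"\s+error[: ](?P<err>[0-9a-f]+)"
)


def parse_crashes(text):
    """Return one dict per crash line, tolerating '>' quoting and mail line wraps."""
    # Drop quote markers, then rejoin wrapped lines with single spaces.
    joined = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    crashes = []
    for m in CRASH_RE.finditer(joined):
        crashes.append({
            "pid": int(m["pid"]),
            "kind": m["kind"],
            "addr": int(m["addr"], 16) if m["addr"] else None,
            "rip": int(m["rip"], 16),
            "err": int(m["err"], 16),
        })
    return crashes


def decode_page_fault(err):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


# The seven crash lines as they appear (wrapped and quoted) in this thread.
SAMPLE = """\
>>>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0
>>>> error:0
>>>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip
>>>> 000000000058066d rsp 00000000482c48b0 error 4
>>>> sge_qmaster[11800]: segfault at 0000000000000070 rip
>>>> 0000000000580b1d rsp 00000000481b9a70 error 4
>>>> sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d
>>>> rsp 000000004853ca70 error 4
>>>> sge_qmaster[7349]: segfault at 000000000000000f rip
>>>> 000000000058066d rsp 000000004858c8b0 error 4
>>>> sge_qmaster[8148] general protection rip:58066d rsp:482668b0
>>>> error:0
>>>> sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d
>>>> rsp 0000000047a0b8b0 error 4
"""

if __name__ == "__main__":
    crashes = parse_crashes(SAMPLE)
    for rip, count in Counter(c["rip"] for c in crashes).most_common():
        print(f"rip {rip:#x}: {count} crash(es)")
    for c in crashes:
        # The bit decoding only applies to page faults, not general protection faults.
        if c["kind"] == "segfault":
            print(c["pid"], hex(c["addr"]), decode_page_fault(c["err"]))
```

On the seven lines quoted below, this groups five crashes at rip 0x58066d and two at 0x580b1d, and "error 4" on the segfault lines decodes to a user-mode read of a page that was not present. That only shows the crashes cluster at two code locations; it does not by itself say why.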


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Friday, February 26, 2010 5:14 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sgemaster keeps crashing 6.2u4

On 25.02.2010 at 20:00, mhanby wrote:

> The binaries are compiled from source using ge-V62u5_TAG-src.tar.gz
>
> The kernel is x86_64 2.6.18-128.7.1.el5 to support Lustre 1.8.1.1
>
> Can you think of any problem with swapping out the source compiled
> sge_qmaster with the one provided in the courtesy binary tar file?
> If not, I'll try that and see if it is any more / less stable.

It's at least worth a test. It could even be the version of the
compiler you used or one of the libs.

Are there no known memory problems in this machine (it uses ECC RAM
and has an empty /var/log/mcelog)?

The worst case I have seen so far was a delivered binary of an
application for quantum chemistry. It worked on all machines
except two: the CPUs in these were the same model, but in those two
machines they had a newer stepping. After we complained we got a fixed
update and the segfault was gone.

-- Reuti

PS: Unrelated to this issue: is Red Hat still at version 5, with
this version of the kernel being shipped to this day?


> Thanks, Mike
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, February 25, 2010 12:03 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sgemaster keeps crashing 6.2u4
>
> Hi,
>
> On 25.02.2010 at 18:57, mhanby wrote:
>
>> Howdy, we upgraded the cluster to GE 6.2u5 and sge_qmaster
>> continues to crash.
>>
>> Is there any way to increase the logging level in
>> $SGE_ROOT/$SGE_CELL/spool/qmaster/messages, to get a better idea of
>> what may be leading up to the crash?
>
> in principle yes:
>
> loglevel                     log_info
>
> in SGE's configuration. Are you now using the courtesy binaries from
> Sun/Oracle? Which kernel version is running in Rocks right now?
>
> -- Reuti
>
>
>>
>> I have schedd_job_info set to true; I'll try setting it to false to
>> see if it makes a difference (although having it enabled is very
>> handy for debugging).
>>
>> -----Original Message-----
>> From: mhanby [mailto:mhanby at uab.edu]
>> Sent: Sunday, February 21, 2010 10:05 AM
>> To: users at gridengine.sunsource.net
>> Cc: users at gridengine.sunsource.net
>> Subject: Re: [GE users] sgemaster keeps crashing 6.2u4
>>
>> I may be able to, although this is a Rocks cluster, so I'll have to
>> think about how to do it and still keep the compute nodes in sync.
>>
>> On Feb 21, 2010, at 6:44, "reuti" <reuti at staff.uni-marburg.de> wrote:
>>
>>> Hi,
>>>
>>> On 19.02.2010 at 16:42, mhanby wrote:
>>>
>>>> For some reason, sge_qmaster is on a roll this time: 5 more crashes
>>>> since 3 PM yesterday.
>>>>
>>>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>>>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>>>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>>>> sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
>>>> sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
>>>> sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
>>>> sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
>>>>
>>>> I now have Nagios monitoring the sge_qmaster processes to alert me
>>>> when it's missing.
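
The thread does not say how the Nagios check is implemented. As a purely hypothetical sketch, a plugin along these lines would do the job, assuming pgrep is available on the qmaster host; the function names and message wording are made up for illustration. It follows the usual Nagios plugin convention of exit code 0 for OK, 2 for CRITICAL and 3 for UNKNOWN.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(pids):
    """Map the list of sge_qmaster PIDs to a Nagios (exit code, message) pair."""
    if not pids:
        return CRITICAL, "QMASTER CRITICAL - no sge_qmaster process found"
    return OK, f"QMASTER OK - sge_qmaster running (pid {pids[0]})"


def find_qmaster_pids():
    """Return PIDs of processes named exactly sge_qmaster, using pgrep -x."""
    result = subprocess.run(
        ["pgrep", "-x", "sge_qmaster"], capture_output=True, text=True
    )
    # pgrep exits 0 on a match and 1 when nothing matched; anything else is an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip() or "pgrep failed")
    return [int(pid) for pid in result.stdout.split()]


def main():
    try:
        code, message = evaluate(find_qmaster_pids())
    except (OSError, RuntimeError) as exc:
        code, message = UNKNOWN, f"QMASTER UNKNOWN - {exc}"
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the decision in evaluate() separate from the pgrep call makes the alerting logic easy to check without a running qmaster.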
>>>>
>>>> What is the effect on the existing 'running' jobs if they complete
>>>> while qmaster is dead? Will the jobs be reported as having completed
>>>> successfully once qmaster starts and processes the backlog?
>>>>
>>>> -----Original Message-----
>>>> From: mhanby [mailto:mhanby at uab.edu]
>>>> Sent: Thursday, February 18, 2010 3:00 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: [GE users] sgemaster keeps crashing 6.2u4
>>>
>>> I forgot: can you try to upgrade to 6.2u5?
>>>
>>> -- Reuti
>>>
>>>
>>>> Howdy,
>>>>
>>>> I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster
>>>> keeps crashing on this machine following a reboot:
>>>>
>>>> Here are 3 crashes over the past couple of hours (from the dmesg
>>>> log):
>>>>
>>>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>>>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>>>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>>>>
>>>> And this is what is logged in
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:
>>>>
>>>> 02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
>>>> 02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>>> 02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>>> 02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
>>>> 02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>>> 02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>>> 02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
>>>> 02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
>>>> 02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>>> 02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>>>
>>>>
>>>> Following previous reboots where this occurred, eventually it would
>>>> stabilize and remain running for weeks.
>>>>
>>>> Any ideas what may be causing sgemaster to segfault?
>>>>
>>>> Thanks, Mike
>>>>
>>>> =================================
>>>> Mike Hanby
>>>> mhanby at uab.edu
>>>> Information Systems Specialist II
>>>> IT HPCS / Research Computing
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245098
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=246174

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].