[GE users] sgemaster keeps crashing 6.2u4

mhanby mhanby at uab.edu
Fri Feb 26 20:09:56 GMT 2010


I tried recompiling the source on a system running the same kernel as a test. The resulting sge_qmaster still crashes. Oh well, shot in the dark...

I'm now testing it with 'schedd_job_info false' to see if it still crashes over time.
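
As a side note, the crash lines quoted further down can be tallied by instruction pointer to see whether the segfaults keep landing at the same addresses. Here is a minimal Python sketch. The helper name count_by_rip and the regular expression are my own assumptions, and the sample lines are the dmesg entries from this thread, rejoined where the mail client wrapped them.

```python
import re
from collections import Counter

# Matches both kernel message styles seen in this thread:
#   "... general protection rip:58066d rsp:..."
#   "... segfault at <addr> rip 000000000058066d rsp ..."
RIP_RE = re.compile(r"sge_qmaster\[(\d+)\].*?\brip[:\s]\s*([0-9a-fA-F]+)")


def count_by_rip(lines):
    """Return a Counter mapping instruction pointer (as int) to crash count."""
    counts = Counter()
    for line in lines:
        match = RIP_RE.search(line)
        if match:
            # Normalise "58066d" and "000000000058066d" to the same integer.
            counts[int(match.group(2), 16)] += 1
    return counts


# dmesg entries copied from this thread (wrapped lines rejoined).
DMESG_LINES = [
    "sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0",
    "sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4",
    "sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4",
    "sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4",
    "sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4",
    "sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0",
    "sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4",
]

if __name__ == "__main__":
    for rip, n in count_by_rip(DMESG_LINES).most_common():
        print(f"rip {rip:#x}: {n} crash(es)")
```

On these seven entries the tally has only two distinct rip values, 0x58066d (five crashes) and 0x580b1d (two crashes), so the crashes appear to recur at the same two code locations rather than at random addresses.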

Mike

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Thursday, February 25, 2010 1:00 PM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] sgemaster keeps crashing 6.2u4

The binaries are compiled from source using ge-V62u5_TAG-src.tar.gz

The kernel is x86_64 2.6.18-128.7.1.el5 to support Lustre 1.8.1.1

Can you think of any problem with swapping out the source-compiled sge_qmaster for the one provided in the courtesy binary tar file? If not, I'll try that and see whether it is any more or less stable.

Thanks, Mike

-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Thursday, February 25, 2010 12:03 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sgemaster keeps crashing 6.2u4

Hi,

On 25.02.2010 at 18:57, mhanby wrote:

> Howdy, we upgraded the cluster to GE 6.2u5 and sge_qmaster  
> continues to crash.
>
> Is there any way to increase the logging level in
> $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
> to get a better idea of what may be leading up to the crash?

In principle, yes: set

loglevel                     log_info

in SGE's configuration. Are you now using the courtesy binaries from
Sun/Oracle? Which kernel version is Rocks running right now?
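
The loglevel entry lives in the global cluster configuration, which can be edited with "qconf -mconf". With log_info enabled the messages file will carry informational entries as well, so it can help to filter it by severity. A rough Python sketch, assuming the pipe-separated layout seen in the messages excerpts in this thread (timestamp|thread|host|level|text); the function names are hypothetical, and the level letters I, W and E are simply the ones that appear in those excerpts:

```python
def parse_line(line):
    """Split one qmaster 'messages' line into its five pipe-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "text": text,
    }


def filter_levels(lines, levels=("E", "W")):
    """Keep only the parsed entries whose level letter is in `levels`."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e["level"] in levels]


# Sample entries copied from this thread (wrapped lines rejoined).
SAMPLE = [
    '02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
    '02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" '
    'in spooling context "flatfile spooling" failed writing an object',
    '02/18/2010 11:39:28|  main|cluster1|E|error opening file '
    '"/opt/gridengine/default/spool/qmaster/./sharetree" for reading: '
    'No such file or directory',
]

if __name__ == "__main__":
    for entry in filter_levels(SAMPLE):
        print(entry["timestamp"], entry["level"], entry["text"])
```

To run it against a live file, pass an open file object for $SGE_ROOT/$SGE_CELL/spool/qmaster/messages instead of SAMPLE.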

-- Reuti


>
> I have schedd_job_info set to true. I'll try setting it to false to
> see if it makes a difference (although having it enabled is very handy
> for debugging).
>
> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Sunday, February 21, 2010 10:05 AM
> To: users at gridengine.sunsource.net
> Cc: users at gridengine.sunsource.net
> Subject: Re: [GE users] sgemaster keeps crashing 6.2u4
>
> I may be able to, although this is a Rocks cluster, so I'll have to
> think about how to do it and still keep the compute nodes in sync.
>
> On Feb 21, 2010, at 6:44, "reuti" <reuti at staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> On 19.02.2010 at 16:42, mhanby wrote:
>>
>>> For some reason, this time sge_qmaster is on a roll, 5 more crashes
>>> since 3PM yesterday
>>>
>>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>>> sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
>>> sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
>>> sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
>>> sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
>>>
>>> I now have Nagios monitoring the sge_qmaster process to alert me
>>> when it's missing.
>>>
>>> What is the effect on existing running jobs if they complete
>>> while qmaster is dead? Will the jobs be reported as having
>>> completed successfully once qmaster restarts and processes the
>>> backlog?
>>>
>>> -----Original Message-----
>>> From: mhanby [mailto:mhanby at uab.edu]
>>> Sent: Thursday, February 18, 2010 3:00 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: [GE users] sgemaster keeps crashing 6.2u4
>>
>> I forgot: can you try to upgrade to 6.2u5?
>>
>> -- Reuti
>>
>>
>>> Howdy,
>>>
>>> I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster
>>> keeps crashing on this machine following a reboot:
>>>
>>> Here are 3 crashes over the past couple of hours (from the dmesg
>>> log):
>>>
>>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>>>
>>> And this is what is logged in $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:
>>>
>>> 02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
>>> 02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>> 02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>> 02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
>>> 02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>> 02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>> 02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
>>> 02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
>>> 02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>>> 02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>>
>>>
>>> Following previous reboots where this occurred, eventually it would
>>> stabilize and remain running for weeks.
>>>
>>> Any ideas what may be causing sgemaster to segfault?
>>>
>>> Thanks, Mike
>>>
>>> =================================
>>> Mike Hanby
>>> mhanby at uab.edu
>>> Information Systems Specialist II
>>> IT HPCS / Research Computing
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245098
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


