[GE users] sgemaster keeps crashing 6.2u4

mhanby mhanby at uab.edu
Thu Feb 25 17:57:04 GMT 2010


Howdy, we upgraded the cluster to GE 6.2u5 and sge_qmaster continues to crash.

Is there any way to increase the logging level in
$SGE_ROOT/$SGE_CELL/spool/qmaster/messages
to get a better idea of what is leading up to the crash?

I have schedd_job_info set to true. I'll try setting it to false to see if it makes a difference (although having it enabled is very handy for debugging).
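
Independent of the log level, it can help to pull out whatever qmaster wrote just before each restart. Below is a minimal Python sketch, assuming the messages file keeps the pipe-separated layout quoted further down in this thread (timestamp|thread|host|level|text) and that "read job database" is the first entry logged at each startup. That marker, and the names parse_line and entries_before_restarts, are illustrative assumptions, not part of Grid Engine.

```python
from datetime import datetime

# Sample entries copied from the log quoted in this thread (one entry per line).
SAMPLE = """\
02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
"""


def parse_line(line):
    """Split one messages entry into its five fields; return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, text = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "thread": thread.strip(), "host": host,
            "level": level, "text": text}


def entries_before_restarts(lines, marker="read job database", context=3):
    """For every startup after the first, collect the `context` entries logged before it.

    `marker` is an assumption: the text taken to signal a qmaster startup.
    """
    entries = [e for e in map(parse_line, lines) if e is not None]
    groups = []
    seen_first_start = False
    for i, entry in enumerate(entries):
        if marker in entry["text"]:
            if seen_first_start:
                groups.append(entries[max(0, i - context):i])
            seen_first_start = True
    return groups


# Show what was logged right before each restart in the sample.
for group in entries_before_restarts(SAMPLE.splitlines(), context=1):
    for entry in group:
        print(entry["time"], entry["thread"], entry["level"], entry["text"])
```

To run it against the live file, read the lines from $SGE_ROOT/$SGE_CELL/spool/qmaster/messages instead of SAMPLE and raise context as needed.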

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Sunday, February 21, 2010 10:05 AM
To: users at gridengine.sunsource.net
Cc: users at gridengine.sunsource.net
Subject: Re: [GE users] sgemaster keeps crashing 6.2u4

I may be able to, although this is a Rocks cluster, so I'll have to
think about how to do it and still keep the compute nodes in sync.

On Feb 21, 2010, at 6:44, "reuti" <reuti at staff.uni-marburg.de> wrote:

> Hi,
>
> Am 19.02.2010 um 16:42 schrieb mhanby:
>
>> For some reason, this time sge_qmaster is on a roll, 5 more crashes
>> since 3PM yesterday
>>
>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>> sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
>> sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
>> sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
>> sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
>>
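
One thing that stands out in the kernel messages above is that the instruction pointer (rip) repeats. A minimal Python sketch for tallying the faults by rip follows, assuming each kernel message sits on a single line as it would in raw dmesg output. The regular expressions are written only against the two message shapes quoted in this thread, and the function names are illustrative.

```python
import re
from collections import Counter

# The two message shapes that appear in this thread.
SEGFAULT = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]:? segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)
GP_FAULT = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]:? general protection "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<err>\d+)"
)

# The seven messages quoted above, one per line.
SAMPLE = """\
sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
"""


def parse_crashes(text):
    """Return one dict per recognised kernel fault message."""
    crashes = []
    for line in text.splitlines():
        for kind, pattern in (("segfault", SEGFAULT),
                              ("general protection", GP_FAULT)):
            match = pattern.search(line)
            if match:
                crashes.append({
                    "kind": kind,
                    "pid": int(match.group("pid")),
                    "rip": int(match.group("rip"), 16),
                    "err": int(match.group("err")),
                })
                break
    return crashes


def count_by_rip(crashes):
    """Count faults per instruction pointer, keyed by hex string."""
    return Counter(hex(c["rip"]) for c in crashes)


def describe_page_fault(err):
    """Decode the x86 page-fault error bits (applies to the segfault lines only)."""
    mode = "user-mode" if err & 4 else "kernel-mode"
    access = "write" if err & 2 else "read"
    cause = "protection violation" if err & 1 else "page not present"
    return f"{mode} {access}, {cause}"


for rip, count in count_by_rip(parse_crashes(SAMPLE)).most_common():
    print(rip, count)
```

If the tally keeps landing on the same one or two addresses, that suggests a specific code path rather than random memory corruption, although confirming it would need a core file or a build with symbols.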
>> I now have Nagios monitoring the sge_qmaster processes to alert me
>> when it's missing.
>>
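
For anyone wanting a similar check, here is a minimal Python sketch of a Nagios-style plugin. It relies on the standard plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) and on pgrep being available on the qmaster host. The function names are illustrative, and this is not the check actually in use on the cluster described above.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate(pids, name="sge_qmaster"):
    """Turn a list of PIDs into a (exit code, status line) pair."""
    if pids:
        pid_list = ", ".join(str(p) for p in pids)
        return OK, f"OK - {name} running (pid {pid_list})"
    return CRITICAL, f"CRITICAL - {name} is not running"


def find_pids(name="sge_qmaster"):
    """Look up PIDs by exact process name using pgrep -x."""
    proc = subprocess.run(["pgrep", "-x", name],
                          capture_output=True, text=True)
    # pgrep exits 0 on a match and 1 when nothing matched; anything else is an error.
    if proc.returncode not in (0, 1):
        raise RuntimeError(proc.stderr.strip() or "pgrep failed")
    return [int(p) for p in proc.stdout.split()]


def main():
    """Print the status line and return the exit code Nagios expects."""
    try:
        code, message = evaluate(find_pids())
    except (OSError, RuntimeError) as exc:
        code, message = UNKNOWN, f"UNKNOWN - {exc}"
    print(message)
    return code

# When saved as a plugin script, finish with: raise SystemExit(main())
```

Keeping evaluate separate from the pgrep call makes the decision logic easy to test without a running qmaster.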
>> What is the effect on the existing 'running' jobs if they complete
>> during the time when qmaster is dead? Will the jobs be reported as
>> completing successfully once qmaster starts and processes the  
>> backlog?
>>
>> -----Original Message-----
>> From: mhanby [mailto:mhanby at uab.edu]
>> Sent: Thursday, February 18, 2010 3:00 PM
>> To: users at gridengine.sunsource.net
>> Subject: [GE users] sgemaster keeps crashing 6.2u4
>
> I forgot: can you try to upgrade to 6.2u5?
>
> -- Reuti
>
>
>> Howdy,
>>
>> I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster
>> keeps crashing on this machine following a reboot:
>>
>> Here are 3 crashes over the past couple of hours (from the dmesg log):
>>
>> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
>> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
>> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>>
>> And this is what is logged in $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:
>>
>> 02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
>> 02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>> 02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>> 02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
>> 02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>> 02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>> 02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
>> 02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
>> 02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
>> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
>> 02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>>
>>
>> Following previous reboots where this occurred, eventually it would
>> stabilize and remain running for weeks.
>>
>> Any ideas what may be causing sgemaster to segfault?
>>
>> Thanks, Mike
>>
>> =================================
>> Mike Hanby
>> mhanby at uab.edu
>> Information Systems Specialist II
>> IT HPCS / Research Computing
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?
>> dsForumId=38&dsMessageId=245098
>>
>> To unsubscribe from this discussion, e-mail: [users-
>> unsubscribe at gridengine.sunsource.net].


