[GE users] sgemaster keeps crashing 6.2u4

mhanby mhanby at uab.edu
Fri Feb 19 15:42:37 GMT 2010


For some reason sge_qmaster is on a roll this time: 5 more crashes since 3 PM yesterday.

sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
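
To see at a glance whether the crashes cluster at the same instruction, the dmesg lines above can be grouped by their rip value. The following is only a minimal sketch: the function names (parse_crash, count_by_rip, decode_page_fault_error) are made up for illustration, and the regular expressions assume exactly the two kernel message shapes quoted above. The error-code decoding follows the x86 page-fault error code bits and applies to the "segfault" lines only, not to the "general protection" ones.

```python
import re
from collections import Counter

# "sge_qmaster[10453]: segfault at <addr> rip <rip> rsp <rsp> error 4"
SEGFAULT_RE = re.compile(
    r"\[(?P<pid>\d+)\]:? segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)
# "sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0"
GP_RE = re.compile(
    r"\[(?P<pid>\d+)\]:? general protection "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<err>\d+)"
)


def parse_crash(line):
    """Return a dict describing one crash line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m:
        return {"kind": "segfault", "pid": int(m["pid"]),
                "addr": int(m["addr"], 16), "rip": int(m["rip"], 16),
                "error": int(m["err"])}
    m = GP_RE.search(line)
    if m:
        # General protection faults carry no faulting address in this format.
        return {"kind": "general protection", "pid": int(m["pid"]),
                "addr": None, "rip": int(m["rip"], 16),
                "error": int(m["err"])}
    return None


def count_by_rip(lines):
    """Count crashes per instruction pointer (hex string -> count)."""
    counts = Counter()
    for line in lines:
        crash = parse_crash(line)
        if crash is not None:
            counts[hex(crash["rip"])] += 1
    return counts


def decode_page_fault_error(err):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(err & 1),  # False: page not present
        "write": bool(err & 2),                 # False: read access
        "user_mode": bool(err & 4),             # True: fault in user space
    }
```

Fed the seven lines above, count_by_rip reports 5 crashes at 0x58066d and 2 at 0x580b1d, and "error 4" decodes to a user-mode read of a page that is not mapped.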

I now have Nagios monitoring the sge_qmaster process so that I am alerted when it is missing.
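
A process check of this kind could look roughly like the sketch below. It is an assumption about how such a check might be written, not the check actually in use here: the helper names (find_pids, evaluate, main) are hypothetical, it assumes pgrep is available on the qmaster host, and it relies only on the standard Nagios plugin convention of one status line plus an exit code (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def find_pids(name="sge_qmaster"):
    """Return the PIDs of processes whose name matches exactly (pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name],
                            capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def evaluate(pids, name="sge_qmaster"):
    """Map a list of PIDs to a (exit_code, status_line) pair."""
    if pids:
        pid_list = ", ".join(str(pid) for pid in pids)
        return OK, f"OK - {name} running (pid {pid_list})"
    return CRITICAL, f"CRITICAL - {name} is not running"


def main():
    """Print the status line and return the plugin exit code."""
    try:
        code, message = evaluate(find_pids())
    except OSError as exc:  # e.g. pgrep is not installed
        code, message = UNKNOWN, f"UNKNOWN - could not run pgrep: {exc}"
    print(message)
    return code
```

To use it as a plugin, call sys.exit(main()) at the bottom of the script so that Nagios sees the exit code.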

What is the effect on existing 'running' jobs if they complete while qmaster is dead? Will they be reported as having completed successfully once qmaster restarts and processes the backlog?

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Thursday, February 18, 2010 3:00 PM
To: users at gridengine.sunsource.net
Subject: [GE users] sgemaster keeps crashing 6.2u4

Howdy,

I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster keeps crashing on this machine following a reboot.

Here are 3 crashes over the past couple of hours (from the dmesg log):

sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4

And this is what is logged in $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:

02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)


Following previous reboots where this occurred, eventually it would stabilize and remain running for weeks.

Any ideas what may be causing sgemaster to segfault?

Thanks, Mike 

=================================
Mike Hanby
mhanby at uab.edu
Information Systems Specialist II
IT HPCS / Research Computing

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245098

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245179



