[GE users] sgemaster keeps crashing 6.2u4

mhanby mhanby at uab.edu
Thu Feb 18 21:00:14 GMT 2010


Howdy,

I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster keeps crashing on this machine following a reboot.

Here are three crashes from the past couple of hours (taken from the dmesg log):

sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
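
For anyone reading these kernel lines, here is a minimal sketch of how the segfault entries could be picked apart. It assumes the line format is exactly what the two "segfault at" lines above show, and that the trailing "error" field is the x86 page-fault error code printed in hex (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The names parse_segfault and decode_error are made up for this illustration and are not part of Grid Engine.

```python
import re

# Pattern built only from the two "segfault at" lines quoted above;
# other kernel versions may print these messages differently.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]:? segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return the parsed fields of a segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "error": decode_error(int(m.group("err"), 16)),
    }


if __name__ == "__main__":
    dmesg_lines = [
        "sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0",
        "sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4",
        "sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4",
    ]
    for line in dmesg_lines:
        info = parse_segfault(line)
        if info is None:
            # The "general protection" line uses a different layout and is skipped here.
            continue
        print(info["pid"], hex(info["rip"]), hex(info["addr"]), info["error"])
```

Read this way, "error 4" on both segfault lines means a user-mode read of an unmapped address. The first two crashes also report the same rip (58066d), so they died at the same instruction. If a sge_qmaster binary with symbols is available, those rip values could be handed to a symbolizer such as addr2line to find the function involved.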

And this is what is logged in $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:

02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
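
To line the qmaster restarts up against the warnings and errors around them, the messages file can be summarized with a short script. This is only a sketch: it assumes every entry has the five pipe-separated fields seen in the excerpt above (timestamp, thread, host, level, text), that a restart is marked by a "starting up GE" line, and that "E" and "W" are the levels worth flagging. parse_entry and summarize are hypothetical helper names.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "timestamp thread host level text")


def parse_entry(line):
    """Split one messages line into its five fields, or return None if malformed."""
    parts = line.rstrip("\n").split("|", 4)  # the text itself may contain '|'
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    return Entry(timestamp, thread.strip(), host, level, text)


def summarize(lines):
    """Return (startup timestamps, list of E/W entries) in file order."""
    startups = []
    problems = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        if entry.text.startswith("starting up GE"):
            startups.append(entry.timestamp)
        elif entry.level in ("E", "W"):  # assumption: only these levels matter here
            problems.append(entry)
    return startups, problems


if __name__ == "__main__":
    sample = [
        '02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
        '02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in '
        'spooling context "flatfile spooling" failed writing an object',
        '02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
    ]
    startups, problems = summarize(sample)
    print("restarts:", startups)
    for entry in problems:
        print(entry.timestamp, entry.thread, entry.level, entry.text)
```

On the real file, the same functions could be fed open(path) for the messages path given above. In the excerpt, the only entry that is not repeated on every startup is the flatfile spooling warning from the worker thread at 11:39:28.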


After previous reboots where this occurred, it eventually stabilized and then stayed up for weeks.

Any ideas what may be causing sgemaster to segfault?

Thanks, Mike 

=================================
Mike Hanby
mhanby at uab.edu
Information Systems Specialist II
IT HPCS / Research Computing

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245098