[GE users] sgemaster keeps crashing 6.2u4

reuti reuti at staff.uni-marburg.de
Sun Feb 21 12:46:50 GMT 2010


Hi,

On 19.02.2010 at 16:42, mhanby wrote:

> For some reason, this time sge_qmaster is on a roll, 5 more crashes  
> since 3PM yesterday
>
> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
> sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4
> sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4
> sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0
> sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4
>
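
The seven dmesg lines above contain only two distinct rip values: 58066d (five crashes) and 580b1d (two crashes). On x86, page-fault error code 4 means a user-mode read of an unmapped page. A minimal Python sketch of that grouping follows; the names group_by_rip, SEGFAULT, GPF and DMESG are made up for illustration and are not part of any Grid Engine tool, and DMESG simply repeats the quoted lines with the mail wrapping undone.

```python
import re
from collections import defaultdict

# The two line shapes that appear in the quoted dmesg output.
SEGFAULT = re.compile(
    r"\[(?P<pid>\d+)\]: segfault at\s+(?P<addr>[0-9a-f]+)\s+rip\s+(?P<rip>[0-9a-f]+)"
)
GPF = re.compile(
    r"\[(?P<pid>\d+)\] general protection rip:(?P<rip>[0-9a-f]+)"
)

def group_by_rip(lines):
    """Map each faulting instruction pointer to the PIDs that crashed there."""
    crashes = defaultdict(list)
    for line in lines:
        match = SEGFAULT.search(line) or GPF.search(line)
        if match:
            crashes[int(match.group("rip"), 16)].append(int(match.group("pid")))
    return dict(crashes)

# Lines copied from the message, rejoined where the mail client wrapped them.
DMESG = [
    "sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0",
    "sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4",
    "sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4",
    "sge_qmaster[543]: segfault at 000000006565726a rip 0000000000580b1d rsp 000000004853ca70 error 4",
    "sge_qmaster[7349]: segfault at 000000000000000f rip 000000000058066d rsp 000000004858c8b0 error 4",
    "sge_qmaster[8148] general protection rip:58066d rsp:482668b0 error:0",
    "sge_qmaster[800]: segfault at 000000000000001e rip 000000000058066d rsp 0000000047a0b8b0 error 4",
]

if __name__ == "__main__":
    for rip, pids in sorted(group_by_rip(DMESG).items()):
        print(f"rip {rip:#x}: {len(pids)} crash(es), pids {pids}")
```

Two recurring addresses suggest the same code paths fail each time rather than random memory corruption, which is useful detail for a bug report.
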
> I now have Nagios monitoring the sge_qmaster processes to alert me  
> when it's missing.
>
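
For readers without such a check in place, here is a minimal sketch of a process check that follows the Nagios plugin exit-code convention (0 = OK, 2 = CRITICAL). It is not the check mentioned above. It assumes a Linux /proc filesystem and reads the Name: line of /proc/<pid>/status; find_pids, check and main are hypothetical names.

```python
import os

OK, CRITICAL = 0, 2  # Nagios plugin exit codes for "fine" and "alert"

def find_pids(name, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/status reports `name` (Linux only)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                first = fh.readline()  # e.g. "Name:\tsge_qmaster"
        except OSError:
            continue  # the process exited while we were scanning
        if first.startswith("Name:") and first.split(":", 1)[1].strip() == name:
            pids.append(int(entry))
    return sorted(pids)

def check(name, proc_root="/proc"):
    """Return (exit_code, message) in the usual one-line plugin style."""
    pids = find_pids(name, proc_root)
    if pids:
        return OK, f"OK - {name} running (pid {', '.join(map(str, pids))})"
    return CRITICAL, f"CRITICAL - {name} is not running"

def main():
    """Print the status line and return the exit code for the caller."""
    status, message = check("sge_qmaster")
    print(message)
    return status
```

To use it as a plugin, finish the script with `sys.exit(main())` after importing sys, so that the monitoring system sees the exit code.
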
> What is the effect on the existing 'running' jobs if they complete  
> during the time when qmaster is dead? Will the jobs be reported as  
> completing successfully once qmaster starts and processes the backlog?
>
> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Thursday, February 18, 2010 3:00 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] sgemaster keeps crashing 6.2u4

I forgot: can you try to upgrade to 6.2u5?

-- Reuti


> Howdy,
>
> I have GE 6.2u4 installed on a CentOS 5.4 x86_64 server. sgemaster  
> keeps crashing on this machine following a reboot:
>
> Here are 3 crashes over the past couple of hours (from the dmesg log):
>
> sge_qmaster[5004] general protection rip:58066d rsp:487b38b0 error:0
> sge_qmaster[10453]: segfault at 00002aaa0000001f rip 000000000058066d rsp 00000000482c48b0 error 4
> sge_qmaster[11800]: segfault at 0000000000000070 rip 0000000000580b1d rsp 00000000481b9a70 error 4
>
> And this is what is logged in $SGE_ROOT/$SGE_CELL/spool/qmaster/messages:
>
> 02/18/2010 11:22:39|  main|cluster1|I|read job database with 40 entries in 0 seconds
> 02/18/2010 11:22:39|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 02/18/2010 11:22:39|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
> 02/18/2010 11:22:39|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
> 02/18/2010 11:22:39|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
> 02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
> 02/18/2010 11:39:28|  main|cluster1|I|read job database with 39 entries in 0 seconds
> 02/18/2010 11:39:28|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 02/18/2010 11:39:28|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
> 02/18/2010 11:39:28|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
> 02/18/2010 11:39:28|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
> 02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
> 02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object
> 02/18/2010 14:41:51|  main|cluster1|I|read job database with 42 entries in 0 seconds
> 02/18/2010 14:41:51|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
> 02/18/2010 14:41:51|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
> 02/18/2010 14:41:51|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
> 02/18/2010 14:41:51|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
> 02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)
>
>
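
Each entry in the messages excerpt above has five pipe-separated fields: timestamp, thread, host, a one-letter level (I, E and W appear here) and the text. The "starting up GE" entries mark each restart of qmaster. A rough Python sketch that pulls out the restart times and the gaps between them follows; Entry, parse_line, startups and SAMPLE are illustrative names, and SAMPLE holds a few lines from the excerpt with the mail wrapping undone.

```python
from datetime import datetime
from typing import NamedTuple

class Entry(NamedTuple):
    when: datetime
    thread: str
    host: str
    level: str
    text: str

def parse_line(line):
    """Split one 'timestamp|thread|host|level|text' line into an Entry."""
    stamp, thread, host, level, text = line.rstrip("\n").split("|", 4)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return Entry(when, thread.strip(), host, level, text)

def startups(entries):
    """Return the timestamps of all 'starting up ...' entries, in file order."""
    return [e.when for e in entries if e.text.startswith("starting up")]

# A few lines copied from the quoted messages file.
SAMPLE = [
    '02/18/2010 11:22:39|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
    '02/18/2010 11:39:28|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
    '02/18/2010 11:39:28|worker|cluster1|W|rule "default rule (spool dir)" in '
    'spooling context "flatfile spooling" failed writing an object',
    '02/18/2010 14:41:51|  main|cluster1|I|starting up GE 6.2u4 (lx26-amd64)',
]

if __name__ == "__main__":
    entries = [parse_line(line) for line in SAMPLE]
    starts = startups(entries)
    for earlier, later in zip(starts, starts[1:]):
        print(f"restart at {later}, {later - earlier} after the previous start")
    for e in entries:
        if e.level in ("E", "W"):
            print(f"{e.when} [{e.level}] {e.text}")
```

Matching these restart times against the dmesg crash entries shows how long qmaster stayed down after each segfault.
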
> Following previous reboots where this occurred, eventually it would  
> stabilize and remain running for weeks.
>
> Any ideas what may be causing sgemaster to segfault?
>
> Thanks, Mike
>
> =================================
> Mike Hanby
> mhanby at uab.edu
> Information Systems Specialist II
> IT HPCS / Research Computing
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245098
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245328

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].