[GE users] qmaster SEGVs

heywood heywood at cshl.edu
Tue May 4 18:51:08 BST 2010


Below is the part of .../spool/qmaster/messages covering three qmaster
starts and crashes. There is no useful information there, but
/var/log/messages does show the seg faults.
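
In case it's useful to anyone, here is a minimal Python sketch for
pulling those segfault records out of the syslog. The regex is an
assumption based on the x86_64 kernel lines quoted below in this thread;
adjust it to your syslog format.

    #!/usr/bin/env python
    # Minimal sketch: find sge_qmaster segfault records in /var/log/messages.
    # The pattern assumes the x86_64 kernel format seen in this thread, e.g.
    #   sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 ...
    import re

    SEGV = re.compile(r"sge_qmaster\[(?P<pid>\d+)\]: segfault at "
                      r"(?P<addr>[0-9a-f]+) rip (?P<rip>[0-9a-f]+)")

    with open("/var/log/messages") as log:
        for line in log:
            m = SEGV.search(line)
            if m:
                print("pid %(pid)s faulted at 0x%(addr)s (rip 0x%(rip)s)"
                      % m.groupdict())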

There were no tightly integrated parallel jobs at the time.

Todd

---

05/03/2010 12:01:02|  main|bhmnode2|I|starting up GE 6.2u5 (lx24-amd64)
05/03/2010 12:01:03|worker|bhmnode2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
05/03/2010 12:01:03|worker|bhmnode2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
05/03/2010 12:04:00|  main|bhmnode2|I|read job database with 1335 entries in
1 seconds
05/03/2010 12:04:03|  main|bhmnode2|E|error opening file
"/opt/sge/default/spool/qmaster/./sharetree" for reading: No such file or
directory
05/03/2010 12:04:03|  main|bhmnode2|I|qmaster hard descriptor limit is set
to 65536
05/03/2010 12:04:03|  main|bhmnode2|I|qmaster soft descriptor limit is set
to 65536
05/03/2010 12:04:03|  main|bhmnode2|I|qmaster will use max. 65516 file
descriptors for communication
05/03/2010 12:04:03|  main|bhmnode2|I|qmaster will accept max. 99 dynamic
event clients
05/03/2010 12:04:03|  main|bhmnode2|I|starting up GE 6.2u5 (lx24-amd64)
05/03/2010 12:04:04|worker|bhmnode2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
05/03/2010 12:04:04|worker|bhmnode2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
05/03/2010 12:04:04|worker|bhmnode2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
05/03/2010 12:05:57|  main|bhmnode2|I|read job database with 1333 entries in
1 seconds
05/03/2010 12:05:59|  main|bhmnode2|E|error opening file
"/opt/sge/default/spool/qmaster/./sharetree" for reading: No such file or
directory
05/03/2010 12:05:59|  main|bhmnode2|I|qmaster hard descriptor limit is set
to 65536
05/03/2010 12:05:59|  main|bhmnode2|I|qmaster soft descriptor limit is set
to 65536
05/03/2010 12:05:59|  main|bhmnode2|I|qmaster will use max. 65516 file
descriptors for communication
05/03/2010 12:05:59|  main|bhmnode2|I|qmaster will accept max. 99 dynamic
event clients
05/03/2010 12:05:59|  main|bhmnode2|I|starting up GE 6.2u5 (lx24-amd64)




On 5/4/10 10:25 AM, "andy" <andy.schwierskott at sun.com> wrote:

> Hi,
> 
> Do you have PE jobs running when this happens? Are they tightly
> integrated parallel jobs?
> 
> What messages do you see in the qmaster messages file (or in
> /tmp/qmaster_messages.<pid>)?
> 
> Andy
> 
> 
> 
> On Tue, 4 May 2010, mhanby wrote:
> 
>> I haven't found any solution. My SEGVs started in 6.2u4 and continued
>> after upgrading to 6.2u5.
>> 
>> For me, it always seems to happen following a reboot. After several
>> crashes, it seems to stabilize for a while (days, weeks) before it
>> starts again.
>> 
>> My workaround is to use Nagios with an event handler that starts it back
>> up whenever it isn't running (a sketch of the idea follows).
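>>
>> A minimal sketch of the restart-if-dead idea in Python (the init script
>> path is an assumption for a default /opt/sge install; substitute your
>> own, and wire the script up as a Nagios event handler or cron job):
>>
>>     #!/usr/bin/env python
>>     # Minimal sketch: restart sge_qmaster if it is not running.
>>     import subprocess
>>
>>     def qmaster_running():
>>         # pgrep -x exits 0 when at least one exactly-named process exists.
>>         return subprocess.call(["pgrep", "-x", "sge_qmaster"]) == 0
>>
>>     if not qmaster_running():
>>         # Assumed path: $SGE_ROOT/$SGE_CELL/common/sgemaster
>>         subprocess.call(["/opt/sge/default/common/sgemaster", "start"])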
>> 
>> -----Original Message-----
>> From: heywood [mailto:heywood at cshl.edu]
>> Sent: Monday, May 03, 2010 12:51 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] qmaster SEGVs
>> 
>> We rebooted the node running qmaster, and we are now also getting qmaster
>> crashes. I see in the archive another thread, "sgemaster keeps crashing
>> 6.2u4", from February that apparently describes the same issue. After a
>> number of crashes I got qmaster to keep running (for now!).
>> 
>> We are running 6.2u5 with RHEL4.
>> 
>> I guess there is no solution/resolution?
>> 
>> Todd
>> 
>> 
>> sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6163]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6573]: segfault at 0000000000000000 rip 00000000005bf6c7 rsp
>> 0000000047809ec0 error 4
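>>
>> For what it's worth, the third rip (0x5bf6c7) is low enough that it
>> probably falls inside the sge_qmaster binary itself rather than a shared
>> library, so addr2line can name the faulting function if the build has
>> symbols and is not position-independent. A sketch, with the binary path
>> assumed from our $SGE_ROOT and the lx24-amd64 arch string in the logs:
>>
>>     #!/usr/bin/env python
>>     # Minimal sketch: map the third fault's rip to a function name.
>>     import subprocess
>>
>>     BINARY = "/opt/sge/bin/lx24-amd64/sge_qmaster"  # assumed install path
>>     RIP = "0x5bf6c7"  # low address: likely inside the main binary
>>
>>     # addr2line: -f prints the function name, -e selects the executable.
>>     subprocess.call(["addr2line", "-f", "-e", BINARY, RIP])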
>> 
>> On 3/17/10 12:14 PM, "abrookfield" <a.brookfield at sheffield.ac.uk> wrote:
>> 
>>> I'm also having problems with qmaster SEGVs in 6.2u5, running on RHEL5,
>>> x86_64.
>>> 
>>> Crashes seem to be correlated with users deleting jobs, particularly
>>> (but not exclusively) OpenMPI parallel jobs that have been running for
>>> 'a while'. Other than updating to u5, we've not made any config changes
>>> to our setup.
>>> 
>> 
> 
