[GE users] qmaster SEGVs

dom marco.donauer at sun.com
Wed May 5 06:27:59 BST 2010


Andy,

I have already started investigating this issue.
Most of the setups are different. My first step was to look for parallels, or
settings that have been made on all of the segfaulting clusters. But there is
one thing that appears again and again: the cluster is under load (normal
load), jobs are finishing, and the OS is some Enterprise Linux.
I have already received core dumps and log files from community users, but so
far I have not had enough time to investigate them.

Marco

On 04.05.2010 16:25, andy wrote:
> Hi,
>
> Do you have PE jobs running when this happens? Is it tightly integrated
> parallel jobs?
>
> What messages do you see in the qmaster messages file (or in
> /tmp/qmaster_messages.<pid>)?
>
> Andy
>
>
>
> On Tue, 4 May 2010, mhanby wrote:
>
>   
>> I haven't found any solution. My SEGVs happened in 6.2u4 and continued after upgrading to 6.2u5.
>>
>> For me, it seems to always happen following a reboot. After several crashes, it seems to stabilize for a while (days, weeks) before it starts again.
>>
>> My workaround is to use Nagios and event handlers to start it back up if it isn't running.
>>
>> -----Original Message-----
>> From: heywood [mailto:heywood at cshl.edu]
>> Sent: Monday, May 03, 2010 12:51 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] qmaster SEGVs
>>
>> We rebooted the node running qmaster, and we are now also getting qmaster
>> crashes. I see in the archive there is another thread "sgemaster keeps
>> crashing 6.2u4" from February which apparently is the same issue. After a
>> number of crashes I got qmaster to keep running (for now!).
>>
>> We are running 6.2u5 with RHEL4.
>>
>> I guess there is no solution/resolution?
>>
>> Todd
>>
>>
>> sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6163]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6573]: segfault at 0000000000000000 rip 00000000005bf6c7 rsp
>> 0000000047809ec0 error 4
>>
>> On 3/17/10 12:14 PM, "abrookfield" <a.brookfield at sheffield.ac.uk> wrote:
>>
>>     
>>> I'm also having problems with qmaster SEGVs in 6.2u5, running on RHEL5,
>>> x86_64.
>>>
>>> Crashes seem to be correlated with users deleting jobs, particularly (but not
>>> exclusively) OpenMPI parallel jobs which have been running for 'a while'.
>>> Other than updating to u5 we've not made any config changes to our setup.
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=249186
>>>
>>> To unsubscribe from this discussion, e-mail:
>>> [users-unsubscribe at gridengine.sunsource.net].

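For anyone comparing crashes across sites, the three kernel segfault lines Todd quoted above can be grouped by faulting instruction pointer. The following is only a sketch: the regular expression assumes the 2.6-era x86_64 message format exactly as it appears in this thread, the function names are made up for illustration, and the error-code decoding follows the usual x86 page-fault bit layout (bit 0 = protection violation, bit 1 = write, bit 2 = user mode).

```python
import re
from collections import Counter

# The log lines as quoted in this thread, including the ">>" quote markers
# and the mail client's line wrapping.
SAMPLE = """\
>> sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6163]: segfault at 0000000000000080 rip 00000039fa470560 rsp
>> 000000004780aa38 error 4
>> sge_qmaster[6573]: segfault at 0000000000000000 rip 00000000005bf6c7 rsp
>> 0000000047809ec0 error 4
"""

# Assumed message format; other kernel versions print "ip"/"sp" instead.
SEGV_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def parse_segfaults(text):
    """Return one dict per segfault message found in (possibly quoted) text."""
    # Drop leading mail quote markers, then undo the line wrapping.
    unquoted = re.sub(r"(?m)^\s*(?:>\s*)+", "", text)
    flat = " ".join(unquoted.split())
    events = []
    for m in SEGV_RE.finditer(flat):
        events.append({
            "prog": m.group("prog"),
            "pid": int(m.group("pid")),
            "addr": int(m.group("addr"), 16),  # faulting memory address
            "rip": m.group("rip"),             # instruction pointer
            "err": int(m.group("err")),
        })
    return events


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 1),  # False: page not present
        "write": bool(code & 2),                 # False: read access
        "user_mode": bool(code & 4),
    }


events = parse_segfaults(SAMPLE)
by_rip = Counter(e["rip"] for e in events)

for rip, count in by_rip.items():
    print(rip, count)
print(decode_error(events[0]["err"]))
```

On these three lines, two crashes share the same instruction pointer and the third faults elsewhere. All three are user-mode reads of an unmapped page at a very low address (0x80 and 0x0), which would be consistent with a NULL pointer dereference, though only the core dumps can confirm that.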
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=256212

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


