[GE users] qmaster SEGVs

andy andy.schwierskott at sun.com
Tue May 4 15:25:38 BST 2010


Hi,

Do you have PE jobs running when this happens? Are they tightly integrated
parallel jobs?

What messages do you see in the qmaster messages file (or in
/tmp/qmaster_messages.<pid>)?
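
A minimal sketch for collecting the last lines of those files so they can be posted to the list. It assumes the default spool layout ($SGE_ROOT/$SGE_CELL/spool/qmaster/messages); if your qmaster spool directory is configured elsewhere, adjust the path. The function names and the /opt/sge fallback are hypothetical.

```python
import glob
import os
from collections import deque


def candidate_message_files(sge_root, sge_cell="default", tmp_dir="/tmp"):
    """Return the qmaster message files that exist, spool file first."""
    # Assumption: default spool layout; adjust if qmaster_spool_dir differs.
    spool = os.path.join(sge_root, sge_cell, "spool", "qmaster", "messages")
    # Fallback files of the form /tmp/qmaster_messages.<pid>.
    fallback = sorted(glob.glob(os.path.join(tmp_dir, "qmaster_messages.*")))
    return [p for p in [spool] + fallback if os.path.isfile(p)]


def tail(path, n=50):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as fh:
        return list(deque(fh, maxlen=n))


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT", "/opt/sge")  # hypothetical default
    cell = os.environ.get("SGE_CELL", "default")
    for path in candidate_message_files(root, cell):
        print("==>", path)
        print("".join(tail(path)), end="")
```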

Andy



On Tue, 4 May 2010, mhanby wrote:

> I haven't found a solution. My SEGVs happened in 6.2u4 and continued after upgrading to 6.2u5.
>
> For me, it seems to always happen following a reboot. After several crashes, it seems to stabilize for a while (days, weeks) before it starts again.
>
> My workaround is to use Nagios and event handlers to start it back up if it isn't running.
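
For readers who want to try the same workaround, here is a rough sketch of what such an event handler could look like. It is not the poster's actual script. It assumes Nagios is configured to pass $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as the three arguments, and that the startup script lives at $SGE_ROOT/$SGE_CELL/common/sgemaster; the function names and the /opt/sge fallback are hypothetical.

```python
import os
import subprocess
import sys


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should restart qmaster."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # SOFT state: wait for a few failed checks before acting.
    return state_type == "SOFT" and int(attempt) == soft_attempt


def restart_qmaster(sge_root, sge_cell="default"):
    """Run the sgemaster startup script (path is an assumption)."""
    script = os.path.join(sge_root, sge_cell, "common", "sgemaster")
    return subprocess.call([script, "start"])


if __name__ == "__main__" and len(sys.argv) == 4:
    if should_restart(*sys.argv[1:4]):
        root = os.environ.get("SGE_ROOT", "/opt/sge")  # hypothetical default
        sys.exit(restart_qmaster(root, os.environ.get("SGE_CELL", "default")))
```

Restarting only on a HARD state or on the third SOFT attempt avoids bouncing the daemon on a single failed check. Note that this only hides the crash; it does nothing about the cause.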
>
> -----Original Message-----
> From: heywood [mailto:heywood at cshl.edu]
> Sent: Monday, May 03, 2010 12:51 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qmaster SEGVs
>
> We rebooted the node running qmaster, and we are now also getting qmaster
> crashes. I see in the archive there is another thread "sgemaster keeps
> crashing 6.2u4" from February which apparently is the same issue. After a
> number of crashes I got qmaster to keep running (for now!).
>
> We are running 6.2u5 with RHEL4.
>
> I guess there is no solution/resolution?
>
> Todd
>
>
> sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 rsp 000000004780aa38 error 4
> sge_qmaster[6163]: segfault at 0000000000000080 rip 00000039fa470560 rsp 000000004780aa38 error 4
> sge_qmaster[6573]: segfault at 0000000000000000 rip 00000000005bf6c7 rsp 0000000047809ec0 error 4
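
The kernel lines above carry more information than they first appear to. The following sketch parses one such line and decodes the x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The regular expression matches the format shown in this thread and is an assumption for other kernel versions; the names SEGV_RE and decode are hypothetical.

```python
import re

# Matches lines like:
#   sge_qmaster[5851]: segfault at 0000000000000080 rip ... rsp ... error 4
SEGV_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode(line):
    """Parse a segfault log line; return None if it does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    print(decode(
        "sge_qmaster[5851]: segfault at 0000000000000080 "
        "rip 00000039fa470560 rsp 000000004780aa38 error 4"
    ))
```

Error 4 therefore means a user-mode read of an unmapped page. Combined with fault addresses of 0x80 and 0x0, that is consistent with a NULL pointer dereference, although only a core dump or backtrace would confirm where it happens.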
>
> On 3/17/10 12:14 PM, "abrookfield" <a.brookfield at sheffield.ac.uk> wrote:
>
> > I'm also having problems with qmaster SEGVs in 6.2u5, running on RHEL5,
> > x86_64.
> >
> > Crashes seem to be correlated with users deleting jobs, particularly (but not
> > exclusively) OpenMPI parallel jobs which have been running for 'a while'.
> > Other than updating to u5 we've not made any config changes to our setup.
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=249186
> >
> > To unsubscribe from this discussion, e-mail:
> > [users-unsubscribe at gridengine.sunsource.net].
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255955
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=256103
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=256104