[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

mhanby mhanby at uab.edu
Thu Mar 25 14:25:49 GMT 2010



I reported this behavior as well. Our head node has been up for 27 days, and sge_qmaster has segfaulted as follows:

Days 1-2: 27 segfaults
Days 3-25: 0 segfaults
Days 26-27: 5 segfaults

The most recent happened when I ran:

watch -d qstat -u mikeh -r

I think the qstat command ran successfully once, and on the second run it reported the commlib error indicating that sge_qmaster was not running.
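
To pin down exactly which run fails and when, something like the following could stand in for watch. It is only a sketch: first_failure is a made-up helper name, and it assumes qstat exits non-zero when it cannot reach the qmaster, which I have not verified for every failure mode.

```python
import subprocess
import sys
import time


def first_failure(cmd, max_runs=10, interval=2.0):
    """Run cmd repeatedly; return (run_number, timestamp, stderr) for the
    first run that exits non-zero, or None if every run succeeds.

    Assumption: a non-zero exit status means the command could not do its
    job (for qstat, e.g. a commlib error because the qmaster is down).
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            return run, stamp, result.stderr.strip()
        time.sleep(interval)
    return None


# Example call (needs qstat on PATH), mirroring the watch command above:
#   first_failure(["qstat", "-u", "mikeh", "-r"])
```

The timestamp of the first failing run can then be compared against the segfault lines in dmesg or the qmaster messages file.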

Mike

-----Original Message-----
From: fx [mailto:d.love at liverpool.ac.uk] 
Sent: Thursday, March 25, 2010 6:17 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

ah_sunsource <ahaupt at ifh.de> writes:

> Hi *,
>
> yesterday afternoon our SGE master started segfaulting again and again
> "out of the blue". No changes to the configuration have been done for
> weeks... Is there anyone else who has already seen this (output of
> dmesg)? :

I don't think the precise output is relevant but, yes, see recent
postings here from me and others, and issue #3251.  You're lucky if it's
stopped -- it hasn't here.  I suspect it's some particular sort of job
that was in the system at some stage with you and is in ours all the
time now, but I don't have any good guesses about what sort.  Currently
I'm just running qmaster under monit with a short check time, as
fortunately we don't have a high throughput.
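
For anyone without monit to hand, the same idea can be sketched in a few lines of Python. This is not the monit configuration in use here, just an illustration of "poll often, restart if gone". The helper names (pid_running, watchdog) are made up, the check relies on pgrep -x being available, and the restart command shown in the comment is a hypothetical path that will differ per installation.

```python
import subprocess
import time


def pid_running(name):
    """True if a process with exactly this name exists (via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def watchdog(is_running, restart, checks, interval=10.0, sleep=time.sleep):
    """Poll is_running() `checks` times, calling restart() whenever it
    returns False. Returns the number of restarts performed."""
    restarts = 0
    for _ in range(checks):
        if not is_running():
            restart()
            restarts += 1
        sleep(interval)
    return restarts


# Example wiring (hypothetical init script path, adjust for your site):
#   watchdog(lambda: pid_running("sge_qmaster"),
#            lambda: subprocess.call(["/etc/init.d/sgemaster", "start"]),
#            checks=8640, interval=10.0)
```

A short interval only makes sense on a low-throughput cluster like ours, since every crash still loses whatever the qmaster was doing at the time.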

I'd be very grateful for any debugging hints from developers on how to
debug the corrupt list entries.  I've found it difficult to get to grips
with the code base in the time I've had to look at it so far.  From
experience with similar things, I suspect it's about a week's work to
get to the bottom of it, and I don't have that time.  (I can supply core
dumps and/or a binary compiled `-g -O0' on RedHat 5 if anyone else with
the problem would like to take a look.)

-- 
(Dr) Dave Love
E-Science, Computing Services Department, University of Liverpool
AKA fx at gnu.org

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251318

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251327
