[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

ah_sunsource ahaupt at ifh.de
Mon May 31 08:56:29 BST 2010


Hi Marco,

at the "Oracle HPC Consortium" a colleague of mine got the information
that these segfaults only occur in self-built binaries. That's not true
in our case. I've replaced the binaries with the "courtesy" version of
6.2u5 and, unfortunately, the segfaults continue.

Cheers,
Andreas

On Tue, 2010-05-18 at 11:26 +0200, Marco Donauer wrote:
> Andreas,
> 
> thank you again, this time on the mailing list, for providing the debug
> and log files. I will have a look at them and hope to find something.
> 
> Thanks Marco
> 
> 
> On 18.05.2010 09:07, ah_sunsource wrote:
> > Hi Marco,
> >
> > I just scripted this, but it's simply impossible to run it for more than
> > a very short time. The debug output grows to several gigabytes
> > within minutes (even the qping output)! Our farm provides around 1750
> > CPU cores. That's what I would call medium size.
> >
> > Isn't there a way to get the needed data "cheaper"? I will now try to
> > bzip the output on the fly before storing it. I cannot make promises
> > this will work.
> >
> > Cheers,
> > Andreas
> >
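Compressing the debug stream on the fly, as proposed above, could look roughly like the following sketch. It is only an illustration under stated assumptions: run_debug_source is a hypothetical stand-in for the real command (the qmaster started in the foreground, or the qping dump), the output file name is made up, and the script falls back to gzip if bzip2 is not installed.

```shell
#!/bin/sh
# Sketch: store a very chatty debug stream compressed instead of as plain text.

# Hypothetical stand-in for the real debug source; replace the body with
# the actual command when using this for real.
run_debug_source() {
    i=0
    while [ "$i" -lt 1000 ]; do
        echo "debug line $i"
        i=$((i + 1))
    done
}

# Prefer bzip2, fall back to gzip; both accept -c (write to stdout)
# and -dc (decompress to stdout).
if command -v bzip2 >/dev/null 2>&1; then
    comp=bzip2
    ext=bz2
else
    comp=gzip
    ext=gz
fi

# Assumed output location, not taken from the thread.
out="${TMPDIR:-/tmp}/debug-capture.$$.$ext"

# Merge stderr into stdout so both streams end up in the compressed file.
run_debug_source 2>&1 | "$comp" -c > "$out"

# Look at the end of the capture without unpacking it to disk.
"$comp" -dc "$out" | tail -n 3
```

Only the compressed data ever reaches the disk, so the space needed depends on how well the debug text compresses. Whether the compressor can keep up with the output rate of a busy qmaster is a separate question that this sketch does not answer.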
> > On Fri, 2010-04-16 at 15:03 +0200, dom wrote:
> >   
> >> Dave,
> >>
> >> your core dumps showed me that the problem occurs through an incomplete
> >> or broken list structure.
> >> But it is very hard to find the reason when we are not able to reproduce
> >> the problem. It could be something within the qmaster, the scheduler
> >> thread, or the spooling.
> >>
> >> @all:
> >> So could everybody who sees this issue, and for whom it's possible to
> >> restart or start the qmaster with debugging, please do the following?
> >>
> >> # source $SGE_ROOT/util/dl.csh   (or ". $SGE_ROOT/util/dl.sh" for sh)
> >> # dl 3
> >> # start the qmaster binary directly as root. The qmaster won't daemonize
> >> then and a lot of output will be produced.
> >>
> >> I need this output
> >>
> >> The scheduler monitoring could also be helpful. You can turn it on by
> >> running qconf -msconf and setting the "params" entry to MONITOR=1. The
> >> file $SGE_CELL/common/schedule will then be created; it contains the
> >> scheduler thread output.
> >>
> >> I need this file too.
> >>
> >> To get information about the communication, please do the following:
> >>
> >> export SGE_QPING_OUTPUT_FORMAT="s:12"
> >>
> >> exec on masterhost as root:
> >> bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1
> >>
> >> The qping output could also help to find something.
> >>
> >> In addition, any messages files could help us find a starting point or
> >> a hint about the cause of this behaviour, and enable us to reproduce
> >> the problem.
> >>
> >> Regards,
> >> Marco
> >>
> >>
> >>
> >> On 16.04.2010 13:16, fx wrote:
> >>     
> >>> dom <marco.donauer at sun.com> writes:
> >>>
> >>>   
> >>>       
> >>>> Currently I have no hint where and how I could step into this problem.
> >>>>     
> >>>>         
> >>> I think it has to be done systematically -- figuring out how the list
> >>> structures become invalid.  If that's not clear from the core dump, we
> >>> need to instrument the program somehow to try to find where it happens.
> >>> From experience I'm just surprised it's a new sort of bug that's not
> >>> familiar to the developers, so I guess it's due to recent architectural
> >>> changes.
> >>>
> >>> I'm willing to do a reasonable amount of work on this, or provide access
> >>> to our cluster, though I'm not sure whether that will help, as we can't
> >>> keep a debugging session open for long on a crashed qmaster.
> >>>
> >>>
> >>>       
> >> ------------------------------------------------------
> >> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253662
> >>
> >> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >>     
-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=260117
