[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults
ahaupt at ifh.de
Tue May 18 08:07:54 BST 2010
I just scripted this, but it's simply impossible to run it for more than
a very short time. The debug output grows to several gigabytes within
minutes (even the qping output)! Our farm provides around 1750 CPU
cores. That's what I would call medium size.
Isn't there a way to get the needed data "cheaper"? I will now try to
compress the output with bzip2 on the fly before storing it. I cannot
promise this will work.
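A minimal sketch of the on-the-fly compression, assuming the debug trace
arrives on the command's stdout and/or stderr. The function name
capture_compressed, the COMPRESSOR variable and the output path are
hypothetical, chosen only for illustration:

```shell
#!/bin/sh
# capture_compressed OUTFILE COMMAND [ARGS...]
# Runs COMMAND, merges its stderr into stdout, and pipes the combined
# stream through a compressor so only compressed data reaches the disk.
# COMPRESSOR is a hypothetical knob; it defaults to bzip2.
capture_compressed() {
    outfile=$1
    shift
    "$@" 2>&1 | ${COMPRESSOR:-bzip2} -c > "$outfile"
}

# Hypothetical usage with the qping call quoted below
# (placeholders kept exactly as given in the quoted mail):
#   capture_compressed /var/tmp/qping.dump.bz2 \
#       bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1
```

The compressor runs in the same pipeline as the traced command, so it
costs CPU time on the master host. Whether that is acceptable on a busy
qmaster is untested here.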
On Fri, 2010-04-16 at 15:03 +0200, dom wrote:
> your core dumps showed me that the problem occurs through an incomplete
> or broken list structure.
> But it is very hard to find the reason when we are not able to reproduce
> the problem. It could be something within the qmaster, the scheduler
> thread, or the spooling.
> So could everybody who sees this issue, and where it's possible to
> restart or start the qmaster with debugging, please do the following?
> # source $SGE_ROOT/util/dl.csh (or ". $SGE_ROOT/util/dl.sh" for sh)
> # dl 3
> # start the qmaster binary directly as root. The qmaster won't daemonize
> then, and a lot of output will be produced.
> I need this output.
> Then the scheduler monitoring could be helpful. You can turn it on using
> qconf -msconf and setting the "params" entry to MONITOR=1. The file
> $SGE_CELL/common/schedule will then be created; it contains the
> scheduler thread output.
> I need this file too.
> To get information about the communication, please do the following:
> export SGE_QPING_OUTPUT_FORMAT="s:12"
> Execute on the master host as root:
> bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1
> The qping output could also help to find something.
> Also, all kinds of messages files could help us find a starting point,
> or any hint about the reason for this behaviour, and enable us to
> reproduce the problem.
> On 16.04.2010 13:16, fx wrote:
> > dom <marco.donauer at sun.com> writes:
> >> Currently I have no hint where and how I could step into this problem.
> > I think it has to be done systematically -- figuring out how the list
> > structures become invalid. If that's not clear from the core dump, we
> > need to instrument the program somehow to try to find where it happens.
> > From experience I'm just surprised it's a new sort of bug that's not
> > familiar to the developers, so I guess it's due to recent architectural
> > changes.
> > I'm willing to do a reasonable amount of work on this, or provide access
> > to our cluster, though I'm not sure whether that will help, as we can't
> > keep a debugging session open for long on a crashed qmaster.
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
| Andreas Haupt | E-Mail: andreas.haupt at desy.de
| DESY Zeuthen | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6 | Phone: +49/33762/7-7359
| D-15738 Zeuthen | Fax: +49/33762/7-7216