[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

dom marco.donauer at sun.com
Fri Apr 16 14:03:04 BST 2010


Dave,

your core dumps showed me that the problem is caused by an incomplete
or broken list structure.
But it is very hard to find the cause when we are not able to reproduce
the problem. It could be something within the qmaster, the scheduler
thread, or the spooling.

@all:
So could everybody who sees this issue, and for whom it is possible to
restart the qmaster or to start it with debugging, please do the following?

# source $SGE_ROOT/util/dl.csh (or: . $SGE_ROOT/util/dl.sh)
# dl 3
# start the qmaster binary directly as root. The qmaster won't daemonize
then and a lot of output will be produced.

I need this output.
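
One way to capture that output in a file is sketched below. This is only a sketch under assumptions: it assumes that $SGE_ROOT is set, that util/dl.sh defines a shell function named dl, that util/arch prints the architecture string, and that the binary lives at bin/<arch>/sge_qmaster. The helper names debug_log_name and run_qmaster_debug are made up for illustration and are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch: capture the qmaster debug output in a timestamped file.
# Nothing is started when this file is sourced; only functions are defined.

# Hypothetical helper: print a timestamped log file name in the given
# directory (default: /tmp).
debug_log_name() {
    dir=${1:-/tmp}
    printf '%s/qmaster-debug-%s.log\n' "$dir" "$(date +%Y%m%d-%H%M%S)"
}

# Hypothetical wrapper around the three steps from the mail.
# Run it as root on the master host, with the normal qmaster stopped.
run_qmaster_debug() {
    if [ -z "$SGE_ROOT" ]; then
        echo "SGE_ROOT must be set" >&2
        return 1
    fi
    . "$SGE_ROOT/util/dl.sh"              # assumed to define the dl function
    dl 3                                  # debug level requested in the mail
    arch=$("$SGE_ROOT/util/arch")         # assumed to print the <arch> string
    log=$(debug_log_name "${1:-/tmp}")
    echo "writing debug output to $log" >&2
    # Merge stderr into stdout so that all debug output ends up in the log.
    "$SGE_ROOT/bin/$arch/sge_qmaster" 2>&1 | tee "$log"
}
```

After sourcing the file, something like "run_qmaster_debug /var/tmp" would start the qmaster in the foreground and write a copy of everything it prints to /var/tmp. The log can become large, so pick a directory with enough free space.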

Then the scheduler monitoring could be helpful. You can turn it on with
qconf -msconf by setting the "params" entry to MONITOR=1. The file
$SGE_CELL/common/schedule will then be created; it contains the
scheduler thread output.

I need this file too.
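
To check that the setting took effect before waiting for the file, the scheduler configuration can be inspected with qconf -ssconf. The following is a minimal sketch, assuming that the output has one "name value" pair per line with the name at the start of the line; the function name monitor_enabled is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads scheduler configuration text on stdin (for
# example the output of "qconf -ssconf") and succeeds if the "params"
# line contains MONITOR=1.
monitor_enabled() {
    grep -E '^params[[:space:]]' | grep -q 'MONITOR=1'
}
```

Possible usage: "qconf -ssconf | monitor_enabled && echo monitoring is on". If it reports success and some scheduling runs have passed, the schedule file should be there to copy.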

To get information about the communication, please do the following:

export SGE_QPING_OUTPUT_FORMAT="s:12"

Run on the master host as root:
bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1

The qping output could also help to find something.
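
Since the placeholders in the qping line above have to be filled in by hand, a small helper that prints the finished command line for review before it is run as root may be useful. This is only a sketch: the function name qping_dump_command and its three arguments are made up for illustration, and it assumes $SGE_ROOT is set.

```shell
#!/bin/sh
# Hypothetical helper: print the qping command line from the mail with the
# master host name, the qmaster port and the <arch> string filled in.
# It only prints the command; it does not run qping.
qping_dump_command() {
    host=$1
    port=$2
    arch=$3
    printf '%s/bin/%s/qping -i 5 -dump %s %s qmaster 1\n' \
        "$SGE_ROOT" "$arch" "$host" "$port"
}
```

The printed line can then be run after the export above, with its output redirected to a file that can be sent along with the other logs.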

Also, all kinds of messages files could be helpful for finding a starting
point or any hint about the reason for this behaviour, and could enable us
to reproduce the problem.

Regards,
Marco



On 16.04.2010 13:16, fx wrote:
> dom <marco.donauer at sun.com> writes:
>
>   
>> Currently I have no hint where and how I could step into this problem.
>>     
> I think it has to be done systematically -- figuring out how the list
> structures become invalid.  If that's not clear from the core dump, we
> need to instrument the program somehow to try to find where it happens.
> From experience I'm just surprised it's a new sort of bug that's not
> familiar to the developers, so I guess it's due to recent architectural
> changes.
>
> I'm willing to do a reasonable amount of work on this, or provide access
> to our cluster, though I'm not sure whether that will help, as we can't
> keep a debugging session open for long on a crashed qmaster.
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253662
