[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

dom marco.donauer at sun.com
Tue May 18 10:26:40 BST 2010


Andreas,

thank you again, this time on the mailing list, for providing the debug and log files.
I will have a look at them and hope to find something.

Thanks,
Marco


On 18.05.2010 09:07, ah_sunsource wrote:
> Hi Marco,
>
> I just scripted this but it's simply impossible to run it for more than
> a very short time. The debug output grows to several gigabytes
> within minutes (even the qping output)! Our farm provides around 1750
> CPU cores. That's what I would call medium size.
>
> Isn't there a way to get the needed data "cheaper"? I will now try to
> bzip the output on the fly before storing it. I cannot make promises
> this will work.
>
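
The on-the-fly compression mentioned above could look roughly like the
following. This is only a minimal sketch, not a tested recipe for a
production qmaster: the function name capture_compressed and the COMPRESSOR
variable are made up for illustration, and it assumes the chosen compressor
reads stdin and writes stdout.

```shell
#!/bin/sh
# Sketch: run a command and compress its combined stdout/stderr on the fly,
# so the uncompressed debug stream is never written to disk.
# capture_compressed and COMPRESSOR are illustrative names (assumptions).
capture_compressed() {
    out=$1
    shift
    # COMPRESSOR defaults to "bzip2 -c"; "gzip -c" is faster but compresses less.
    # It is left unquoted on purpose so that "bzip2 -c" splits into two words.
    # Note: the pipeline's exit status is the compressor's, not the command's.
    "$@" 2>&1 | ${COMPRESSOR:-bzip2 -c} > "$out"
}
```

Usage would be something like
capture_compressed /some/big/disk/qmaster-debug.bz2 <command and arguments>,
where the output path is a placeholder. Whether bzip2 can keep up with
gigabytes per minute depends on the machine, so a faster compressor may be
the better trade-off.
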
> Cheers,
> Andreas
>
> On Fri, 2010-04-16 at 15:03 +0200, dom wrote:
>   
>> Dave,
>>
>> your core dumps showed me that the problem is caused by an incomplete
>> or broken list structure.
>> But it is very hard to find the reason when we are not able to reproduce
>> the problem. It could be something within the qmaster, the scheduler
>> thread, or the spooling.
>>
>> @all:
>> So could everybody who sees this issue, and for whom it is possible to
>> restart the qmaster or start it with debugging, please do the following?
>>
>> # source $SGE_ROOT/util/dl.csh   (or ". $SGE_ROOT/util/dl.sh" for sh-style shells)
>> # dl 3
>> # start the qmaster binary directly as root. The qmaster won't daemonize
>> then and a lot of output will be produced.
>>
>> I need this output
>>
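
The three debugging steps above can be sketched as a small sh wrapper. It is
a sketch under assumptions, not a verified procedure: it assumes that
$SGE_ROOT/util/dl.sh defines the dl function, that $SGE_ROOT/util/arch prints
the architecture string, and that the binary is
$SGE_ROOT/bin/<arch>/sge_qmaster. The function name run_qmaster_debug and the
DRY_RUN switch are hypothetical and not part of Grid Engine.

```shell
#!/bin/sh
# Sketch: start sge_qmaster in the foreground with debug level 3.
# Assumptions: util/dl.sh defines dl(); util/arch prints the arch string.
# run_qmaster_debug and DRY_RUN are illustrative names, not part of SGE.
run_qmaster_debug() {
    # Abort with a message if SGE_ROOT is unset or empty.
    : "${SGE_ROOT:?SGE_ROOT must be set}"
    arch=$("$SGE_ROOT/util/arch") || return 1
    qmaster="$SGE_ROOT/bin/$arch/sge_qmaster"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show which binary would be started.
        printf '%s\n' "$qmaster"
        return 0
    fi
    . "$SGE_ROOT/util/dl.sh"   # step 1: load the dl helper
    dl 3                       # step 2: select debug level 3
    "$qmaster"                 # step 3: run the binary directly (as root)
}
```

Calling it with DRY_RUN=1 first only prints the path of the binary, which is
a cheap way to check the environment before stopping the running qmaster.
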
>> The scheduler monitoring could also be helpful. You can turn it on by
>> running qconf -msconf and setting the "params" entry to MONITOR=1. A
>> $SGE_CELL/common/schedule file will then be created; it contains the
>> scheduler thread output.
>>
>> I need this file too.
>>
>> To get information about the communication, please do the following:
>>
>> export SGE_QPING_OUTPUT_FORMAT="s:12"
>>
>> Run on the master host as root:
>> bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1
>>
>> The qping output could also help to find something.
>>
>> All kinds of messages files could also be helpful for finding a starting
>> point or any hint about the reason for this behaviour, and could enable
>> us to reproduce the problem.
>>
>> Regards,
>> Marco
>>
>>
>>
>> On 16.04.2010 13:16, fx wrote:
>>     
>>> dom <marco.donauer at sun.com> writes:
>>>
>>>   
>>>       
>>>> Currently I have no idea where or how to start digging into this problem.
>>>>     
>>>>         
>>> I think it has to be done systematically -- figuring out how the list
>>> structures become invalid.  If that's not clear from the core dump, we
>>> need to instrument the program somehow to try to find where it happens.
>>> From experience I'm just surprised it's a new sort of bug that's not
>>> familiar to the developers, so I guess it's due to recent architectural
>>> changes.
>>>
>>> I'm willing to do a reasonable amount of work on this, or provide access
>>> to our cluster, though I'm not sure whether that will help, as we can't
>>> keep a debugging session open for long on a crashed qmaster.
>>>
>>>
>>>       
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253662
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257719
