[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

dom marco.donauer at sun.com
Mon May 31 13:32:18 BST 2010


Andreas,

From whom did he get this info? Someone from Oracle, or was it a
customer/user?
Thanks for this.
Currently I can't work on this explicitly because of other tasks, but after
finishing them I will take up the investigation of this problem again.
I hope that I can find something out.

Regards,
Marco


On 05/31/10 09:56, ah_sunsource wrote:
> Hi Marco,
>
> at the "Oracle HPC Consortium" a colleague of mine got the information
> that these segfaults only occur in self-built binaries. That's not true
> in our case. I've replaced the binaries with the "courtesy" version of
> 6.2u5 and unfortunately the segfaults continue.
>
> Cheers,
> Andreas
>
> On Tue, 2010-05-18 at 11:26 +0200, Marco Donauer wrote:
>> Andreas,
>>
>> thank you again, here on the mailing list, for providing the debug and log
>> files. I will have a look at them and hope to find something.
>>
>> Thanks Marco
>>
>>
>> On 18.05.2010 09:07, ah_sunsource wrote:
>>> Hi Marco,
>>>
>>> I just scripted this but it's simply impossible to run it for more than
>>> a very short time. The debug output grows to several gigabytes
>>> within minutes (even the qping output)! Our farm provides around 1750
>>> CPU cores. That's what I would call medium size.
>>>
>>> Isn't there a way to get the needed data "cheaper"? I will now try to
>>> bzip the output on the fly before storing it. I cannot promise that
>>> this will work.
>>>
>>> Cheers,
>>> Andreas
>>>
>>> On Fri, 2010-04-16 at 15:03 +0200, dom wrote:
>>>> Dave,
>>>>
>>>> your core dumps showed me that the problem is caused by an incomplete
>>>> or broken list structure.
>>>> But it is very hard to find the reason when we are not able to reproduce
>>>> the problem. It could be something within the qmaster, the scheduler
>>>> thread, or the spooling.
>>>>
>>>> @all:
>>>> So could everybody who sees this issue, and where it is possible to
>>>> restart or start the qmaster with debugging, please do the following?
>>>>
>>>> # source $SGE_ROOT/util/dl.csh (or ". $SGE_ROOT/util/dl.sh" in sh-type shells)
>>>> # dl 3
>>>> # start the qmaster binary directly as root. The qmaster won't daemonize
>>>> then and a lot of output will be produced.
>>>>
>>>> I need this output.
>>>>
>>>> Then the scheduler monitoring could be helpful. You can turn it on using
>>>> qconf -msconf and setting the "params" entry to MONITOR=1. The file
>>>> $SGE_CELL/common/schedule will then be created; it contains the
>>>> scheduler thread output.
>>>>
>>>> I need this file too.
>>>>
>>>> To get information about the communication, please do the following:
>>>>
>>>> export SGE_QPING_OUTPUT_FORMAT="s:12"
>>>>
>>>> Execute on the master host as root:
>>>> bin/<arch>/qping -i 5 -dump master_hostname SGE_QMASTER_PORT qmaster 1
>>>>
>>>> The qping output could also help to find something.
>>>>
>>>> Also, all kinds of messages files could help us find a starting point or
>>>> any hint about the reason for this behaviour, and enable us to
>>>> reproduce the problem.
>>>>
>>>> Regards,
>>>> Marco
>>>>
>>>>
>>>>
>>>> On 16.04.2010 13:16, fx wrote:
>>>>> dom <marco.donauer at sun.com> writes:
>>>>>
>>>>>> Currently I have no hint where and how I could step into this problem.
>>>>>
>>>>> I think it has to be done systematically -- figuring out how the list
>>>>> structures become invalid.  If that's not clear from the core dump, we
>>>>> need to instrument the program somehow to try to find where it happens.
>>>>> From experience I'm just surprised it's a new sort of bug that's not
>>>>> familiar to the developers, so I guess it's due to recent architectural
>>>>> changes.
>>>>>
>>>>> I'm willing to do a reasonable amount of work on this, or provide access
>>>>> to our cluster, though I'm not sure whether that will help, as we can't
>>>>> keep a debugging session open for long on a crashed qmaster.
>>>>>
>>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253662
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>     
>>>>         
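
On the question quoted above about getting the debug data "cheaper": one
option may be to compress the output while it is being written instead of
storing it as plain text. The following is only a minimal sketch, not a
tested Grid Engine procedure. The helper name capture_compressed, the sample
file name and the gzip fallback are assumptions made for illustration, and
the stand-in command would have to be replaced by the qmaster binary started
in debug mode or by the qping call.

```shell
#!/bin/sh
# Use bzip2 as mentioned in the thread; fall back to gzip if bzip2 is missing
# (the fallback is an assumption for portability, not part of the thread).
if command -v bzip2 >/dev/null 2>&1; then
    COMPRESS=bzip2
else
    COMPRESS=gzip
fi

# Hypothetical helper, not part of Grid Engine: run a command and write its
# combined stdout and stderr to a compressed file.
# usage: capture_compressed OUTPUT_FILE COMMAND [ARGS...]
capture_compressed() {
    out_file=$1
    shift
    "$@" 2>&1 | "$COMPRESS" -c > "$out_file"
}

# Stand-in command for demonstration only.
sample="${TMPDIR:-/tmp}/debug_sample.$$"
capture_compressed "$sample" sh -c 'echo "stdout line"; echo "stderr line" >&2'

# Read the compressed log back without unpacking it on disk.
"$COMPRESS" -dc "$sample"
```

Debug traces are usually very repetitive, so they tend to compress well, but
whether that is enough for output of several gigabytes within minutes would
have to be tried on the affected cluster.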

-- 

Sun Microsystems GmbH         Marco Donauer
Dr.-Leo-Ritter-Str. 7         SUN Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-211  (x60211)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:marco.donauer at sun.com
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court (Amtsgericht Muenchen): HRB 161028
Managing Director: Juergen Kunz

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=260149



