[GE users] qmaster for 6.1U5 crashing

crei crei at sun.com
Thu Feb 26 16:02:46 GMT 2009


On a running qmaster you can run

qping -info <qmaster_host> <qmaster_port> qmaster 1
qping -dump <qmaster_host> <qmaster_port> qmaster 1

to test whether the communication threads are still working.

Another possibility is to attach strace (Linux) or truss (Solaris) to the
qmaster process to see what it is doing.
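
As a rough sketch of the two suggestions above, the following shell snippet
only prints the commands one might run against a hung qmaster; it does not
execute them. The variables QMASTER_HOST and QMASTER_PID, the fallback port
6444, the /tmp output paths, and the helper names qping_cmd and trace_cmd are
assumptions made for illustration and are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch only: build (but do not run) diagnostic commands for sge_qmaster.
# Assumption: the qmaster runs on this host unless QMASTER_HOST is set.
QMASTER_HOST=${QMASTER_HOST:-$(hostname)}
# Assumption: fall back to port 6444 when SGE_QMASTER_PORT is not set.
QMASTER_PORT=${SGE_QMASTER_PORT:-6444}

# Print a qping command line; $1 is the mode flag (-info or -dump).
qping_cmd() {
    echo "qping $1 $QMASTER_HOST $QMASTER_PORT qmaster 1"
}

# Print a system call tracer command line.
# $1 is the OS name as reported by `uname -s`, $2 is the qmaster pid.
trace_cmd() {
    case "$1" in
        SunOS) echo "truss -f -o /tmp/qmaster.truss -p $2" ;;
        *)     echo "strace -f -tt -o /tmp/qmaster.strace -p $2" ;;
    esac
}

# Show the commands for this machine. The pid can be looked up with,
# for example, `pgrep -x sge_qmaster`.
qping_cmd -info
qping_cmd -dump
trace_cmd "$(uname -s)" "${QMASTER_PID:-PID}"
```

Printing the commands first makes it easy to review them before attaching a
tracer to a production qmaster, since tracing slows the process down further.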

Regards,

Christian



On 02/26/09 14:46, dangruhn wrote:
> As a side note, I've not been able to use a 0:0:0 scheduling interval
> for a while.  I get an error message in the scheduler message file:
> 
> 01/23/2009 09:46:15|  main|alice|I|starting up SGE 6.2u1_1 (lx24-amd64)
> 01/23/2009 10:09:10|event_|alice|E|invalid event interval 0
> 01/23/2009 10:09:20|event_|alice|E|invalid event interval 0
> ...
> 
> At this point I have to reset the value to non-zero and restart the 
> qmaster.  I thought this was supposed to work and I have used it in the 
> past.
> 
> Dan
> 
> templedf wrote:
>> It's stable at any value other than 0:0:0.  I've seen issues where once 
>> it's set to 0:0:0, the master goes wonky and has to be restarted.  The 
>> "stack trace" for SGE is the dl script.  Source the util/dl.[c]sh script 
>> and then run "dl 1".  When you run any subsequent SGE command (including 
>> sge_qmaster), you will get copious amounts of debug information.
>>
>> Daniel
>>
>> magawake wrote:
>>   
>>> I don't think it was ever set to that low of a number. Does increasing the number help for stability?  What is the recommended number? 
>>>
>>>
>>> Also, how would we take a "stack trace"? Is there a command I can use?
>>>
>>> Also, nothing crazy in the logs.
>>>
>>> TIA
>>>
>>>
>>>   
>>>     
>>>> What is ever set to 0:0:0?  What's in the messages file?
>>>>
>>>> Daniel
>>>>
>>>> magawake wrote:
>>>>     
>>>>       
>>>>> The scheduler interval is "0:0:15".
>>>>>
>>>>>
>>>>> During the crash there are many jobs on the system. Probably 1000 array jobs.
>>>>>
>>>>> The job counter is close to 900k if that makes any difference. 
>>>>>
>>>>> TIA
>>>>>   
>>>>>       
>>>>>         
>>>>>> On 21.02.2009 at 15:19, magawake wrote:
>>>>>>
>>>>>>     
>>>>>>         
>>>>>>           
>>>>>>> In the past week our qmaster was in an endless loop. The CPU was
>>>>>>> at 100% and there was no communication to the execd.
>>>>>>
>>>>>> What is the setting of the schedule interval and how many jobs are in  
>>>>>> the system?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> The fix was to simply stop and restart the daemon, but I am not
>>>>>>> sure what was causing this issue. Next time this occurs, is there
>>>>>>> something I can do to get more info and submit a bug report? Like
>>>>>>> logs, strace, debug info, etc.
>>>>>>>
>>>>>>> TIA
>>>>>>>
>>>>>>> ------------------------------------------------------
>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=111150
>>>>>>>
>>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=115273

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


