[GE users] qmaster for 6.1U5 crashing

templedf dan.templeton at sun.com
Thu Feb 26 13:45:51 GMT 2009


The schedule interval is stable at any value other than 0:0:0.  I've seen 
issues where once it's set to 0:0:0, the master goes wonky and has to be 
restarted.  The closest thing SGE has to a "stack trace" is the dl script.  
Source util/dl.sh (or util/dl.csh if you use csh) and then run "dl 1".  
Any SGE command you run after that (including sge_qmaster) will print 
copious amounts of debug information.
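
As a rough sketch of the steps above, the following wraps them in a small 
function that saves the debug output of one command to a file, which could 
then be attached to a bug report.  It assumes SGE_ROOT points at the Grid 
Engine installation and that util/dl.sh defines a shell function named dl; 
the helper name capture_sge_debug and the log path in the example are 
made up for illustration and are not part of SGE.

```shell
# Sketch: capture SGE debug output for one command into a log file.
# Assumes SGE_ROOT is set and $SGE_ROOT/util/dl.sh defines a function "dl".
# capture_sge_debug is a hypothetical helper, not something SGE ships.
capture_sge_debug() {
    logfile=$1
    shift
    if [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "cannot read $SGE_ROOT/util/dl.sh" >&2
        return 1
    fi
    # Run in a subshell so the debug settings do not leak into the
    # calling shell session.
    (
        . "$SGE_ROOT/util/dl.sh"
        dl 1
        "$@"
    ) >"$logfile" 2>&1
}

# Example (hypothetical log path):
#   capture_sge_debug /tmp/qstat-debug.log qstat -f
```

Because the debug level is only set inside the subshell, other commands 
in the same session keep their normal output.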

Daniel

magawake wrote:
> I don't think it was ever set that low. Does increasing the number help with stability?  What is the recommended number?
>
>
> Also, how would we take a "stack trace"? Is there a command I can use?
>
> Also, nothing crazy in the logs.
>
> TIA
>
>
>   
>> Was it ever set to 0:0:0?  What's in the messages file?
>>
>> Daniel
>>
>> magawake wrote:
>>     
>>> The scheduler interval is "0:0:15".
>>>
>>>
>>> During the crash there are many jobs on the system. Probably 1000 array jobs.
>>>
>>> The job counter is close to 900k if that makes any difference. 
>>>
>>> TIA
>>>   
>>>       
>>>> On 21.02.2009 at 15:19, magawake wrote:
>>>>
>>>>     
>>>>         
>>>>> In the past week our qmaster was stuck in an endless loop. The CPU was 
>>>>> at 100% and there was no communication with the execd.
>>>>>       
>>>>>           
>>>> What is the setting of the schedule interval and how many jobs are in  
>>>> the system?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>     
>>>>         
>>>>> The fix was to simply stop and restart the daemon, but I am not 
>>>>> sure what was causing this issue. Next time this occurs, is there 
>>>>> something I can do to get more info and submit a bug report? Like 
>>>>> logs, strace, debug info, etc.
>>>>>
>>>>> TIA
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=111150
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>       
>>>>>           
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=115172

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


