[GE users] qmaster for 6.1U5 crashing

dangruhn Dan.Gruhn at groupw.com
Thu Feb 26 13:46:52 GMT 2009


As a side note, I've not been able to use a 0:0:0 scheduling interval
for a while.  I get an error message in the scheduler message file:

01/23/2009 09:46:15|  main|alice|I|starting up SGE 6.2u1_1 (lx24-amd64)
01/23/2009 10:09:10|event_|alice|E|invalid event interval 0
01/23/2009 10:09:20|event_|alice|E|invalid event interval 0
...

At this point I have to reset the value to non-zero and restart the 
qmaster.  I thought this was supposed to work and I have used it in the 
past.
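
In case it helps anyone watch for this condition, here is a minimal Python
sketch that picks these errors out of a messages file. It assumes the
pipe-separated layout seen in the three lines quoted above
(timestamp|thread|host|level|message); that layout is inferred from this
excerpt only, and the names LogEntry, parse_line and find_interval_errors
are made up for illustration, not part of SGE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One line of the messages file, split into its five fields."""
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split a line on '|' into five fields; return None if it does not fit.

    Assumption: fields are timestamp|thread|host|level|message, as in the
    excerpt quoted in this mail. The message itself may contain '|', so
    only the first four separators are used.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = (p.strip() for p in parts)
    return LogEntry(timestamp, thread, host, level, message)


def find_interval_errors(lines: Iterable[str]) -> List[LogEntry]:
    """Return error-level entries reporting an invalid event interval."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that do not match the assumed layout
        if entry.level == "E" and entry.message.startswith("invalid event interval"):
            hits.append(entry)
    return hits
```

Feeding it the lines of the messages file (for example via open(path) on
whatever path your installation uses) returns one entry per matching error,
which is enough to spot when the interval has gone bad.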

Dan

templedf wrote:
> It's stable at any value other than 0:0:0.  I've seen issues where once 
> it's set to 0:0:0, the master goes wonky and has to be restarted.  The 
> "stack trace" for SGE is the dl script.  Source the util/dl.[c]sh script 
> and then run "dl 1".  When you run any subsequent SGE command (including 
> sge_qmaster), you will get copious amounts of debug information.
>
> Daniel
>
> magawake wrote:
>> I don't think it was ever set that low. Does increasing the value help with stability? What is the recommended value?
>>
>> Also, how would we take a "stack trace"? Is there a command I can use?
>>
>> Also, nothing crazy in the logs.
>>
>> TIA
>>
>>> Was it ever set to 0:0:0?  What's in the messages file?
>>>
>>> Daniel
>>>
>>> magawake wrote:
>>>> The scheduler interval is "0:0:15".
>>>>
>>>>
>>>> During the crash there are many jobs on the system. Probably 1000 array jobs.
>>>>
>>>> The job counter is close to 900k if that makes any difference. 
>>>>
>>>> TIA
>>>>> On 21.02.2009 at 15:19, magawake wrote:
>>>>>
>>>>>> In the past week our qmaster got stuck in an endless loop. The CPU was
>>>>>> at 100% and there was no communication with the execd.
>>>>> What is the setting of the schedule interval and how many jobs are in  
>>>>> the system?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> The fix was to simply stop and restart the daemon, but I am not
>>>>>> sure what was causing this issue. Next time this occurs, is there
>>>>>> something I can do to get more info and submit a bug report? Like
>>>>>> logs, strace, debug info, etc.
>>>>>>
>>>>>> TIA
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=111150
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

-- 
Dan Gruhn
Group W Inc.
8315 Lee Hwy, Suite 303
Fairfax, VA, 22031
PH: (703) 752-5831
FX: (703) 752-5851

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=115174

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].