[GE users] sge 6.2U3 scheduler problem

templedf dan.templeton at sun.com
Fri Dec 18 13:57:28 GMT 2009


How reproducible is the problem?  Does it happen every time you restart 
the qmaster?

Daniel

esneeh wrote:
> Dan, the spooling method is classic. I can't see a pattern among the hosts.
> I did add some hostgroups and queues, but only using existing hosts.
> We've done that before without seeing this side effect (scheduler not available).
>  
>
> Thanks,
> Eddie
> ---
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Thursday, December 17, 2009 9:03 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sge 6.2U3 scheduler problem
>
> What spooling method are you using?  Is there any particular pattern 
> about which hosts have the jobs that are declared redundant?  How 
> reproducible is the problem?
>
> Daniel
>
> esneeh wrote:
>> Daniel, we restarted the daemon on the server.  This used to be our formula for fixing everything in SGE 5.3.
>>
>> Thanks,
>> Eddie
>> ---
>>
>>
>> -----Original Message-----
>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
>> Sent: Thursday, December 17, 2009 5:24 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] sge 6.2U3 scheduler problem
>>
>> You edit the global host config with "qconf -mconf".  There's an 
>> attribute called qmaster_params.  See sge_conf(5).
>>
>> |E| messages aren't normal.  Not surprisingly, the E is for error.
>>
>> What is the trigger for this condition?  Does it follow a qmaster 
>> restart?  Execd restart?
>>
>> Daniel
>>
>> esneeh wrote:
>>> Chris, Daniel, Fred, thanks for responding.
>>>
>>> 1. Fred, qconf -secl returns the name of the "ADMIN_HOST_LIST"  (1 server) that's in the config file.
>>>
>>> 2. Chris, yes, there is a relatively large number of pending jobs, but that's always been the case.
>>>    The messages file has entries like:
>>>
>>> 12/17/2009 14:50:51|worker|tspirit|E|execd@lcd-770 reports running job (8626295.1/master) in queue "msi_1cpu.q@lcd-770" that was not supposed to be there - killing
>>> .12/17/2009 14:50:51|worker|tspirit|I|removing trigger to terminate job 89396.1
>>> 12/17/2009 14:50:51|worker|tspirit|I|job 89396.1 finished on host lcd-563
>>>
>>> Is the |E| message something that's "normal"?
>>>
>>> 3. Daniel, I usually see the qrsh_control_port error together with the message:
>>>    "Can not get job info messages, scheduler is not available"
>>>    Where is the global host conf file?  I can't find qmaster_params in the GUI either.
>>>    The qmaster/schedd/messages file has size 0; it is empty.
>>>
>>> Thanks again,
>>> Eddie
>>>
>>> ---
>>>
>>>
>>> -----Original Message-----
>>> From: fy [mailto:fly at anydata.co.uk] 
>>> Sent: Thursday, December 17, 2009 3:41 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] sge 6.2U3 scheduler problem
>>>
>>> Is the scheduler running?
>>>
>>> If you type "qconf -secl", do you get output like this?
>>>
>>>        ID NAME            HOST
>>> --------------------------------------------------
>>>         1 scheduler      ...
>>>
>>>
>>> Fred Youhanaie
>>>
>>>
>>> On 17/12/09 22:04, esneeh wrote:
>>>> Hi everyone, I'm using SGE 6.2U3.  Jobs have stopped getting scheduled all of a sudden, and qstat is giving me the following message:
>>>> "Can not get job info messages, scheduler is not available"
>>>>
>>>> Does anyone know what might be causing this message and what can be done to get jobs running again?
>>>>
>>>>
>>>> Thanks for any advice,
>>>> Eddie
>>>> ---
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233998
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
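The qmaster messages excerpt quoted above uses "|" as a field separator, with a one-letter severity in the fourth field (E for error, I for info, going by the lines shown). As a rough sketch only, the following Python filters such lines by severity. The field names (timestamp, thread, host, level, text) are guesses made for illustration and are not taken from the Grid Engine documentation. The sample lines are copied from the thread, with "@" restored where the list archive shows " at ".

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are illustrative guesses based on the excerpt in the thread.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into five fields, or return None."""
    # Limit the split so any "|" inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


def entries_with_level(lines: Iterable[str], level: str = "E") -> List[LogEntry]:
    """Return the parsed entries whose severity field equals `level`."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level == level]


# Sample lines taken from the messages excerpt quoted in this thread.
sample = [
    '12/17/2009 14:50:51|worker|tspirit|E|execd@lcd-770 reports running job '
    '(8626295.1/master) in queue "msi_1cpu.q@lcd-770" that was not supposed '
    'to be there - killing',
    '12/17/2009 14:50:51|worker|tspirit|I|removing trigger to terminate job 89396.1',
    '12/17/2009 14:50:51|worker|tspirit|I|job 89396.1 finished on host lcd-563',
]

errors = entries_with_level(sample)
for entry in errors:
    print(entry.timestamp, entry.host, entry.text)
```

To run it against a real file, pass an open file object instead of the sample list; the location of the messages file depends on the installation, so no path is assumed here.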

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234094

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
