[GE users] sge 6.2U3 scheduler problem

esneeh esneeh at marvell.com
Fri Dec 18 07:29:07 GMT 2009


Dan, the spooling method is classic. I can't see a pattern in the hosts.
I did add some hostgroups and queues, but they use existing hosts.
We've done that before without seeing this side effect (scheduler not available).
 

Thanks,
Eddie
---
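
One way to look for a host pattern in the |E| entries quoted further down
this thread is to tally them per execd host. The following is a minimal
Python sketch, not something taken from the thread: the names parse_line and
unexpected_job_hosts and the regular expression are illustrative assumptions,
and it assumes the qmaster messages file uses the pipe-separated layout
(timestamp|thread|host|level|text) seen in the quoted log lines.

```python
import re
from collections import Counter

# Accept both the real "execd@host" form and the archive's obfuscated
# "execd at host" form (assumption: the list archive rewrote "@" as " at ").
EXECD_HOST = re.compile(r"execd(?:@| at )(\S+) reports running job")


def parse_line(line):
    """Split one messages line into (timestamp, thread, host, level, text).

    Returns None if the line does not have the five expected fields.
    A stray leading "." is tolerated (assumed to be copy/paste debris).
    """
    parts = line.strip().lstrip(".").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(parts)


def unexpected_job_hosts(lines):
    """Count |E| 'reports running job' entries per reporting execd host."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec[3] != "E":
            continue
        match = EXECD_HOST.search(rec[4])
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log excerpt quoted in this thread.
SAMPLE = [
    '12/17/2009 14:50:51|worker|tspirit|E|execd at lcd-770 reports running job '
    '(8626295.1/master) in queue "msi_1cpu.q at lcd-770" that was not supposed '
    'to be there - killing',
    '12/17/2009 14:50:51|worker|tspirit|I|removing trigger to terminate job 89396.1',
    '12/17/2009 14:50:51|worker|tspirit|I|job 89396.1 finished on host lcd-563',
]

if __name__ == "__main__":
    for host, n in unexpected_job_hosts(SAMPLE).most_common():
        print(host, n)
```

To run it against a real file, pass an open file object instead of SAMPLE,
for example unexpected_job_hosts(open(path)) where path points at the qmaster
messages file on your installation. Hosts with high counts would be the first
place to look for a pattern.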


-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Thursday, December 17, 2009 9:03 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge 6.2U3 scheduler problem

What spooling method are you using?  Is there any particular pattern 
about which hosts have the jobs that are declared redundant?  How 
reproducible is the problem?

Daniel

esneeh wrote:
> Daniel, we restarted the daemon on the server.  This used to be our formula for fixing everything in sge5.3.
>
> Thanks,
> Eddie
> ---
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Thursday, December 17, 2009 5:24 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sge 6.2U3 scheduler problem
>
> You edit the global host config with "qconf -mconf".  There's an 
> attribute called qmaster_params.  See sge_conf(5).
>
> |E| messages aren't normal.  Not surprisingly, the E is for error.
>
> What is the trigger for this condition?  Does it follow a qmaster 
> restart?  Execd restart?
>
> Daniel
>
> esneeh wrote:
>> Chris, Daniel, Fred, thanks for responding.
>>
>> 1. Fred, qconf -secl returns the name of the "ADMIN_HOST_LIST"  (1 server) that's in the config file.
>>
>> 2. Chris, yes, there is a relatively large number of pending jobs, but that's always been the case.
>>    The messages file has entries like:
>>
>> 12/17/2009 14:50:51|worker|tspirit|E|execd at lcd-770 reports running job (8626295.1/master) in queue "msi_1cpu.q at lcd-770" that was not supposed to be there - killing
>> 12/17/2009 14:50:51|worker|tspirit|I|removing trigger to terminate job 89396.1
>> 12/17/2009 14:50:51|worker|tspirit|I|job 89396.1 finished on host lcd-563
>>
>>    Is the |E| message something that's "normal"?
>>
>> 3. Daniel, I usually see the qrsh_control_port error together with the message:
>>    "Can not get job info messages, scheduler is not available"
>>    Where is the global host conf file?  I can't find qmaster_params in the GUI either.
>>    The qmaster/schedd/messages file has size 0; it is empty.
>>
>> Thanks again,
>> Eddie
>>
>> ---
>>
>>
>> -----Original Message-----
>> From: fy [mailto:fly at anydata.co.uk] 
>> Sent: Thursday, December 17, 2009 3:41 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] sge 6.2U3 scheduler problem
>>
>> Is the scheduler running?
>>
>> If you type "qconf -secl", do you get output like this?
>>
>>        ID NAME            HOST
>> --------------------------------------------------
>>         1 scheduler      ...
>>
>>
>> Fred Youhanaie
>>
>>
>> On 17/12/09 22:04, esneeh wrote:
>>> Hi everyone, I'm using SGE 6.2U3.  Jobs have stopped getting scheduled all of a sudden, and qstat is giving me the following message:
>>> "Can not get job info messages, scheduler is not available"
>>>
>>> Does anyone know what might be causing this message and what can be done to get jobs running again?
>>>
>>>
>>> Thanks for any advice,
>>> Eddie
>>> ---
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233998
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
