[GE users] master configuration - timeout when exec host freezes

templedf dan.templeton at sun.com
Wed Nov 11 14:02:54 GMT 2009


Well, it's a subtlety of wording.  The max_unheard timer starts when the 
qmaster *tries* to contact the execd and fails, or when qmaster 
*notices* that the execd hasn't reported in.  The qmaster and execd 
don't talk to each other unless there's a reason, such as a load report 
or a job operation (start, suspend, resume, delete).
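
As a rough illustration of the timing described in this thread, here is a minimal Python sketch. It is only a model of the behavior as described here, not the qmaster's actual logic. The function name seconds_until_unknown and the parameter missed_reports are hypothetical, and whether the timer starts after 1 or 2 missed load reports is exactly the detail that is uncertain.

```python
def seconds_until_unknown(load_report_time, max_unheard, missed_reports=1):
    """Estimate the seconds between the last load report received and the
    host's queues being set to 'unknown'.

    Assumption (not confirmed): qmaster only notices that the execd is
    silent after `missed_reports` load report intervals have passed, and
    only then does the max_unheard timer start running.
    """
    if load_report_time < 0 or max_unheard < 0 or missed_reports < 0:
        raise ValueError("all arguments must be non-negative")
    return missed_reports * load_report_time + max_unheard


# Test values from this thread: load_report_time = 10 s, max_unheard = 15 s.
for missed in (1, 2):
    print(missed, seconds_until_unknown(10, 15, missed))
```

With the 10 and 15 second test values this model gives 25 or 35 seconds, so under these assumptions the load report interval alone would not account for a delay of 20-30 minutes.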

Daniel

madpower wrote:
>> You don't happen to have a very large load report interval, do you?  The 
>> max_unheard timer doesn't start until an execd misses a couple of load 
>> reports.  (I can't remember if it's 1 or 2.)
>>     
> I thought of something like this too. In our case (for testing) we have 10 seconds for load report and 15 seconds for max_unheard.
>
> But, if I am allowed to quote the man-pages:
> "If sge_qmaster (8) could not contact or was not contacted by the execution daemon of a host for max_unheard seconds, all queues residing on that particular host are set to status unknown."
> To me this means that if the master has not received a message from a host for max_unheard seconds, the host is marked as unreachable, independent of the load_report_time. Or did I miss something?
>
> best regards,
> Matthias
>
>> Daniel
>>
>> madpower wrote:
>>     
>>> hi,
>>>
>>>   
>>>       
>>>> please check the entries "reschedule_unknown" and "max_unheard" in  
>>>> `man sge_conf`.
>>>>     
>>>>         
>>> thanks for the pointer. The reschedule_unknown parameter works as expected, but max_unheard seems to be disregarded.
>>> In fact, I could observe the following behavior:
>>> *) if max_unheard is set to a smaller value than load_report_time, then after about 20 minutes with this setting the master recognizes that it has no information on the state of some execution hosts; this is updated as soon as the next load report is sent.
>>> *) if max_unheard is set to a value larger than load_report_time it takes approx. 20-30 minutes until the master recognizes that an execution host is unavailable.
>>>
>>> Does anyone have an idea what's going wrong here? Or has anyone already experienced similar behavior?
>>>
>>> br,
>>> Matthias
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226133
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>       
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226167
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226170



