[GE users] master configuration - timeout when exec host freezes

templedf dan.templeton at sun.com
Wed Nov 11 13:34:39 GMT 2009


You don't happen to have a very large load report interval, do you?  The 
max_unheard timer doesn't start until an execd misses a couple of load 
reports.  (I can't remember if it's 1 or 2.)
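
As a rough illustration of why the load report interval matters, here is a minimal sketch of the arithmetic. It assumes a simple additive model: the master waits for some number of missed load reports and only then starts counting max_unheard. That model, the function names, and the example values are assumptions for illustration, not taken from the qmaster source or from the configuration discussed in this thread.

```python
def parse_hms(value: str) -> int:
    """Convert an HH:MM:SS string (the format used in sge_conf) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def estimated_detection_delay(load_report_time: str,
                              max_unheard: str,
                              missed_reports: int = 2) -> int:
    """Estimate the seconds until the master flags a frozen exec host.

    Assumed model (hypothetical): the max_unheard timer starts only after
    `missed_reports` load reports have failed to arrive. Whether that
    number is 1 or 2 is uncertain, so it is left as a parameter.
    """
    return missed_reports * parse_hms(load_report_time) + parse_hms(max_unheard)


if __name__ == "__main__":
    # Illustrative values only: a short and a long load report interval,
    # both with a five-minute max_unheard.
    for interval in ("00:00:40", "00:10:00"):
        delay = estimated_detection_delay(interval, "00:05:00")
        print(f"load_report_time={interval}: about {delay / 60:.1f} minutes")
```

Under this model a ten-minute load report interval with a five-minute max_unheard gives a delay of about 25 minutes, so a long interval alone could account for a wait in the 20-30 minute range.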

Daniel

madpower wrote:
> hi,
>
>   
>> please check the entries "reschedule_unknown" and "max_unheard" in  
>> `man sge_conf`.
>>     
> Thanks for the pointer. The reschedule_unknown parameter works as expected, but max_unheard seems to be disregarded.
> In fact, I observed the following behavior:
> *) If max_unheard is set to a value smaller than load_report_time, then after about 20 minutes with this setting the master recognizes that it has no information on the state of some execution hosts; the state is updated as soon as the next load report is sent.
> *) If max_unheard is set to a value larger than load_report_time, it takes approx. 20-30 minutes until the master recognizes that an execution host is unavailable.
>
> Does anyone have an idea what's going wrong here? Or has anyone already experienced similar behavior?
>
> br,
> Matthias
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226133
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=226163



