[GE users] master configuration - timeout when exec host freezes
crei at sun.com
Wed Nov 11 13:57:59 GMT 2009
On 11/11/09 14:33, madpower wrote:
>> - Can you please tell what values are configured for "load_report_time" and "max_unheard"?
> Two settings:
> 1) max_unheard = 0:0:15 and load_report_time = 0:0:10
> 2) max_unheard = 0:0:01 and load_report_time = 0:0:10
> Of course, I know that setting 2) is not meaningful; it was just for testing. Nevertheless, it seems to me that a change to either of these two values takes effect after about 15 minutes at the earliest. As soon as a host is recognized as "not reachable", everything works great (rescheduling, etc.). But hosts are almost never recognized as "not reachable". (Sometimes this works, but I cannot figure out in which situations it works and in which it does not. In most situations it does not.)
Ok - here I found a problem: the last heard time is also set when the qmaster wants to send a
message to the execd, which is not how it should be. The last heard time should only be set when
the qmaster receives a new message from the execd.
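To make the difference concrete, here is a minimal sketch in Python. It is not the qmaster code: the names HostTracker, on_message_received, on_message_sent and simulate are hypothetical, and the assumption that the qmaster tries to send something to the execd every 10 seconds is made up for the illustration. It only shows why updating the last heard time on send keeps a dead host looking alive.

```python
from dataclasses import dataclass


def parse_hms(value: str) -> int:
    """Convert an 'h:m:s' string such as '0:0:15' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class HostTracker:
    """Hypothetical per-host bookkeeping; not the real qmaster data structure."""
    max_unheard: int          # seconds
    last_heard: float = 0.0   # timestamp of the last message counted as "heard"

    def on_message_received(self, now: float) -> None:
        # Intended behaviour: only a message FROM the execd refreshes the timestamp.
        self.last_heard = now

    def on_message_sent(self, now: float, buggy: bool = False) -> None:
        # Problem described above: a send attempt also refreshes the timestamp.
        if buggy:
            self.last_heard = now

    def is_unreachable(self, now: float) -> bool:
        return now - self.last_heard > self.max_unheard


def simulate(buggy: bool) -> bool:
    """Execd reports at t=0, then the cable is unplugged; check the host at t=20."""
    tracker = HostTracker(max_unheard=parse_hms("0:0:15"))
    tracker.on_message_received(0.0)
    for t in (10.0, 20.0):  # assumed send attempts by the qmaster
        tracker.on_message_sent(t, buggy=buggy)
    return tracker.is_unreachable(20.0)
```

With buggy=False the host is flagged as unreachable at t=20, since 20 seconds have passed without a message from the execd. With buggy=True each send attempt resets the timestamp, so the host is never flagged as long as the qmaster keeps trying to talk to it.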
I've opened an issue for this:
>> - Can you please also describe how you provoke the "freeze" of your execd host?
> In the current test setting I unplug the network cable such that the host is not reachable for any network communication.
> Later on, there are two situations which we expect to occur from time to time: 1) There is no network connection due to switch misconfiguration (unfortunately, our exechosts are situated in a room in another building... but this is another (long) story).
> 2) A host freezes due to extensive swapping until the machine is absolutely unreachable - it can only be recovered by unplugging it and then turning it on again. This mainly happens due to software bugs in code segments which are written by students. Unfortunately, we cannot (and do not want to) forbid the students to use this cluster.
> Hope this helps in understanding the problem.
Yes - and I have to say that there is a problem with detecting network outages.
Please see the following link regarding "Why does it take so long to detect that the peer died?"
But I think fixing the problem described above will help.
I will try your scenario for testing the max_unheard parameter ...
> Best regards,
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
Sun Microsystems GmbH Christian Reissmann
Dr.-Leo-Ritter-Str. 7 Software Engineer
D-93049 Regensburg Phone: +49 (0)941 3075 112
Germany Fax: +49 (0)941 3075 222
http://www.sun.de mailto: Christian.Reissmann at sun.com
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering