[GE users] master configuration - timeout when exec host freezes
prandtstetter at ads.tuwien.ac.at
Wed Nov 11 13:33:47 GMT 2009
> - Can you please tell what values are configured for "load_report_time" and "max_unheard"?
1) max_unheard = 0:0:15 and load_report_time = 0:0:10
2) max_unheard = 0:0:01 and load_report_time = 0:0:10
Of course, I know that setting 2) is not meaningful; it was just for testing. Nevertheless, it seems to me that a change to either of these two values takes effect only after about 15 minutes at the earliest. As soon as a host is recognized as "not reachable", everything works great (rescheduling, etc.). But hosts are almost never recognized as "not reachable". (Sometimes this works, but I cannot figure out in which situations it works and in which it does not. In most situations it does not.)
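To illustrate why setting 2) is only useful for testing, here is a small Python sketch. It assumes that max_unheard should be longer than load_report_time, so that at least one load report can arrive before the host is considered unheard. The helper names hms_to_seconds and is_plausible are made up for this example and are not part of Grid Engine.

```python
def hms_to_seconds(value: str) -> int:
    """Convert an h:m:s string such as '0:0:15' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def is_plausible(max_unheard: str, load_report_time: str) -> bool:
    """Assumption: max_unheard must exceed load_report_time, otherwise
    no load report can arrive within the max_unheard window."""
    return hms_to_seconds(max_unheard) > hms_to_seconds(load_report_time)


# The two settings from this message: (max_unheard, load_report_time).
settings = {
    "setting 1": ("0:0:15", "0:0:10"),
    "setting 2": ("0:0:01", "0:0:10"),
}

for name, (max_unheard, load_report_time) in settings.items():
    print(name, is_plausible(max_unheard, load_report_time))
```

Under that assumption, setting 1) passes the check and setting 2) does not. This only checks the configured values against each other; it says nothing about why the changes take so long to become effective.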
> - Can you please also describe how you evoke the "freeze" of your execd host?
In the current test setting I unplug the network cable such that the host is not reachable for any network communication.
Later on, there are two situations which we expect to occur from time to time:
1) There is no network connection due to a switch misconfiguration (unfortunately, our exec hosts are located in a room in another building, but that is another, long story).
2) A host freezes due to excessive swapping until the machine is completely unreachable; it can then only be recovered by unplugging it and turning it on again. This mainly happens because of bugs in code written by students. Unfortunately, we cannot (and do not want to) forbid the students from using this cluster.
I hope this helps in understanding the problem.