[GE users] asking about speeding up load reporting -- johnny layne
laynejg at vcu.edu
Thu Sep 20 22:56:25 BST 2007
I hope that I can get some info from folks experienced with this.
On a cluster with 116 nodes, we're trying to set things up so that users
don't oversubscribe memory. I don't want to depend on users adding "-l"
options to their qsub scripts; they just won't do it reliably. So I've
set aside a node just for me to play with, and I've been running some
memory-intensive jobs and watching what happens as I adjust mem_free
and other values for the suspend thresholds. Setting the mem_free value
in "Suspend Thresholds" works really well: when my jobs start using too
much memory, the suspend script I wrote kicks in. All of that is fine.
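For anyone who wants to reason about the timing, here is a minimal Python sketch of the threshold check described above, under deliberately simplified assumptions: mem_free is only seen at load-report times, and the threshold is tested as soon as each report arrives. The names (first_suspend_time, mem_free_at) are hypothetical, and Grid Engine's own suspend_interval and nsuspend pacing of suspensions are ignored.

```python
from typing import Callable, Optional


def first_suspend_time(
    mem_free_at: Callable[[float], float],
    threshold: float,
    report_interval: float,
    horizon: float,
) -> Optional[float]:
    """Return the first load-report time at which the reported mem_free is
    at or below threshold, or None if that never happens before horizon.

    Simplified model (assumption): mem_free is sampled only at multiples of
    report_interval, and the threshold is checked as soon as a report
    arrives. suspend_interval / nsuspend pacing is not modelled.
    """
    if report_interval <= 0:
        raise ValueError("report_interval must be positive")
    k = 0
    # Step by an integer count to avoid accumulating float error in t.
    while k * report_interval <= horizon:
        t = k * report_interval
        if mem_free_at(t) <= threshold:
            return t
        k += 1
    return None


if __name__ == "__main__":
    # Hypothetical job: mem_free starts at 8000 MB and drops 10 MB/s,
    # so it actually crosses a 1100 MB threshold at t = 690 s.
    def mem(t: float) -> float:
        return 8000.0 - 10.0 * t

    for interval in (40.0, 20.0):
        print(interval, first_suspend_time(mem, 1100.0, interval, 3600.0))
```

With these made-up numbers the crossing is noticed at 720 s with 40-second reports and at 700 s with 20-second reports; in this model the lag between the real crossing and its detection never exceeds one report interval.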
Now what I'm wondering is: how has it worked for people who try to
speed up the reporting of load values? I just changed load_report_time
for the host in this 1-node queue from the default 40 seconds to 20,
and watched my jobs using top and other methods. The newest memory hog
got suspended and restarted OK as usual, but this time the whole
suspend-restart-run cycle happened very cleanly and quickly compared to
the 40-second report time. That's really nice! However, I'm sure that
doubling the number of load reports across the cluster would make
communication costs a lot worse. Have any of you experimented with
this sort of thing? Do you have suggestions about how to test reporting
time values to find an optimal one, beyond the trial-and-error method
I'm using? Do you have any other suggestions about this sort of thing?
Thanks a lot for any information!
Oh, the cluster uses gigabit networking.