[GE users] Trouble with load thresholds
orion at cora.nwra.com
Mon Mar 8 18:44:44 GMT 2010
Using gridengine 6.2u5. I've got a couple of machines in our grid that
have a lot of interactive use, so I limit grid access with a load threshold
of np_load_avg = 1, a suspend threshold of 1.3 or 1.02, and a load
adjustment for np_load_avg of 1.
However, my 8 core machines are getting woefully underused.
Two different cases:
hobbes, suspend threshold of 1.3. top shows the load average has been
around 3.7-4.3. I generally only see one or two jobs at a time ever get
run on it. qstat -j shows:
queue instance "all.q@hobbes.cora.nwra.com"
dropped because it is overloaded: np_load_avg=1.003750 (= 0.541250 + 1.0
* 3.700000 with nproc=8) >= 1
I would have expected about 3-4 jobs on it. I can't make any sense of
what the above line is supposed to be telling me.
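For what it's worth, the numbers in that line are at least arithmetically consistent if the second term is divided by nproc. A minimal sketch of that reading follows. The function and parameter names are my own, and treating 3.7 as some scheduler-side count is an assumption I have not checked against the gridengine source:

```python
def adjusted_np_load_avg(reported, adjustment, count, nproc):
    """Reported normalized load plus the load adjustment term spread over nproc.

    Assumed reading of the qstat message; not taken from the gridengine source.
    """
    return reported + adjustment * count / nproc


# Values copied from the hobbes message above.
hobbes = adjusted_np_load_avg(reported=0.541250, adjustment=1.0, count=3.7, nproc=8)
raw_load = 0.541250 * 8  # un-normalized load, roughly what top reports

print(f"adjusted np_load_avg = {hobbes:.6f}")
print(f"raw load average     = {raw_load:.2f}")
print(f"over load threshold  = {hobbes >= 1}")
```

If that reading is right, the real load alone (about 0.54 per core) is well under the threshold, and it is the adjustment term that pushes the queue instance over 1.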
josiah, suspend threshold of 1.02. Steady load average of about 3.3.
It has 3 jobs on it, but qstat alternates between:
"compute.q@josiah.cora.nwra.com" dropped because it is overloaded:
np_load_avg=1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8) >= 1
"compute.q@josiah.cora.nwra.com" is in suspend alarm:
np_load_avg=1.026250 (= 0.425000 + 1.0 * 4.810000 with nproc=8) >= 1.02
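To see how narrow the gap between the two thresholds is on josiah, here is a rough sketch that solves the same expression for the second term. It assumes the messages mean reported + adjustment * count / nproc; the helper name is made up for illustration:

```python
def max_adjustment_count(reported, threshold, adjustment, nproc):
    """Value of the second term at which the adjusted load reaches threshold.

    Assumes adjusted = reported + adjustment * count / nproc (my reading only).
    """
    return (threshold - reported) * nproc / adjustment


# Values copied from the josiah messages above.
load_limit = max_adjustment_count(0.425, 1.0, 1.0, 8)      # about 4.6
suspend_limit = max_adjustment_count(0.425, 1.02, 1.0, 8)  # about 4.76

for count in (4.73, 4.81):
    adjusted = 0.425 + 1.0 * count / 8
    print(f"count={count}: adjusted={adjusted:.6f} "
          f"overloaded={adjusted >= 1.0} suspend_alarm={adjusted >= 1.02}")
```

Under that assumption, 4.73 sits between the two limits and 4.81 is above both, which would explain why qstat flips between "overloaded" and "suspend alarm" with a suspend threshold only 0.02 above the load threshold.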
Some thoughts -
- These are very short jobs, just a few seconds of cpu time, which must
be playing havoc with load adjustments. Does the load adjustment get
removed when a job ends?
- Why are load adjustments used to suspend jobs? I think that should
only use the actual load of the machine.
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane orion at cora.nwra.com
Boulder, CO 80301 http://www.cora.nwra.com