[GE users] Trouble with load thresholds
orion at cora.nwra.com
Tue Mar 9 17:26:29 GMT 2010
On 03/08/2010 03:42 PM, reuti wrote:
> On 08.03.2010 at 19:44, opoplawski wrote:
>> Using gridengine 6.2u5. I've got a couple of machines in our grid that
>> have a lot of interactive use, so I limit grid access with a load
>> threshold of np_load_avg = 1, a suspend threshold of 1.3 or 1.02, and a
>> load adjustment for np_load_avg of 1.
>> However, my 8 core machines are getting woefully underused.
>> Two different cases:
>> hobbes, suspend threshold of 1.3. top shows load average has been
>> around 3.7-4.3. I generally only see one or two jobs at a time ever
>> run on it. qstat -j shows:
>> queue instance
>> "all.q@hobbes.cora.nwra.com"
>> dropped because it is overloaded: np_load_avg=1.003750 (= 0.541250 +
>> * 3.700000 with nproc=8)>= 1
>> I would have expected about 3-4 jobs on it. I can't make any sense of
>> what the above line is supposed to be telling me.
>> josiah, suspend threshold of 1.02. steady load average about 3.3.
>> got 3 jobs on it, but qstat alternates with:
>> queue instance
>> "compute.q@josiah.cora.nwra.com" dropped because it is overloaded:
>> np_load_avg=1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8)>= 1
> 4.73 is not the actual load, but the sum of all decaying factors (see
> below) of all jobs on this exechost where it still applies.
I still just don't understand the line. Nothing seems to add up and I
don't know what any of the numbers are. Perhaps a parenthesis is missing
as well.
(0.425000 + 1.0 * 4.730000) / 8 = 5.155000 / 8 = 0.644375
(0.425000 + 1.0) * 4.730000 / 8 ≈ 6.74 / 8 = 0.8425
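
One reading under which the reported values do agree, offered as an assumption rather than documented behaviour: only the second term is divided by nproc, i.e. the first number is already the normalized load and the adjustment times the decaying job sum is spread over the cores. The helper name below is hypothetical, and the factor of 1.0 for hobbes is taken from the configured load adjustment of 1, since it is missing from the quoted qstat line. A minimal sketch:

```python
def adjusted_np_load_avg(raw_np_load_avg, adjustment, decayed_job_sum, nproc):
    """Hypothetical helper: normalized load plus the per-core share of the
    load adjustment (assumed reading of the qstat -j message)."""
    return raw_np_load_avg + adjustment * decayed_job_sum / nproc


# Values copied from the two qstat -j messages quoted above.
josiah = adjusted_np_load_avg(0.425000, 1.0, 4.730000, 8)
# The adjustment factor is missing in the hobbes line; 1.0 is assumed
# from the configured load adjustment for np_load_avg.
hobbes = adjusted_np_load_avg(0.541250, 1.0, 3.700000, 8)

print(f"josiah: {josiah:.6f}")  # qstat reported np_load_avg=1.016250
print(f"hobbes: {hobbes:.6f}")  # qstat reported np_load_avg=1.003750
```

If this reading is right, both hosts sit just above the threshold of 1 because of the adjustment term, not because of the raw normalized load (0.425 and 0.541).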
>> - Why are load adjustments used to suspend jobs? I think that should
>> only use the actual load of the machine.
> They aren't. If they were, you could check this with `qstat -f` and
> `qstat -explain A`. The adjusted load is only used to allow or
> disallow the scheduling of jobs. As the adjusted load value should
> reflect the real load after the decay time, the scheduler is looking
> ahead: if this job really reached the estimated load, a job would be
> suspended in the near future. Hence the job isn't scheduled.
> So much for the theory, but I also notice some jobs being pushed into
> T state without the queue itself being in suspend alarm state A, which
> is confusing. Even a single job can push itself into suspend state,
> then resume, and so on... Can you please file a bug?
I filed a bug here:
The issue for me seems to be that adjusted loads are indeed used in the
suspend threshold calculations, and that the load adjustment of a job is
not removed when the job completes.
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane orion at cora.nwra.com
Boulder, CO 80301 http://www.cora.nwra.com