[GE users] Trouble with load thresholds
reuti at staff.uni-marburg.de
Wed Mar 10 13:45:36 GMT 2010
On 09.03.2010 at 18:26, opoplawski wrote:
>>> queue instance
>>> "compute.q at josiah.cora.nwra.com" dropped because it is overloaded:
>>> np_load_avg=1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8) >= 1
>> 4.73 is not the actual load, but the sum of all decaying factors (see
>> below) of all jobs on this exechost where it still applies.
> I still just don't understand the line. Nothing seems to add up and I
> don't know what any of the numbers are. Perhaps a missing paren as well.
> 0.425000 + 1.0 * 4.730000 = 5.155000 / 8 = .64437500
> (0.425000 + 1.0) * 4.730000 = 6.74 / 8 = .84250000
>>> 1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8)
(8 * 0.425000 + 1.0 * 4.730000) / 8
i.e. the np_load_avg is first multiplied by nproc to get the absolute
load again, then the load_adjustment multiplied by 4.73 is added, and
the sum is divided by nproc again.
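
To make the arithmetic in that log line easier to check, here is a minimal sketch of the calculation as described above. The function and parameter names are made up for illustration and are not taken from the Grid Engine source; only the numbers come from the message being discussed.

```python
def adjusted_np_load_avg(np_load_avg, nproc, adjustment_sum, load_adjustment=1.0):
    """Recompute the value shown in the 'dropped because it is overloaded' line.

    np_load_avg     -- reported per-processor load (0.425 in the log line)
    nproc           -- number of processors on the exec host (8)
    adjustment_sum  -- sum of the decaying factors of all jobs (4.73)
    load_adjustment -- configured adjustment for np_load_avg (1.0)
    """
    absolute_load = np_load_avg * nproc           # back to the absolute load
    corrected = absolute_load + load_adjustment * adjustment_sum
    return corrected / nproc                      # normalise per processor again


value = adjusted_np_load_avg(0.425, 8, 4.73)
print(round(value, 6))        # 1.01625, matching the log line
print(value >= 1.0)           # True: the queue instance is dropped
```

So the line only adds up if the 0.425 is scaled by nproc before the 4.73 is added.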
The 4.73 is the sum of all load adjustments. With one job, it's 1.0,
decreasing to 0.0 over the time which is specified as
load_adjustment_decay_time in the scheduler configuration. But when a
job finishes, it should be removed at the next cycle of
The 4.73 should in your case come from several jobs running on one
and the same node, each contributing a different summand to the total,
as not all jobs were started at the same time. The contribution of each
job is not itemized, though, so it can't be tracked.
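
As a rough illustration of how several jobs could add up to such a total, here is a small sketch. It assumes the per-job factor falls linearly from 1.0 to 0.0 over load_adjustment_decay_time; the linear shape, the function names, the 450 second decay time and the job ages are all assumptions for the example, not values taken from this cluster.

```python
def job_adjustment(elapsed_s, decay_time_s):
    """Decaying factor of one job: 1.0 at start, 0.0 once the decay time has passed.

    Assumes a linear decay, which is an assumption of this sketch.
    """
    if decay_time_s <= 0:
        return 0.0
    remaining = 1.0 - elapsed_s / decay_time_s
    return max(0.0, min(1.0, remaining))


def adjustment_sum(elapsed_list_s, decay_time_s):
    """Sum of the decaying factors of all jobs on one exec host."""
    return sum(job_adjustment(e, decay_time_s) for e in elapsed_list_s)


# Hypothetical example: four jobs started 0, 90, 225 and 450 seconds ago,
# with a decay time of 450 seconds.
total = adjustment_sum([0, 90, 225, 450], 450)
print(round(total, 6))   # 2.3 = 1.0 + 0.8 + 0.5 + 0.0
```

A finished job whose factor is not removed would keep contributing to this sum until it has decayed.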
>>> - Why are load adjustments used to suspend jobs? I think that
>>> should only use the actual load of the machine.
>> They aren't. If they were, you could check this with `qstat -f` and
>> `qstat -explain A`. The adjusted load will only be used to allow or
>> disallow the scheduling of jobs. As the adjusted load value should
>> reflect the real load after the decay time, the scheduler is looking
>> ahead: if this job really reached the estimated load, it would
>> suspend a job in the near future. Hence the job isn't scheduled.
>> So much for the theory, but I also notice some jobs being pushed into
>> T state w/o the queue itself being in suspend alarm state A -
>> confusing. Even a single job can push itself into suspend state and
>> will resume and so on... Can you please file a bug?
> I filed a bug here:
> Issue for me seems to be that adjusted loads are indeed used in
> threshold calculations and that the load adjustment of a job is not
> removed when the job completes.
Thx. Even if it were intended, the state of the queue should
reflect this. For now only the job is switching between T and r.
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA/CoRA Division FAX: 303-415-9702
> 3380 Mitchell Lane orion at cora.nwra.com
> Boulder, CO 80301 http://www.cora.nwra.com
> To unsubscribe from this discussion, e-mail: [users-
> unsubscribe at gridengine.sunsource.net].