[GE users] Trouble with load thresholds

reuti reuti at staff.uni-marburg.de
Wed Mar 10 13:45:36 GMT 2010


On 09.03.2010 at 18:26, opoplawski wrote:

> <snip>
>>>                              queue instance
>>> "compute.q@josiah.cora.nwra.com" dropped because it is overloaded:
>>> np_load_avg=1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8)>= 1
>>
>> 4.73 is not the actual load, but the sum of all decaying factors (see
>> below) of all jobs on this exechost where they still apply.
>
> I still just don't understand the line.  Nothing seems to add up and I
> don't know what any of the numbers are.  Perhaps a missing paren as well.
>
> 0.425000 + 1.0 * 4.730000   = 5.155000 / 8 = .64437500
> (0.425000 + 1.0) * 4.730000 = 6.74 / 8     = .84250000

The line:

>>> 1.016250 (= 0.425000 + 1.0 * 4.730000 with nproc=8)

means:

(8 * 0.425000 + 1.0 * 4.730000) / 8

i.e. the np_load_avg is first multiplied by nproc to get the absolute
load again, then the load_adjustment multiplied by 4.73 is added, and
the sum is divided by nproc again.
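
To check the arithmetic, here is a small Python sketch. The function
name and its parameters are made up for illustration and are not part
of Grid Engine; it only reproduces the formula above with the numbers
from the scheduler message.

```python
def adjusted_np_load_avg(np_load_avg, load_adjustment, decay_sum, nproc):
    """Scale np_load_avg back to an absolute load, add the decayed
    job adjustments, then normalize by the number of processors again."""
    absolute_load = np_load_avg * nproc
    return (absolute_load + load_adjustment * decay_sum) / nproc


# Values taken from the scheduler message quoted above.
value = adjusted_np_load_avg(0.425, 1.0, 4.73, 8)
print(round(value, 6))  # 1.01625, i.e. >= 1, so the queue instance is dropped
```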

The 4.73 is the sum of the decay factors of all load adjustments. For
a single job, the factor starts at 1.0 and decreases to 0.0 over the
time specified as load_adjustment_decay_time in the scheduler
configuration. But when a job has finished, its adjustment should be
removed at the next cycle of load_report_time.

In your case the 4.73 should come from several jobs running on one
and the same node, each contributing a different summand to the total,
as not all jobs were started at the same time. The contribution of
each job is not itemized in the output though, so it can't be tracked
individually.
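
As a rough illustration of how such a sum can build up, here is a
sketch in Python. It assumes the decay is linear (which is how I read
sched_conf(5)); the function names, the job ages and the decay time of
450 seconds are hypothetical and not taken from your cluster.

```python
def decay_factor(age, decay_time):
    """Assumed linear decay: 1.0 at job start, 0.0 once `age`
    (seconds since job start) reaches `decay_time`."""
    return max(0.0, 1.0 - age / decay_time)


def decay_sum(job_ages, decay_time):
    """Sum of the per-job decay factors on one exechost."""
    return sum(decay_factor(age, decay_time) for age in job_ages)


# Hypothetical: four jobs started 0, 45, 90 and 225 seconds ago.
ages = [0, 45, 90, 225]
total = decay_sum(ages, decay_time=450)
print(round(total, 2))  # 3.2 = 1.0 + 0.9 + 0.8 + 0.5
```

A job older than the decay time contributes nothing, so a large sum
like 4.73 means at least five jobs were started on the node fairly
recently (or that adjustments of finished jobs were not removed).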


>
> ?
>
>>> - Why are load adjustments used to suspend jobs?  I think that should
>>> only use the actual load of the machine.
>>
>> They aren't. If they were, you could check this with `qstat -f` and
>> `qstat -explain A`. The adjusted load is only used to allow or
>> disallow the scheduling of jobs. As the adjusted load value should
>> reflect the real load after the decay time, the scheduler is looking
>> ahead: if the load really reached the estimated value, a job would
>> be suspended in the near future. Hence the job isn't scheduled.
>>
>> So much for the theory, but I also notice some jobs being pushed into
>> T state w/o the queue itself being in suspend alarm state A, which
>> is confusing. Even a single job can push itself into suspend state,
>> then resume, and so on... Can you please file a bug?
>
> I filed a bug here:
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=3248
>
> Issue for me seems to be that adjusted loads are indeed used in suspend
> threshold calculations and that the load adjustment of a job is not
> removed when the job completes.

Thanks. Even if this were intended, the state of the queue should
reflect it. For now only the job is switching between T and r.
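
To spell out the split I described in the quoted text (adjusted load
for scheduling, real load for suspending), here is a sketch in Python.
This is only the behavior as I understand it should be, not the actual
scheduler code; the function names and the suspend threshold value are
hypothetical.

```python
def may_schedule(np_load_avg, decay_sum, nproc, load_threshold,
                 load_adjustment=1.0):
    """Scheduling decision: uses the load *including* the decayed
    adjustments of recently started jobs (the look-ahead)."""
    adjusted = (np_load_avg * nproc + load_adjustment * decay_sum) / nproc
    return adjusted < load_threshold


def should_suspend(np_load_avg, suspend_threshold):
    """Suspend decision: should look at the reported load only."""
    return np_load_avg >= suspend_threshold


# Numbers from the scheduler message; the suspend threshold is made up.
print(may_schedule(0.425, 4.73, 8, load_threshold=1.0))  # False
print(should_suspend(0.425, suspend_threshold=1.0))      # False
```

With these numbers no new job would be dispatched to the queue
instance, but no running job should go into T state either.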

-- Reuti


>
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA/CoRA Division                    FAX: 303-415-9702
> 3380 Mitchell Lane                  orion at cora.nwra.com
> Boulder, CO 80301              http://www.cora.nwra.com
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247724
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247835
