[GE users] Trouble with load thresholds

opoplawski orion at cora.nwra.com
Tue Mar 9 17:26:29 GMT 2010


On 03/08/2010 03:42 PM, reuti wrote:
> On 08.03.2010 at 19:44, opoplawski wrote:
>
>> Using gridengine 6.2u5.  I've got a couple of machines in our grid that
>> have a lot of interactive use, so I limit grid access with a load
>> threshold of np_load_avg = 1, a suspend threshold of 1.3 or 1.02, and a
>> load adjustment for np_load_avg of 1.
>>
>> However, my 8 core machines are getting woefully underused.
>>
>> Two different cases:
>>
>> hobbes, suspend threshold of 1.3.  top shows load average has been
>> around 3.7-4.3.  I generally only see one or two jobs at a time ever
>> get run on it.  qstat -j shows:
>>
>>   queue instance "all.q@hobbes.cora.nwra.com" dropped because it is
>>   overloaded: np_load_avg=1.003750 (= 0.541250 + 1.0 * 3.700000 with
>>   nproc=8) >= 1
>>
>> I would have expected about 3-4 jobs on it.  I can't make any sense of
>> what the above line is supposed to be telling me.
>>
>>
>> josiah, suspend threshold of 1.02.  steady load average about 3.3.
>>
>> got 3 jobs on it, but qstat alternates with:
>>
>>   queue instance "compute.q@josiah.cora.nwra.com" dropped because it is
>>   overloaded: np_load_avg=1.016250 (= 0.425000 + 1.0 * 4.730000 with
>>   nproc=8) >= 1
>
> 4.73 is not the actual load, but the sum of all decaying factors (see
> below) of all jobs on this exechost where it still applies.
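
A rough illustration of what a "sum of decaying factors" could look like. The sketch assumes each job's factor falls linearly from 1 at job start to 0 once load_adjustment_decay_time has passed, which is how sched_conf(5) appears to describe it. The function names, the 450 second decay time, and the job ages are hypothetical and are not values from this cluster.

```python
def decay_factor(job_age_s, decay_time_s):
    """Assumed linear decay: 1.0 at job start, 0.0 at or after decay_time_s."""
    if job_age_s >= decay_time_s:
        return 0.0
    return 1.0 - job_age_s / decay_time_s


def job_factor_sum(job_ages_s, decay_time_s):
    """Sum of the decay factors of all jobs still inside the decay window."""
    return sum(decay_factor(age, decay_time_s) for age in job_ages_s)


# Hypothetical example: three jobs started 0 s, 225 s and 450 s ago.
example_sum = job_factor_sum([0, 225, 450], decay_time_s=450)
print(example_sum)
```

Under this assumption a job older than the decay time contributes nothing, so the sum can never exceed the number of jobs still inside the window.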

I still just don't understand the line.  Nothing seems to add up and I
don't know what any of the numbers are.  Perhaps a missing paren as well.

(0.425000 + 1.0 * 4.730000) / 8   = 5.155000 / 8 = 0.644375
((0.425000 + 1.0) * 4.730000) / 8 = 6.740250 / 8 = 0.842531

?
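
One reading that does reproduce both quoted values is to divide only the job term by nproc: first number + adjustment * job sum / nproc. The first number then looks like the host's own np_load_avg (0.425 * 8 = 3.4, close to the load of about 3.3 seen on josiah). The sketch below only checks that arithmetic. The function and variable names are made up for illustration, and the meaning of each number is inferred from the figures, not confirmed by the Grid Engine documentation.

```python
def adjusted_np_load(reported_np_load, adjustment, job_sum, nproc):
    """Hypothetical helper: reported per-core load plus the job
    adjustment term normalised by the number of processors."""
    return reported_np_load + adjustment * job_sum / nproc


# Values copied from the two qstat -j messages quoted above.
hobbes = adjusted_np_load(0.541250, 1.0, 3.700000, 8)
josiah = adjusted_np_load(0.425000, 1.0, 4.730000, 8)

print(f"hobbes: {hobbes:.6f}")  # qstat reported 1.003750
print(f"josiah: {josiah:.6f}")  # qstat reported 1.016250
```

Both results are at or above the load threshold of 1, which would explain why the queue instances are dropped even though the raw per-core load is well below 1.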

>> - Why are load adjustments used to suspend jobs?  I think that should
>> only use the actual load of the machine.
>
> They aren't. If they were, you could check this with `qstat -f` and
> `qstat -explain A`. The adjusted load is only used to allow or
> disallow the scheduling of jobs. As the adjusted load value should
> reflect the real load after the decay time, the scheduler is looking
> ahead: if this job really reached the estimated load, a job would be
> suspended in the near future. Hence the job isn't scheduled.
>
> So much for the theory, but I also notice some jobs being pushed into
> T state w/o the queue itself being in suspend alarm state A - confusing.
> Even a single job can push itself into suspend state and will resume and
> so on... Can you please file a bug?

I filed a bug here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3248

The issue for me seems to be that adjusted loads are indeed used in
suspend threshold calculations, and that the load adjustment of a job is
not removed when the job completes.

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion at cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247724
