[GE users] selection criteria for automatic job suspension

reuti reuti at staff.uni-marburg.de
Thu Jan 14 12:56:17 GMT 2010



On 14.01.2010 at 13:32, massot wrote:

> On Thu, Jan 14, 2010 at 12:54:20PM +0100, reuti wrote:
>> On 11.01.2010 at 16:57, massot wrote:
>>> From what I understand after reading the source code (the man page
>>> isn't fully informative), when suspend_thresholds is reached on a
>>> host, the job selected for suspension is the one that has been
>>> running for the shortest time, and there is no way to change this.
>>> Did I get that right? If so, I think it would be nice to have an
>>> option to tell the scheduler to select the job with the highest load
>>> instead of the one with the shortest run time.
>> There was a similar discussion about slot-wise suspend on
>> subordination: which job should be suspended? The best option would
>> of course be to let the user decide whether it should be the one with
>> the shortest or the longest runtime (and maybe to correct the h_rt).
> In my case I had a computer with two jobs run by two different people
> who don't work together. So it wouldn't always be relevant to think
> that the "user should decide".

By user I meant us - the admins.
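
To make the idea of an admin-selectable policy concrete, here is a rough sketch in Python. It is not Grid Engine code: the RunningJob type, the pick_job_to_suspend function, the policy names and the per-job load figure are all hypothetical, and Grid Engine does not expose such a switch as far as this thread establishes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningJob:
    # Hypothetical record for illustration; not a Grid Engine structure.
    job_id: int
    runtime_s: float  # seconds since the job started
    load: float       # assumed per-job load contribution


def pick_job_to_suspend(jobs: List[RunningJob],
                        policy: str = "shortest_runtime") -> Optional[RunningJob]:
    """Return the job an admin-chosen policy would suspend, or None if idle."""
    if not jobs:
        return None
    if policy == "shortest_runtime":   # behaviour described in this thread
        return min(jobs, key=lambda j: j.runtime_s)
    if policy == "longest_runtime":    # alternative mentioned in this thread
        return max(jobs, key=lambda j: j.runtime_s)
    if policy == "highest_load":       # the option asked for in this thread
        return max(jobs, key=lambda j: j.load)
    raise ValueError(f"unknown policy: {policy}")


# Made-up example: a short, well-behaved job next to a long, runaway one.
jobs = [RunningJob(1, 120.0, 1.0), RunningJob(2, 7200.0, 9.5)]
print(pick_job_to_suspend(jobs, "shortest_runtime").job_id)
print(pick_job_to_suspend(jobs, "highest_load").job_id)
```

With these made-up numbers the two policies pick different jobs, which is exactly the situation described: the shortest-runtime rule suspends the well-behaved job instead of the runaway one.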


>> What do you mean by "highest load"? Every running process which is
>> eligible to be executed generates a load of 1. Do you mean parallel
>> jobs on a node?
> I had a computer with a load of more than 10 although it had only 4
> cores and 2 non-parallel jobs. For some reason (the user told me it
> could have been because of huge memory usage) one of these was
> consuming far too many resources. After I killed it, the system load
> dropped to about 1.

Yes, when it starts to swap you will get a load which reflects the
paging processes. You could set up hard memory limits whose total
adds up to the physically installed memory:

http://gridengine.info/2009/12/01/adding-memory-requirement-awareness-to-the-scheduler
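
The arithmetic behind this is simple; a small Python sketch, with made-up numbers, shows the budget check. The function names and the 8 GB / 4 slot figures are assumptions for illustration only. The actual Grid Engine configuration (for example a hard limit such as h_vmem) is covered by the article above and is not reproduced here.

```python
from typing import List


def per_slot_limit_mb(physical_mb: int, slots: int) -> int:
    """Even split of physical memory across slots (integer MB, rounded down)."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return physical_mb // slots


def limits_fit(physical_mb: int, job_limits_mb: List[int]) -> bool:
    """True if the hard limits of all jobs together stay within physical RAM."""
    return sum(job_limits_mb) <= physical_mb


# Assumed host: 8 GB of RAM and 4 slots.
physical_mb = 8 * 1024
print(per_slot_limit_mb(physical_mb, 4))
print(limits_fit(physical_mb, [2048, 2048]))
print(limits_fit(physical_mb, [2048, 7168]))
```

As long as the limits of all running jobs add up to no more than the installed memory, a job that "goes mad" hits its own limit before the node starts to swap.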

-- Reuti


>>> What often happens, I guess, is that suspend_thresholds is reached
>>> only when a job "goes mad", so it would make more sense to suspend
>>> that one rather than another one that has been running normally for
>>> a longer time.
>> Did you define more slots than installed cores? Nowadays the load is
>> a little bit misleading, as uninterruptible kernel tasks will also
>> increase the load, although they are only waiting for the disk or the
>> like (state "D"). Maybe the suspend_thresholds feature isn't suited
>> to modern Linux systems at all.
> There were only 4 slots for 4 cores.
> Well, I experienced a case where load was absolutely relevant.
> -- 
> Bernard Massot - Bureau D4 - Département de physique
> École Normale Supérieure
> 24 rue Lhomond - 75005 Paris
> Tél: +33 1 44 32 25 89
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238746
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238749

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


