[GE users] soft wallclock limit exceeded: job not killed

Reuti reuti at staff.uni-marburg.de
Tue Jan 8 18:09:09 GMT 2008


On 08.01.2008, at 17:16, Alois Dirnaichner wrote:

> Reuti wrote:
>> Hi,
>>
>> On 08.01.2008, at 12:04, Alois Dirnaichner wrote:
>>
>>> we forced our users to submit with the s_rt flag to limit their
>>> jobs' runtimes. The maximum for their request is defined in the
>>> queue configuration. Notify time is one hour.
>>> What should happen is this (according to the man pages and the
>>> mailing list): after the job reaches the s_rt limit, it is sent
>>> SIGUSR1, and after one additional hour to clean up it is killed.
>>
>> Why one hour? Did you set this as h_rt? That will not be added on
>> top but is a limit of its own. Or is the SIGUSR1 ignored by the
>> script/program? - Reuti
>
> RESOURCE LIMITS
>      [..] If  s_rt  is  exceeded,  the  job is first
>      "warned" via the SIGUSR1 signal (which can be caught by  the
>      job) and finally aborted after the notification time defined
>      in the queue configuration parameter notify (see above)  has
>      passed.
>
> with notify = 1:00:00.
> So, if the job traps SIGUSR1 (I don't know if the user did so), it is
> sure to be killed after one additional hour?!
> I've not set an h_rt, just s_rt and notify in the queue conf, and the
> user himself is forced to set s_rt (<= the s_rt in the queue conf).

Aha - now I vaguely remember: something was not working there, but we
never used it in the end and I stopped looking into it. If you have
"loglevel log_info" set, you might see every 90 seconds that the s_rt
method is started again and again, although 90 seconds is not defined
anywhere at all. I'm not sure whether there is already an issue filed
for it.
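
And to the cpu vs. wallclock question further down: as far as I know,
the "cpu" value in the qstat usage line is CPU time summed over all
processes of the job, so on a multi-core job it can exceed the elapsed
wallclock. A quick check of the numbers (plain sh arithmetic):

    echo $(( 1296000 / 86400 ))   # the s_rt request in days -> 15
    # e.g. 4 busy cores over only 4 wallclock days already accumulate
    echo $(( 4 * 4 * 86400 ))     # -> 1382400 s of cpu, > 1296000 s

So cpu > 15 days alone does not prove that the wallclock limit was
missed.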

-- Reuti


> Is there any way to find out the actual runtime (up to now) of an
> active job?
> The submission time in qstat is not the starting time, I guess...
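
For a running job the "submit/start at" column of plain qstat shows
the start time (only for pending jobs it is the submission time), so
the elapsed wallclock can be computed from it. A rough sketch - the
job id 4711 is made up, and the date -d syntax assumes GNU date:

    start=$(qstat | awk '$1 == 4711 { print $6, $7 }')
    echo "elapsed: $(( $(date +%s) - $(date -d "$start" +%s) )) s"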
>
>>
>>> Correct me if I'm wrong.
>>>
>>> Nevertheless some jobs manage to escape the procedure:
>>>
>>> qstat:
>>> hard resource_list: h_vmem=4G,s_rt=1296000
>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
>>> vmem=1.917G, maxvmem=1.918G
>>>
>>> 1296000 s = 15 d; hence, the job had flung off its restraints. Is
>>> it possible that the cpu time is greater than the wallclock time?
>>> I don't know how to check the wallclock time with qstat. A quick
>>> check with ARCo unearthed other recent s_rt violations.
>>> What is happening?
>>> Yours,
>>>
>>> Al
>>>
>>>
>>> -- 
>>> Alois Dirnaichner
>>> http://www.theorie.physik.uni-muenchen.de/~al
>>>
>>> Rechnerbetriebsgruppe
>>> Arnold Sommerfeld Center
>>> Theresienstr. 39
>>> 80333 Muenchen

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



