[GE users] soft wallclock limit exceeded: job not killed

Reuti reuti at staff.uni-marburg.de
Wed Jan 9 15:08:02 GMT 2008


Am 09.01.2008 um 15:32 schrieb Alois Dirnaichner:

> Reuti wrote:
>> This was the post I remembered:
>>
>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=7491
>>
>
> In this post you speak of h_rt:
>
> qsub -notify -l h_rt=... will give you SIGUSR2, and the kill command
> is delayed by the notify time.
>
> which is exactly the functionality I want to achieve.
> If it works this way, I suggest you change the manual, and we will
> use h_rt instead of s_rt.
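
(For reference, the h_rt variant I meant would be submitted roughly like
this; the limit value and the script name here are just placeholders:

     qsub -notify -l h_rt=24:00:00 job_script.sh

With -notify, the warning signal is sent first and the kill is delayed
by the notify time, as described above.)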
>
> Besides, where are these logfiles to prove your assumption?
> I searched in sge-root/cell/spool/qmaster/messages and in the
> execd_spool_dir on the exec node.

In SGE's messages file on the node, for us under /var/spool/sge (the
default is $SGE_ROOT/default/spool):

reuti at node41:~> o /var/spool/sge/node41/messages
...
01/08/2008 18:22:45|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:24:16|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:25:47|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:27:18|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:28:49|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:30:20|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:38:58|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:40:29|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:42:00|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:43:31|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:45:02|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
01/08/2008 18:46:33|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
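
If you just want to check for these entries rather than page through
the whole file, something like the following should do (adjust the
spool path and the job id to your installation):

     grep "exceeded soft wallclock" /var/spool/sge/node41/messages
     grep "job 60721.1" /var/spool/sge/node41/messages

As you can see from the timestamps above, the warning is repeated
roughly every 90 seconds.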

-- Reuti

>
>>
>> but obviously I didn't file an issue at that time.
>>
>> -- Reuti
>>
>>
>> Am 08.01.2008 um 17:16 schrieb Alois Dirnaichner:
>>
>>> Reuti wrote:
>>>> Hi,
>>>>
>>>> Am 08.01.2008 um 12:04 schrieb Alois Dirnaichner:
>>>>
>>>>> We forced our users to submit with the s_rt flag to limit their
>>>>> jobs' runtimes.
>>>>> The maximum for their request is defined in the queue configuration.
>>>>> The notify time is one hour.
>>>>> What should happen is this (according to the man pages and the
>>>>> mailing list):
>>>>> After the job reaches the s_rt limit, it is sent SIGUSR1, and after
>>>>> one additional hour to feather the nest it is killed.
>>>>
>>>> Why one hour? Did you set this as h_rt? That would not be added on
>>>> top but is a limit on its own. Is SIGUSR1 ignored by the
>>>> script/program? - Reuti
>>>
>>> RESOURCE LIMITS
>>>      [..] If  s_rt  is  exceeded,  the  job is first
>>>      "warned" via the SIGUSR1 signal (which can be caught by  the
>>>      job) and finally aborted after the notification time defined
>>>      in the queue configuration parameter notify (see above)  has
>>>      passed.
>>>
>>> With notify = 1:00:00: so, if the job traps SIGUSR1 (I don't know
>>> whether the user did so), it is still sure to be killed after one
>>> additional hour?!
>>> I have not set an h_rt, just s_rt and notify in the queue
>>> configuration, and the user himself is forced to set s_rt (<= the
>>> s_rt in the queue configuration).
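
A minimal sketch of what such a job script could look like (assuming a
bash script; the s_rt value is only an example, not your users' actual
limit):

     #!/bin/bash
     #$ -l s_rt=0:30:0

     # catch the s_rt warning; per the man page the job should still be
     # killed once the queue's notify time has passed
     trap 'echo "caught SIGUSR1: s_rt exceeded, cleaning up" >&2' USR1

     # placeholder workload, run in the background so the trap is
     # handled as soon as the signal arrives
     sleep 3600 &
     wait $!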
>>
>> Aha - now I vaguely remember: something was not working there, but
>> we never used it in the end and I stopped looking into it. If you have
>> "loglevel log_info" set, you might see every 90 seconds that the s_rt
>> notify method is started again and again, although 90 seconds is not
>> defined anywhere in the configuration. I'm not sure whether there is
>> already an issue filed for it.
>>
>> -- Reuti
>>
>>
>>> Is there any way to find out the actual runtime (so far) of an
>>> active job?
>>> The submission time in qstat is not the start time, I guess...
>>>
>>>>
>>>>> Correct me if I'm wrong.
>>>>>
>>>>> Nevertheless, some jobs manage to escape the procedure:
>>>>>
>>>>> qstat:
>>>>> hard resource_list: h_vmem=4G,s_rt=1296000
>>>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000, vmem=1.917G, maxvmem=1.918G
>>>>>
>>>>> 1296000 s = 15 d, hence the job had flung off its restraints. Is it
>>>>> possible that the CPU time is greater than the wallclock time?
>>>>> I don't know how to check the wallclock time with qstat. A quick
>>>>> check with ARCo unearthed other recent s_rt violations.
>>>>> What is happening?
>>>>> Yours,
>>>>>
>>>>> Al
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
> -- 
>
> Alois Dirnaichner
> http://www.theorie.physik.uni-muenchen.de/~al
>
> Rechnerbetriebsgruppe
> Arnold Sommerfeld Center
> Theresienstr. 39
> 80333 Muenchen
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



