[GE users] soft wallclock limit exceeded: job not killed

Reuti reuti at staff.uni-marburg.de
Wed Jan 9 18:46:34 GMT 2008


Am 09.01.2008 um 17:36 schrieb Alois Dirnaichner:

> Reuti wrote:
>> Am 09.01.2008 um 15:32 schrieb Alois Dirnaichner:
>>
>>> Reuti wrote:
>>>> This was the post I remembered:
>>>>
>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=7491
>>>>
>>>>
>>>
>>> In this post you speak of h_rt:
>>>
>>> qsub -notify -l h_rt=... and will give you USR2 and the kill command
>>> is delayed by notify-time.
>>>
>>>
>>>
>>> which is exactly the functionality I want to achieve.
>>> If it works this way, I suggest changing the manual, and we will
>>> use h_rt instead of s_rt.
>>>
>>> Besides, where are the logfiles that prove your assumption?
>>> I searched in sge-root/cell/spool/qmaster/messages and in the
>>> execd_spool_dir on the exec node.
>>
>> In SGE's messages file of the node, for us in /var/spool/sge (default
>> is $SGE_ROOT/default/spool):
>>
>> reuti at node41:~> o /var/spool/sge/node41/messages
>> ...
>> 01/08/2008 18:22:45|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:24:16|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:25:47|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:27:18|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:28:49|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:30:20|execd|node41|W|job 60721.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:38:58|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:40:29|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:42:00|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:43:31|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:45:02|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>> 01/08/2008 18:46:33|execd|node41|W|job 60723.1 exceeded soft wallclock time - initiate soft notify method
>>
>> -- Reuti
>
> Ok, you are right. But h_rt kills the job immediately, no matter whether
> one catches USR2 or not...
> The kill command does not seem to be delayed by the notification time.
> I'd love to give our users a chance to dump whatever they need ahead of
> the upcoming annihilation.
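
For the job side, here is a minimal sketch of catching the warning and dumping state before the kill. It assumes the job is a Python program submitted with "qsub -notify -l h_rt=...", that its work can be split into steps shorter than the notify time, and that SIGUSR1 (s_rt) or SIGUSR2 (h_rt with -notify) is the warning, as described in this thread. The names WarningFlag, install_warning_handlers and run are made up for illustration, not part of SGE.

```python
import os
import signal


class WarningFlag:
    """Remembers which warning signal, if any, has arrived."""

    def __init__(self):
        self.signum = None


def install_warning_handlers(flag, signums=(signal.SIGUSR1, signal.SIGUSR2)):
    """Record SIGUSR1/SIGUSR2 instead of letting them terminate the process."""
    def handler(signum, frame):
        flag.signum = signum  # keep the handler tiny; the dump happens in run()
    for signum in signums:
        signal.signal(signum, handler)


def run(steps, flag, checkpoint):
    """Run steps in order; after a warning, call checkpoint(done) and stop."""
    done = 0
    for step in steps:
        step()
        done += 1
        if flag.signum is not None:
            checkpoint(done)  # hypothetical callback that writes the dump
            break
    return done
```

The handler only sets a flag, so the dump itself runs in normal program flow between two steps. A single step that runs longer than the notify time would still be cut off by the final SIGKILL.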

Did you also request -notify? For me it seems to work in 6.0u9:

On the node (well, the second message is superfluous - it comes again
after 90 seconds - so something is wrong there as well):

01/09/2008 19:32:49|execd|node44|W|job 61267.1 exceeded hard wallclock time - initiate terminate method
01/09/2008 19:34:20|execd|node44|W|job 61267.1 exceeded hard wallclock time - initiate terminate method
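
The repeat interval can be read off the timestamps directly. This is a small sketch, assuming the messages file uses the "date time|daemon|host|level|text" layout seen above with month/day/year dates; parse_line and gaps_seconds are illustrative names.

```python
from datetime import datetime


def parse_line(line):
    """Split one messages-file line into (timestamp, daemon, host, level, text)."""
    stamp, daemon, host, level, text = line.rstrip("\n").split("|", 4)
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), daemon, host, level, text


def gaps_seconds(lines, needle):
    """Seconds between consecutive lines whose text contains needle."""
    times = [parse_line(line)[0] for line in lines if needle in line]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


lines = [
    "01/09/2008 19:32:49|execd|node44|W|job 61267.1 exceeded hard wallclock time - initiate terminate method",
    "01/09/2008 19:34:20|execd|node44|W|job 61267.1 exceeded hard wallclock time - initiate terminate method",
]
print(gaps_seconds(lines, "job 61267.1 exceeded hard wallclock"))  # [91]
```

The gap comes out as 91 seconds, which is also the spacing of the soft wallclock messages on node41.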

On SGE master:

01/09/2008 19:34:49|qmaster|master|W|job 61267.1 failed on host node44 assumedly after job because: job 61267.1 died through signal KILL (9)

qsub_time    Wed Jan  9 19:28:44 2008
start_time   Wed Jan  9 19:28:48 2008
end_time     Wed Jan  9 19:34:49 2008

h_rt=240
notify=0:2:0
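
As a cross-check of these numbers, a short sketch that adds h_rt and the notify time to start_time. It assumes the kill is due at start + h_rt + notify, which is what the times above suggest rather than something taken from the SGE sources; parse_duration and expected_timeline are illustrative names.

```python
from datetime import datetime, timedelta


def parse_duration(text):
    """Accept plain seconds ("240") or h:m:s ("0:2:0") as written above."""
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def expected_timeline(start, h_rt, notify):
    """Return (time the hard limit is reached, time the kill is due)."""
    limit_reached = start + parse_duration(h_rt)
    return limit_reached, limit_reached + parse_duration(notify)


start = datetime(2008, 1, 9, 19, 28, 48)  # start_time from the accounting
end = datetime(2008, 1, 9, 19, 34, 49)    # end_time from the accounting
limit_reached, kill_due = expected_timeline(start, "240", "0:2:0")
print(limit_reached.time(), kill_due.time(), (end - kill_due).total_seconds())
```

This gives 19:32:48 for the hard limit and 19:34:48 for the kill, each one second before the logged 19:32:49 and 19:34:49, so the kill was delayed by the two minutes of notify time.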

-- Reuti

>>
>>>
>>>>
>>>> but obviously I didn't file an issue at that time.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>> Am 08.01.2008 um 17:16 schrieb Alois Dirnaichner:
>>>>
>>>>> Reuti wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Am 08.01.2008 um 12:04 schrieb Alois Dirnaichner:
>>>>>>
>>>>>>> we forced our users to submit with the s_rt flag to limit their
>>>>>>> jobs' runtimes.
>>>>>>> The maximum for their request is defined in the queue configuration.
>>>>>>> Notify time is one hour.
>>>>>>> What should happen is this (according to man pages and mailing list):
>>>>>>> After the job reaches the s_rt limit, it is sent SIGUSR1, and after
>>>>>>> one additional hour to feather the nest it is killed.
>>>>>>
>>>>>> why one hour? Did you set this as h_rt? That will not be added, it
>>>>>> is a limit of its own. Is the SIGUSR1 ignored by the
>>>>>> script/program? - Reuti
>>>>>
>>>>> RESOURCE LIMITS
>>>>>      [..] If  s_rt  is  exceeded,  the  job is first
>>>>>      "warned" via the SIGUSR1 signal (which can be caught by  the
>>>>>      job) and finally aborted after the notification time defined
>>>>>      in the queue configuration parameter notify (see above)  has
>>>>>      passed.
>>>>>
>>>>> with notify = 1:00:00
>>>>> so, if the job traps SIGUSR1 (I don't know if the user did so), it
>>>>> is sure to be killed after one additional hour?!
>>>>> I've not set an h_rt, just s_rt and notify in the queue conf, and
>>>>> the user himself is forced to set s_rt (<= s_rt in the queue conf).
>>>>
>>>> Aha - now I vaguely remember: something was not working there, but
>>>> we never used it in the end and I stopped looking into it. If you
>>>> have "loglevel log_info" you might see every 90 seconds that the s_rt
>>>> method is started again and again, although nothing with 90 seconds
>>>> is defined anywhere. I'm not sure whether there is already an issue
>>>> for it.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Is there any way to find out the actual runtime of an active job
>>>>> (up to now)?
>>>>> The submission time in qstat is not the starting time, I guess...
>>>>>
>>>>>>
>>>>>>> Correct me if I'm wrong.
>>>>>>>
>>>>>>> Nevertheless some jobs manage to escape the procedure:
>>>>>>>
>>>>>>> qstat:
>>>>>>> hard resource_list: h_vmem=4G,s_rt=1296000
>>>>>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
>>>>>>> vmem=1.917G, maxvmem=1.918G
>>>>>>>
>>>>>>> 1296000s = 15d, hence the job has thrown off its restraints. Is it
>>>>>>> possible that the cpu time is greater than the wallclock time?
>>>>>>> I don't know how to check the wallclock time with qstat. A quick
>>>>>>> check with ARCo unearthed other recent s_rt violations.
>>>>>>> What is happening?
>>>>>>> Yours,
>>>>>>>
>>>>>>> Al
>>>>>>>
>>>>>>>
>>>>>>> --Alois Dirnaichner
>>>>>>> http://www.theorie.physik.uni-muenchen.de/~al
>>>>>>>
>>>>>>> Rechnerbetriebsgruppe
>>>>>>> Arnold Sommerfeld Center
>>>>>>> Theresienstr. 39
>>>>>>> 80333 Muenchen
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
