[GE users] soft wallclock limit exceeded: job not killed

Alois Dirnaichner Alois.Dirnaichner at physik.lmu.de
Wed Jan 9 16:36:12 GMT 2008



Reuti wrote:
> On 09.01.2008 at 15:32, Alois Dirnaichner wrote:
>
>> Reuti wrote:
>>> This was the post I remembered:
>>>
>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=7491
>>>
>>>
>>
>> In this post you speak of h_rt:
>>
>> qsub -notify -l h_rt=... will give you USR2, and the kill command
>> is delayed by notify-time.
>>
>>
>>
>> which is exactly the functionality I want to achieve.
>> If it works this way, I suggest changing the manual so that we use h_rt
>> instead of s_rt.
>>
>> Besides, where are these log files to prove your assumption?
>> I searched in sge-root/cell/spool/qmaster/messages and in the
>> execd_spool_dir on the exec node.
>
> In SGE's messages file of the node, for us in /var/spool/sge (default
> is $SGE_ROOT/default/spool):
>
> reuti at node41:~> o /var/spool/sge/node41/messages
> ...
> 01/08/2008 18:22:45|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:24:16|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:25:47|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:27:18|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:28:49|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:30:20|execd|node41|W|job 60721.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:38:58|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:40:29|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:42:00|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:43:31|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:45:02|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
> 01/08/2008 18:46:33|execd|node41|W|job 60723.1 exceeded soft wallclock
> time - initiate soft notify method
>
> -- Reuti
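For what it's worth, the gaps between those repeated messages can be checked from the timestamps quoted above; they come out at 91 seconds each, which fits the "every 90 seconds" repetition mentioned elsewhere in this thread (a throwaway shell snippet of mine, nothing SGE-specific):

```shell
# Timestamps of the consecutive "soft notify" messages for job 60721.1,
# copied from the log excerpt above; print the gap between each pair.
prev=""
for t in 18:22:45 18:24:16 18:25:47 18:27:18 18:28:49 18:30:20; do
    s=$(echo "$t" | awk -F: '{print $1*3600 + $2*60 + $3}')
    [ -n "$prev" ] && echo "gap: $((s - prev)) s"
    prev=$s
done
# each gap prints as "gap: 91 s"
```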

Ok, you are right. But h_rt kills the job immediately, no matter whether
one catches USR2 or not...
The kill does not seem to be delayed by the notification time.
I'd love to give our users a chance to dump whatever they need in view of
the upcoming annihilation.
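What I have in mind is something along these lines (a sketch of mine, not a tested recipe: the dump step, the file name, and the h_rt value are placeholders, and it assumes the behaviour from the post quoted above, i.e. USR2 first and the kill delayed by the notify time):

```shell
#!/bin/sh
#$ -notify
#$ -l h_rt=24:00:00   # placeholder limit

# With -notify, the execd should send SIGUSR2 first and delay the final
# SIGKILL by the queue's notify time, so trap USR2 and dump state there.
dump_state() {
    echo "caught SIGUSR2, dumping state" > checkpoint.dump
}
trap dump_state USR2

# Placeholder for the real payload.  So the sketch can be tried without
# a running SGE, it sends itself the signal after one second.
( sleep 1; kill -USR2 $$ ) &
for i in 1 2 3 4 5; do
    sleep 1
done
wait
```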
>
>>
>>>
>>> but obviously I didn't file an issue at that time.
>>>
>>> -- Reuti
>>>
>>>
>>> On 08.01.2008 at 17:16, Alois Dirnaichner wrote:
>>>
>>>> Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> On 08.01.2008 at 12:04, Alois Dirnaichner wrote:
>>>>>
>>>>>> we forced our users to submit with the s_rt flag to limit their
>>>>>> jobs' runtimes.
>>>>>> The maximum for their request is defined in the queue configuration.
>>>>>> Notify time is one hour.
>>>>>> What should happen is this (according to man pages and mailing
>>>>>> list):
>>>>>> After the job reaches the s_rt limit, it is sent SIGUSR1 and, after
>>>>>> one additional hour to feather the nest, it is killed.
>>>>>
>>>>> why one hour? You set this as h_rt - this will not be added, but is a
>>>>> limit on its own. Is SIGUSR1 ignored by the script/program? -
>>>>> Reuti
>>>>
>>>> RESOURCE LIMITS
>>>>      [..] If  s_rt  is  exceeded,  the  job is first
>>>>      "warned" via the SIGUSR1 signal (which can be caught by  the
>>>>      job) and finally aborted after the notification time defined
>>>>      in the queue configuration parameter notify (see above)  has
>>>>      passed.
>>>>
>>>> with notify = 1:00:00
>>>> so, if the job traps SIGUSR1 (I don't know if the user did so), it is
>>>> sure to be killed after one additional hour?!
>>>> I've not set a h_rt, just s_rt and notify in queue conf and the user
>>>> himself is forced to set s_rt (<= s_rt in queue conf).
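The setup described above would correspond to queue configuration values along these lines (an illustrative fragment of mine, not taken from the actual cluster - the real values are what qconf -sq shows; 360:00:00 equals the 1296000 s request quoted elsewhere in this thread):

```
notify                01:00:00
s_rt                  360:00:00
h_rt                  INFINITY
```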
>>>
>>> Aha - now I remember slightly: something was not working there - but
>>> we never used it in the end and I stopped looking into it. If you have
>>> "loglevel log_info" you might see every 90 seconds that the s_rt
>>> method is started again and again.. Although there is nowhere
>>> something with 90 seconds defined at all. I'm not sure, whether there
>>> is already an issue for it.
>>>
>>> -- Reuti
>>>
>>>
>>>> Is there any way to find out the actual runtime of an active job (up
>>>> to now)?
>>>> The submission time in qstat is not the starting time, I guess...
>>>>
>>>>>
>>>>>> Correct me if I'm wrong.
>>>>>>
>>>>>> Nevertheless some jobs manage to escape the procedure:
>>>>>>
>>>>>> qstat:
>>>>>> hard resource_list: h_vmem=4G,s_rt=1296000
>>>>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
>>>>>> vmem=1.917G, maxvmem=1.918G
>>>>>>
>>>>>> 1296000 s = 15 d; hence, the job had flung off its restraints. Is it
>>>>>> possible that the cpu time is greater than the wallclock time?
>>>>>> I don't know how to check the wallclock time with qstat. A quick
>>>>>> check with ARCo unearthed other recent s_rt violations.
>>>>>> What is happening?
>>>>>> Yours,
>>>>>>
>>>>>> Al
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>


-- 

Alois Dirnaichner
http://www.theorie.physik.uni-muenchen.de/~al

Rechnerbetriebsgruppe
Arnold Sommerfeld Center
Theresienstr. 39
80333 Muenchen

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



