[GE users] soft wallclock limit exceeded: job not killed

Alois Dirnaichner Alois.Dirnaichner at physik.lmu.de
Wed Jan 9 14:32:44 GMT 2008



Reuti wrote:
> This was the post I remembered:
>
> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=7491
>

In this post you speak of h_rt:

qsub -notify -l h_rt=... and will give you USR2 and the kill command is delayed by notify-time.



which is exactly the functionality I want to achieve.
If it works this way, I suggest changing the manual, and we will use h_rt
instead of s_rt.
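
A job script relying on this might look roughly like the sketch below. It
assumes the signal behaviour described in the quoted post (USR2 before the
kill when submitting with -notify) and in the man page (USR1 for s_rt);
STATE_FILE and save_state are made-up names for illustration, not part of
Grid Engine.

```shell
#!/bin/sh
# Hypothetical job script fragment, submitted with something like
#   qsub -notify -l h_rt=... job.sh
# STATE_FILE and save_state are illustrative names only.

STATE_FILE="${STATE_FILE:-./job.state}"
warned=0

save_state() {
    # Placeholder: a real job would write its checkpoint data here.
    printf 'warned at %s\n' "$(date)" > "$STATE_FILE"
}

on_warning() {
    # Record that the warning arrived and save what can be saved.
    warned=1
    save_state
}

# Catch both signals: USR2 is what the quoted post names for h_rt with
# -notify, USR1 is what the man page names for s_rt.
trap on_warning USR1 USR2
```

The main loop of the job would then check "$warned" between work steps and
exit cleanly once it is set, well within the notify time.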

Besides, where are the log files that would prove your assumption?
I searched in sge-root/cell/spool/qmaster/messages and in the
execd_spool_dir on the exec node.
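
The search itself can be sketched as a small helper like the one below. The
function name search_sge_logs is hypothetical, and the search string "s_rt"
is only a guess, since the exact wording of the log entries is not known
here. It also assumes SGE_ROOT and SGE_CELL are set in the environment.

```shell
#!/bin/sh
# search_sge_logs PATTERN FILE...
# Print every line matching PATTERN (case-insensitive) with file name and
# line number. The helper name and the pattern used below are assumptions.
search_sge_logs() {
    pattern=$1
    shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            # The extra /dev/null argument makes grep print the file name.
            grep -n -i -- "$pattern" "$f" /dev/null || true
        else
            echo "skipping unreadable file: $f" >&2
        fi
    done
}

# Example call (paths as described in the text, pattern is a guess):
#   search_sge_logs 's_rt' "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
```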


>
> but obviously I didn't file an issue at that time.
>
> -- Reuti
>
>
> Am 08.01.2008 um 17:16 schrieb Alois Dirnaichner:
>
>> Reuti wrote:
>>> Hi,
>>>
>>> Am 08.01.2008 um 12:04 schrieb Alois Dirnaichner:
>>>
>>>> We forced our users to submit with the s_rt flag to limit their jobs'
>>>> runtimes.
>>>> The maximum for their request is defined in the queue configuration.
>>>> Notify time is one hour.
>>>> What should happen is this (according to man pages and mailing list):
>>>> After the job reaches the s_rt limit, it is sent SIGUSR1, and after
>>>> one additional hour to clean up it is killed.
>>>
>>> Why one hour? You set this as h_rt? This will not be added but is a
>>> limit on its own. Is the SIGUSR1 ignored by the script/program?
>>> - Reuti
>>
>> RESOURCE LIMITS
>>      [..] If  s_rt  is  exceeded,  the  job is first
>>      "warned" via the SIGUSR1 signal (which can be caught by  the
>>      job) and finally aborted after the notification time defined
>>      in the queue configuration parameter notify (see above)  has
>>      passed.
>>
>> with notify = 1:00:00
>> So, if the job traps SIGUSR1 (I don't know whether the user did so), it
>> is sure to be killed after one additional hour?!
>> I have not set an h_rt, just s_rt and notify in the queue conf, and the
>> user himself is forced to set s_rt (<= s_rt in the queue conf).
>
> Aha - now I vaguely remember: something was not working there, but
> we never used it in the end and I stopped looking into it. If you have
> "loglevel log_info" you might see that the s_rt method is started again
> and again every 90 seconds, although nothing with 90 seconds is defined
> anywhere. I'm not sure whether there is already an issue for it.
>
> -- Reuti
>
>
>> Is there any way to find out the actual runtime of an active job (up to
>> now)?
>> The submission time in qstat is not the starting time, I guess...
>>
>>>
>>>> Correct me if I'm wrong.
>>>>
>>>> Nevertheless some jobs manage to escape the procedure:
>>>>
>>>> qstat:
>>>> hard resource_list: h_vmem=4G,s_rt=1296000
>>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
>>>> vmem=1.917G, maxvmem=1.918G
>>>>
>>>> 1296000 s = 15 d, hence the job has slipped its restraints. Is it
>>>> possible that the cpu time is greater than the wallclock time?
>>>> I don't know how to check the wallclock time with qstat. A quick check
>>>> with ARCo unearthed other recent s_rt violations.
>>>> What is happening?
>>>> Yours,
>>>>
>>>> Al
>>>>
>>>>
>>>> --Alois Dirnaichner
>>>> http://www.theorie.physik.uni-muenchen.de/~al
>>>>
>>>> Rechnerbetriebsgruppe
>>>> Arnold Sommerfeld Center
>>>> Theresienstr. 39
>>>> 80333 Muenchen
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

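To put numbers on the qstat output quoted above, the d:h:m:s cpu figure can
be converted to seconds and compared with the s_rt request. This is only a
sketch; the helper name dhms_to_seconds is made up, and it compares cpu time,
not wallclock time, so it does not by itself settle whether s_rt was
exceeded.

```shell
#!/bin/sh
# Convert a d:h:m:s usage string, as printed by qstat, to seconds.
# dhms_to_seconds is an illustrative helper, not a Grid Engine tool.
dhms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print (($1 * 24 + $2) * 60 + $3) * 60 + $4 }'
}

s_rt=1296000                                   # from the resource_list line
cpu_seconds=$(dhms_to_seconds "15:15:03:33")   # from the usage line

echo "s_rt limit : $s_rt s = $((s_rt / 86400)) d"
echo "cpu usage  : $cpu_seconds s"
echo "difference : $((cpu_seconds - s_rt)) s"
```

With the values from the quoted output, the cpu figure comes to 1350213 s,
which is 54213 s (a little over 15 hours) above the s_rt request.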

-- 

Alois Dirnaichner
http://www.theorie.physik.uni-muenchen.de/~al

Rechnerbetriebsgruppe
Arnold Sommerfeld Center
Theresienstr. 39
80333 Muenchen
