[GE users] soft wallclock limit exceeded: job not killed

Rayson Ho rayrayson at gmail.com
Wed Jan 9 17:50:50 GMT 2008


I guess you can use a custom "terminate_method", in which you can add
some delays before sending the real kill signal.
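
A minimal sketch of what such a method could look like, in Python. The
script, the function name terminate_with_grace and the choice of SIGUSR2 as
the warning signal are illustrative assumptions, not something Grid Engine
ships. It warns the process, waits for a grace period, and only then sends
SIGKILL:

```python
import os
import signal
import sys
import time


def is_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_with_grace(pid, grace_seconds, warn_signal=signal.SIGUSR2, poll=0.1):
    """Warn the process, wait up to grace_seconds, then send SIGKILL.

    Returns "exited" if the process was gone before the hard kill was
    needed, and "killed" if SIGKILL had to be sent.
    """
    try:
        os.kill(pid, warn_signal)
    except ProcessLookupError:
        return "exited"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return "exited"
        time.sleep(poll)

    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "exited"
    return "killed"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: terminate_with_grace.py <job_pid> <grace_seconds>
    print(terminate_with_grace(int(sys.argv[1]), float(sys.argv[2])))
```

In the queue configuration this might be wired up as something like
"terminate_method /path/to/terminate_with_grace.py $job_pid 3600", where the
path is a placeholder; please check queue_conf(5) for the variables your
version expands. The sketch signals a single pid only, so a job that forks
helpers would probably need the signal sent to its process group instead.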

Rayson


On Jan 9, 2008 11:36 AM, Alois Dirnaichner
<Alois.Dirnaichner at physik.lmu.de> wrote:
> Ok, you are right. But h_rt kills the job immediately, whether or not one
> catches USR2...
> The kill does not seem to be delayed by the notification time.
> I'd love to give our users a chance to dump whatever they need before
> the upcoming annihilation.
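
On the job side, using such a warning could look roughly like the following
Python sketch. The class name GracefulJob and the work/checkpoint callbacks
are hypothetical; the only assumption about Grid Engine is that the job
receives SIGUSR1 or SIGUSR2 some time before the real kill:

```python
import signal


class GracefulJob:
    """Run a loop of work steps and stop cleanly when a warning signal arrives."""

    def __init__(self):
        self.stop_requested = False

    def _on_warning(self, signum, frame):
        # Keep the handler tiny: only record that a warning was received.
        self.stop_requested = True

    def run(self, steps, work, checkpoint):
        """Call work(i) for each step; checkpoint(done) is called once at the end.

        If SIGUSR1 or SIGUSR2 arrives, the loop stops before the next step,
        so the checkpoint is written while the job is still alive.
        """
        for sig in (signal.SIGUSR1, signal.SIGUSR2):
            signal.signal(sig, self._on_warning)

        done = 0
        for i in range(steps):
            if self.stop_requested:
                break
            work(i)
            done += 1

        checkpoint(done)  # dump state, whether we finished or were warned
        return done
```

The flag is checked only between steps, so each step has to be short
compared with the grace period for the dump to finish in time.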
>
> >
> >>
> >>>
> >>> but obviously I didn't file an issue at that time.
> >>>
> >>> -- Reuti
> >>>
> >>>
> >>> On 08.01.2008 at 17:16, Alois Dirnaichner wrote:
> >>>
> >>>> Reuti wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 08.01.2008 at 12:04, Alois Dirnaichner wrote:
> >>>>>
> >>>>>> we forced our users to submit with the s_rt flag to limit their
> >>>>>> jobs' runtimes.
> >>>>>> The maximum for their request is defined in the queue configuration.
> >>>>>> Notify time is one hour.
> >>>>>> What should happen is this (according to man pages and mailing
> >>>>>> list):
> >>>>>> After the job reaches s_rt limit, it is sent SIGUSR1 and after one
> >>>>>> additional hour to feather the nest it is killed.
> >>>>>
> >>>>> why one hour? Did you set this as h_rt? That will not be added; it is a
> >>>>> limit on its own. Is the SIGUSR1 ignored by the script/program? -
> >>>>> Reuti
> >>>>
> >>>> RESOURCE LIMITS
> >>>>      [..] If  s_rt  is  exceeded,  the  job is first
> >>>>      "warned" via the SIGUSR1 signal (which can be caught by  the
> >>>>      job) and finally aborted after the notification time defined
> >>>>      in the queue configuration parameter notify (see above)  has
> >>>>      passed.
> >>>>
> >>>> with notify = 1:00:00
> >>>> so, if the job traps SIGUSR1 (I don't know if the user did so), it is
> >>>> sure to be killed after one additional hour?!
> >>>> I've not set an h_rt, just s_rt and notify in the queue conf, and the
> >>>> user himself is forced to set s_rt (<= s_rt in queue conf).
> >>>
> >>> Aha - now I vaguely remember: something was not working there - but
> >>> we never used it in the end and I stopped looking into it. If you have
> >>> "loglevel log_info" you might see that the s_rt method is started again
> >>> and again every 90 seconds, although 90 seconds is not defined anywhere
> >>> at all. I'm not sure whether there is already an issue for it.
> >>>
> >>> -- Reuti
> >>>
> >>>
> >>>> Is there any way to find out the actual runtime of an active job
> >>>> (up to now)?
> >>>> The submission time in qstat is not the starting time, I guess...
> >>>>
> >>>>>
> >>>>>> Correct me if I'm wrong.
> >>>>>>
> >>>>>> Nevertheless some jobs manage to escape the procedure:
> >>>>>>
> >>>>>> qstat:
> >>>>>> hard resource_list: h_vmem=4G,s_rt=1296000
> >>>>>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
> >>>>>> vmem=1.917G, maxvmem=1.918G
> >>>>>>
> >>>>>> 1296000 s = 15 d; hence, the job has flung off its restraints. Is it
> >>>>>> possible that the cpu time is greater than the wallclock time?
> >>>>>> I don't know how to check the wallclock time with qstat. A quick
> >>>>>> check with ARCo unearthed other recent s_rt violations.
> >>>>>> What is happening?
> >>>>>> Yours,
> >>>>>>
> >>>>>> Al
> >>>>>>
> >>>>>>
> >>>>>> --Alois Dirnaichner
> >>>>>> http://www.theorie.physik.uni-muenchen.de/~al
> >>>>>>
> >>>>>> Rechnerbetriebsgruppe
> >>>>>> Arnold Sommerfeld Center
> >>>>>> Theresienstr. 39
> >>>>>> 80333 Muenchen
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>>
> >>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
