[GE users] soft wallclock limit exceeded: job not killed

Alois Dirnaichner Alois.Dirnaichner at physik.lmu.de
Tue Jan 8 16:16:04 GMT 2008



Reuti wrote:
> Hi,
>
> Am 08.01.2008 um 12:04 schrieb Alois Dirnaichner:
>
>> We forced our users to submit with the s_rt flag to limit their jobs'
>> runtimes.
>> The maximum for their request is defined in the queue configuration.
>> Notify time is one hour.
>> What should happen is this (according to man pages and mailing list):
>> After the job reaches the s_rt limit, it is sent SIGUSR1, and after one
>> additional hour to wrap things up it is killed.
>
> Why one hour? Did you set this as h_rt? That will not be added but is a
> limit on its own. Is the SIGUSR1 ignored by the script/program? - Reuti

RESOURCE LIMITS
     [..] If  s_rt  is  exceeded,  the  job is first
     "warned" via the SIGUSR1 signal (which can be caught by  the
     job) and finally aborted after the notification time defined
     in the queue configuration parameter notify (see above)  has
     passed.

With notify = 1:00:00.
So, if the job traps SIGUSR1 (I don't know whether the user did so), it is
sure to be killed after one additional hour?!
I have not set an h_rt, just s_rt and notify in the queue configuration, and
the user is forced to set s_rt himself (<= s_rt in the queue configuration).

Is there any way to find out the actual runtime of an active job (up to
now)?
The submission time in qstat is not the start time, I guess...
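
To make the comparison of the figures quoted below explicit, here is a small
sketch of the arithmetic. It assumes the cpu field in the usage line reads as
dd:hh:mm:ss; usage_to_seconds is a made-up helper, not a Grid Engine tool.
Keep in mind that cpu is accumulated CPU time, which can exceed the wallclock
time for a multi-threaded or multi-process job, so this alone does not prove
an s_rt violation.

```python
def usage_to_seconds(value):
    """Convert a '[[dd:]hh:]mm:ss' string (assumed format) to seconds."""
    parts = [int(p) for p in value.split(":")]
    # Pad on the left so we always have [days, hours, minutes, seconds].
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


s_rt = 1296000                           # requested soft limit in seconds
cpu = usage_to_seconds("15:15:03:33")    # cpu value from the usage line

print(s_rt / 86400)   # 15.0 days
print(cpu)            # 1350213 seconds
print(cpu - s_rt)     # 54213 seconds (15:03:33) above the s_rt value
```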

>
>> Correct me if I'm wrong.
>>
>> Nevertheless some jobs manage to escape the procedure:
>>
>> qstat:
>> hard resource_list: h_vmem=4G,s_rt=1296000
>> usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
>> vmem=1.917G, maxvmem=1.918G
>>
>> 1296000s = 15d, hence the job has thrown off its restraints. Is it
>> possible that the cpu time is greater than the wallclock time?
>> I don't know how to check the wallclock time with qstat. A quick check
>> with ARCo unearthed other recent s_rt violations.
>> What is happening?
>> Yours,
>>
>> Al
>>
>>
>> -- 
>> Alois Dirnaichner
>> http://www.theorie.physik.uni-muenchen.de/~al
>>
>> Rechnerbetriebsgruppe
>> Arnold Sommerfeld Center
>> Theresienstr. 39
>> 80333 Muenchen
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net


-- 

Alois Dirnaichner
http://www.theorie.physik.uni-muenchen.de/~al

Rechnerbetriebsgruppe
Arnold Sommerfeld Center
Theresienstr. 39
80333 Muenchen




