[GE users] soft wallclock limit exceeded: job not killed

Alois Dirnaichner Alois.Dirnaichner at physik.lmu.de
Tue Jan 8 11:04:44 GMT 2008



Hello,

We force our users to submit with an s_rt request to limit their jobs'
runtimes.
The maximum for their request is defined in the queue configuration.
The notify time is one hour.
What should happen is this (according to the man pages and the mailing list):
after the job reaches its s_rt limit, it is sent SIGUSR1, and after one
additional hour to clean up it is killed.
Correct me if I'm wrong.
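
To make the expected behaviour concrete, here is a minimal Python sketch of
how a job could use that warning. It assumes the job really does receive
SIGUSR1 at the s_rt limit; the names state, on_soft_limit, run, do_step and
checkpoint are made up for illustration and are not part of Grid Engine.

```python
import signal

# Flipped to True once the soft-limit warning signal has been seen.
state = {"soft_limit_hit": False}


def on_soft_limit(signum, frame):
    """Only record the warning; the work loop reacts to it."""
    state["soft_limit_hit"] = True


def install_handler():
    # Assumption: SIGUSR1 is what the job gets when s_rt is exceeded.
    signal.signal(signal.SIGUSR1, on_soft_limit)


def run(steps, do_step, checkpoint):
    """Run up to `steps` work units, stopping early after the warning.

    Returns the number of units that were completed.
    """
    done = 0
    for i in range(steps):
        if state["soft_limit_hit"]:
            # Save progress and leave before the job gets killed.
            checkpoint(done)
            break
        do_step(i)
        done += 1
    return done
```

A job script would call install_handler() once at startup and then do its
work inside run(), so that the remaining hour is spent saving results.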

Nevertheless, some jobs manage to escape the procedure:

qstat:
hard resource_list: h_vmem=4G,s_rt=1296000
usage 34: cpu=15:15:03:33, mem=1556363.63071 GBs, io=0.00000,
vmem=1.917G, maxvmem=1.918G

1296000 s = 15 d, so the job has thrown off its restraints. Is it
possible that the cpu time is greater than the wallclock time?
I don't know how to check the wallclock time with qstat. A quick check
with ARCo turned up other recent s_rt violations.
What is happening?
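
For reference, the arithmetic behind that comparison as a small Python
sketch. It assumes the cpu field in the usage line reads
days:hours:minutes:seconds; the helper usage_to_seconds is made up for this
mail. Since cpu time is not wallclock time, this is only a rough indication.

```python
def usage_to_seconds(value):
    """Convert 'dd:hh:mm:ss' (or a shorter 'hh:mm:ss', 'mm:ss', 'ss') to seconds."""
    weights = [1, 60, 3600, 86400]  # seconds, minutes, hours, days
    fields = [int(f) for f in value.split(":")]
    if len(fields) > len(weights):
        raise ValueError("too many fields in %r" % value)
    # Pair the rightmost field with seconds, the next with minutes, and so on.
    return sum(w * f for w, f in zip(weights, reversed(fields)))


S_RT = 1296000                          # requested soft limit from qstat
cpu = usage_to_seconds("15:15:03:33")   # cpu usage reported by qstat

print(S_RT // 86400)   # 15 (days)
print(cpu - S_RT)      # 54213 (seconds of cpu time beyond the limit)
```

So the reported cpu time is about 15 hours above the requested s_rt value.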
Yours,

Al


-- 

Alois Dirnaichner
http://www.theorie.physik.uni-muenchen.de/~al

Rechnerbetriebsgruppe
Arnold Sommerfeld Center
Theresienstr. 39
80333 Muenchen
