[GE users] Prevent job execution if a certain wallclock time value is exceeded

reuti reuti at staff.uni-marburg.de
Sun Nov 1 17:45:12 GMT 2009


On 29.10.2009 at 14:06, rafaarco wrote:

> Hello everyone,
> I would like to know the best way in SGE to implement the following
> restriction:
> Let's suppose every user is given 1000 hours of wallclock time in
> which to run his jobs. Every time one of a user's jobs finishes, that
> job's total wallclock time (slots * actual wallclock time) is
> subtracted from the user's remaining time. Finally, when this time
> falls to 0, the user is prevented from running (ideally, submitting)
> more jobs.

At submission time a job may still be eligible to be submitted, but by
the time it would start, the user can already be over the limit. One
place to record that a user can't run anything would be a global
"xuser_lists disabled" entry (qconf -mconf); you then adjust the
entries in the "disabled" list (it's just a conventional user list,
whose name you can choose freely).
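
A minimal sketch of how the entries in such a list could be adjusted from a script. It assumes the list is really called "disabled" and that the caller has SGE manager or operator rights; the helper names acl_command and set_blocked are hypothetical. The qconf switches -au and -du add a user to, or delete a user from, an access list.

```python
import subprocess

# Name of the access list referenced by xuser_lists (assumed; any name works).
ACL_NAME = "disabled"


def acl_command(user, blocked, acl=ACL_NAME):
    """Return the qconf argv that adds (-au) or removes (-du) a user from an access list."""
    flag = "-au" if blocked else "-du"
    return ["qconf", flag, user, acl]


def set_blocked(user, blocked, run=subprocess.run):
    """Run the qconf call; `run` is injectable so the logic can be tried without SGE."""
    return run(acl_command(user, blocked), check=True)
```

Calling set_blocked("alice", True) would block the user, set_blocked("alice", False) would release him again.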

For a check at submission time, a JSV (job submission verifier) would
do the job: it can test whether the requested wallclock time of the
job would exceed the limit. If the user is already in the above
"disabled" list, he can't submit anything anyway.
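
The decision such a JSV has to make can be sketched as follows. This is only the arithmetic, not the JSV protocol itself; the function names and the idea of passing the remaining budget in seconds are assumptions for illustration. Only the two common h_rt forms (plain seconds and h:m:s) are handled, and for a parallel job with a slot range the JSV would have to decide which bound to charge.

```python
def parse_h_rt(value):
    """Convert an h_rt request such as "3600" or "1:00:00" to seconds."""
    parts = value.split(":")
    if len(parts) == 1:
        return int(parts[0])
    if len(parts) != 3:
        raise ValueError("expected seconds or h:m:s, got %r" % value)
    # Empty fields (e.g. "::30") are treated as zero.
    hours, minutes, seconds = (int(p) if p else 0 for p in parts)
    return hours * 3600 + minutes * 60 + seconds


def fits_budget(h_rt, slots, remaining_seconds):
    """True if h_rt * slots does not exceed the user's remaining slot-seconds."""
    return parse_h_rt(h_rt) * slots <= remaining_seconds
```

A job for which fits_budget(...) returns False would be rejected by the verifier.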

> I am thinking of using prolog and epilog to do it. The initial user
> time is stored somewhere (e.g., a database) and when a job finishes its
> wallclock time is fetched via qacct and the total time left is updated.
> When a job is started, time is checked so that if job's h_rt*slots is
> higher than the time left, job fails to start.

AFAIK the accounting record is written only after the job has
finished, hence in the epilog you can use qacct only to collect the
information of the slave tasks of a parallel job. Also, for parallel
jobs it might be good to set "accounting_summary TRUE"; otherwise you
have to sum up all records for one job on your own.

The only option that covers everything is a separate daemon which
checks the accounting against the granted times. When a user reaches
his limit, he should be added to the "disabled" list and all of his
jobs killed.
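
One pass of such a daemon might be planned as in the sketch below. The data structures are assumptions: `used` and `granted` map a user name to slot-seconds, and `disabled` is the current content of the "disabled" list. The function only decides what to do; the daemon would then update the list via qconf and remove the jobs of each newly disabled user, for example with "qdel -u <user>".

```python
def plan_actions(used, granted, disabled):
    """Decide which users to block or release in one polling pass.

    used / granted: dict of user -> slot-seconds (hypothetical bookkeeping).
    disabled: iterable of users currently in the "disabled" list.
    """
    disabled = set(disabled)
    # A user is over the limit once consumption reaches the granted amount.
    over = {user for user, limit in granted.items() if used.get(user, 0) >= limit}
    return {
        "disable": sorted(over - disabled),
        # Release users who are blocked but back under their limit,
        # e.g. after their grant was raised.
        "enable": sorted((disabled & set(granted)) - over),
    }
```

Keeping the decision separate from the qconf and qdel calls makes the pass easy to test without a running cluster.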

I wonder whether this is worth an RFE. There is no hook in SGE where
you can define a job epilog which is run after the job has completed
and all accounting records are guaranteed to have been written. Such
a process would then run under the SGE admin user (or root) and of
course wouldn't count against the job's wallclock time. Maybe for now
the "mailer" entry can be abused for such a thing (it's started by
sge_execd), but this would also mean checking whether it was called
because of the begin (b) of a job or because of an abort or end (ea).

-- Reuti

