[GE users] "h_rt" or "s_rt" for predicting job end times

futurity neil at futurity.co.uk
Thu Feb 12 14:54:46 GMT 2009


Thank you Brian for your reply.

I can't see my users coding to listen for a SIGUSR1, so as you say, we may as well use "h_rt".

That said, is there any way for a job to specify how long it should run for without being terminated if it overruns?



-----Original Message-----
From: brs [mailto:brs at usf.edu] 
Sent: 12 February 2009 14:44
To: users at gridengine.sunsource.net
Subject: Re: [GE users] "h_rt" or "s_rt" for predicting job end times


IIRC, they should both override default_duration (in the sched_conf) in order to tell the scheduler "how long" a job should run.

The difference is that s_rt will send a SIGUSR1 to your process some time before it sends a SIGTERM. This allows you to implement a signal handler in your job to properly handle job termination. That warning interval is set in the queue configuration in the 'notify' field.

h_rt does not provide a SIGUSR1 and, I believe, sends a SIGKILL to the processes, terminating them ungracefully. This can be bad if any of your jobs catch SIGTERM in order to facilitate a clean-up process before exiting, BUT I have not seen very many codes that catch SIGTERM or SIGUSR1 (on our system at least) and so most of our users use h_rt.
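
For anyone who does want to handle the warning signal, here is a minimal sketch of the kind of handler described above. It assumes the job is a Python script and that the warning arrives as SIGUSR1 as described; run_job, work_units and checkpoint are hypothetical names for illustration, not part of Grid Engine.

```python
import signal


def run_job(work_units, checkpoint):
    """Run callables in order; stop early and checkpoint if SIGUSR1 arrives.

    `work_units` (a list of callables) and `checkpoint` (a callable that
    saves partial results) are hypothetical stand-ins for real job logic.
    Returns True if all units ran, False if the job stopped early.
    """
    stop = {"requested": False}

    def on_warning(signum, frame):
        # Keep the handler tiny: just record that the warning was received.
        stop["requested"] = True

    previous = signal.signal(signal.SIGUSR1, on_warning)
    try:
        results = []
        for unit in work_units:
            if stop["requested"]:
                # Save what we have while there is still time before the kill.
                checkpoint(results)
                return False
            results.append(unit())
        return True
    finally:
        # Restore whatever handler was installed before we started.
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)
```

The flag is only checked between work units, so each unit needs to be short compared with the warning interval for the checkpoint to finish in time.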

Best Regards,
Brian Smith

futurity wrote:
> Hi,
> We're using Grid Engine 6.1 and need some help deciding which out of 
> "h_rt" and "s_rt" our jobs should be using in order to help the 
> scheduler predict when jobs will finish.
> When I posted recently about our reservation problems, Reuti suggested 
> I look into using "h_rt". Unfortunately the Admin and User PDF guides 
> don't contain any information on either "h_rt" or "s_rt", so I had to 
> experiment to find out what it does.
> From my experiments, it appears that "h_rt" sets a run time per job, 
> which is used by the scheduler to predict when jobs finish.
> Unfortunately, it causes jobs to be terminated if they run for longer 
> than this specified time. I'm guessing that "h_" stands for a hard 
> limit and this is why jobs are terminated when they exceed this?
> I'm guessing that "s_rt" is a soft limit? I'm hoping that this means 
> that once the time specified by the job is reached, that it does "NOT"
> terminate the job? i.e. if the user specified the wrong time limit by 
> accident, or the job ran slower for some reason, that the job would be 
> allowed to continue running?
> Does anyone know if "s_rt" is also used by the scheduler in the same 
> way that "h_rt" is used and if the only difference would be that one 
> terminates and the other doesn't?
> Sorry for all these questions but I can't seem to find any 
> documentation on these two settings. If anyone can point me at some 
> documentation it would be really appreciated.
> Many thanks,
> Neil
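
For reference, the run-time limits discussed in this thread are requested per job through qsub's -l option (for example -l h_rt=1:00:00). Below is a small sketch of building such a command line. The helper names and the job.sh script are hypothetical, and requesting an s_rt slightly below h_rt is only one possible arrangement, not something this thread prescribes.

```python
def format_runtime(seconds):
    """Format a duration in seconds as H:MM:SS for a resource request."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def qsub_command(script, h_rt=None, s_rt=None):
    """Build a qsub argument list with optional run-time requests.

    `script` is a hypothetical job script path; `h_rt` and `s_rt` are
    durations in seconds, or None to leave that limit unrequested.
    """
    cmd = ["qsub"]
    if s_rt is not None:
        cmd += ["-l", f"s_rt={format_runtime(s_rt)}"]
    if h_rt is not None:
        cmd += ["-l", f"h_rt={format_runtime(h_rt)}"]
    cmd.append(script)
    return cmd


# Example: a two-hour hard limit with a soft limit five minutes earlier.
print(" ".join(qsub_command("job.sh", h_rt=7200, s_rt=6900)))
```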

Brian Smith
Sr. HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. ENB308
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu


To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


