[GE users] Unexpected "s_rt" / "Notify time" behaviour.

joga Joachim.Gabler at sun.com
Wed Jun 24 13:46:44 BST 2009

Hi Neil,

This could be IZ 3017 (http://gridengine.sunsource.net/issues/show_bug.cgi?id=3017), which was fixed recently for 6.2u3 (to be available soon) and 6.1u7 (probably coming in autumn).

Best regards,


On 06/24/09 13:46, futurity wrote:
Hi Fellow grid engine users,

I was wondering if someone could help me get to grips with "s_rt" and "notify time".  I think I'm almost there, but I'm experiencing some unexpected "Notify time" behaviour.

We're using: Grid Engine ge-6.1u3-bin-lx24-x86 / ge-6.1u3-common on openSUSE 10.3 (32-bit).

I'm trying to configure our grid so that users can specify their estimated run time using "s_rt", so that the scheduler knows roughly how long jobs should run and can therefore calculate which machines to reserve and which jobs to backfill.  I'm using "s_rt" because the users strongly object to jobs being killed, so instead I will catch the resulting SIGUSR1 so that it won't kill the jobs.

The wrapper script "trapUsr1.sh" that catches the SIGUSR1 is as follows:

# Usage:
#        trapUsr1.sh my command line

# Report when SIGUSR1 arrives instead of letting it terminate the wrapper.
trapFun() { echo "USR1 signal trapped on [$(date)]"; }

trap trapFun USR1
"$@"  # Command specified on command line gets launched here, arguments kept intact.
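
A side note on the wrapper above: in POSIX shells, a handler installed with trap is reset to the default action in child processes, whereas a signal ignored with trap '' stays ignored in children. The sketch below illustrates only that shell behaviour. child_status is a hypothetical helper, the sleep stands in for the user's job, and whether Grid Engine delivers SIGUSR1 to the child as well as to the wrapper is an assumption that this thread does not confirm.

```shell
#!/bin/sh
# Illustrative sketch only; child_status is a hypothetical helper and is
# not part of trapUsr1.sh.
# Usage: child_status handler|ignore
# Prints the exit status of a child 'sleep' that is sent SIGUSR1.
child_status() {
    (
        if [ "$1" = ignore ]; then
            trap '' USR1               # ignored: disposition is inherited by children
        else
            trap 'echo caught' USR1    # handler: children revert to the default action
        fi
        sleep 2 &                      # stands in for the user's job
        child=$!
        sleep 1                        # give the child time to start
        kill -USR1 "$child"            # signal the child directly
        wait "$child"
        echo $?
    ) 2>/dev/null
}

handler_status=$(child_status handler)
ignore_status=$(child_status ignore)

# A status above 128 means the child was terminated by the signal;
# 0 means the sleep ran to completion.
echo "handler: $handler_status"
echo "ignore:  $ignore_status"
```

If the job itself must survive the signal, ignoring USR1 in the wrapper is one option to try, at the cost of no longer being able to log its arrival from the same shell.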

I'm submitting jobs using: qsub -l s_rt=<time> <path>trapUsr1.sh <user's job>

The "Notify Time" for all the queues has been set to 00:01:00 for testing.  The job is a simple script that sleeps for 1 hour.

When I submit a job using an "s_rt" value of 00:01:00, the job only runs for 00:01:01, and even though the signal is caught and reported in the job's output file, it seems that the notify time is ignored. Likewise, when I submit a job using an "s_rt" value of 00:03:00, the job ends after 00:03:01.

The qconf man page says:

       The  time  to  wait between delivery of SIGUSR1/SIGUSR2 notification signals and suspend/kill signals if
       job was submitted with the qsub(1) -notify option.
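
Note that the excerpt ties the notify interval to jobs submitted with the -notify option, which the submission command shown earlier does not use. The dry-run sketch below only prints what such a submission might look like. build_qsub and the ./myjob.sh path are hypothetical, and whether -notify changes how an exceeded "s_rt" limit is handled is an assumption that this thread does not settle.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints a qsub command line instead of running it.
# Usage: build_qsub <soft run-time limit> <wrapper> <job> [args...]
build_qsub() {
    soft_rt=$1
    shift
    # -notify asks for warning signals ahead of suspend/kill, per the man page excerpt
    echo qsub -notify -l "s_rt=$soft_rt" "$@"
}

build_qsub 00:03:00 ./trapUsr1.sh ./myjob.sh
```

Printing the command first makes it easy to check the options before submitting anything to the cluster.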

So why are these jobs not killed 00:01:00 (1 minute) after the SIGUSR1 signal is received?

Any help would be very welcome as I've hit a brick wall.


