[GE users] Unexpected "s_rt" / "Notify time" behaviour.

futurity neil at futurity.co.uk
Wed Jun 24 12:46:25 BST 2009



Hi Fellow grid engine users,

I was wondering if someone could help me get to grips with "s_rt" and "notify time".  I think I'm almost there, but I'm experiencing some unexpected "Notify time" behaviour.

We're using: Grid Engine ge-6.1u3-bin-lx24-x86 / ge-6.1u3-common on openSUSE 10.3 (32-bit).

I'm trying to configure our grid so that users can specify their estimated run time using "s_rt", so that the scheduler knows roughly how long jobs should run and can therefore work out which machines to reserve and which jobs to backfill.  I'm using "s_rt" because the users strongly object to jobs being killed, so instead I catch the SIGUSR1 that is sent when the limit is reached, so that it won't kill the jobs.

The wrapper script "trapUsr1.sh" that catches the SIGUSR1 is as follows:
>>>>>>>>
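#!/bin/bash

# Usage:
#        trapUsr1.sh my command line

# Handler: log the time at which SIGUSR1 arrived instead of letting it end the script.
trapFun() { echo "USR1 signal trapped on [$(date)]"; }

trap trapFun USR1
"$@"  # Command specified on command line gets launched here ("$@" keeps quoted arguments intact).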
>>>>>>>>
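
One detail of the wrapper is worth a sketch: bash only runs a trap handler once the foreground command has returned, whereas the wait builtin is interrupted by a trapped signal straight away. The variant below starts the command in the background and waits on it, so the handler can run while the job is still going. The function name run_with_usr1_trap is hypothetical. The sketch only protects the wrapper shell itself. It does not cover what happens if the job process also receives the signal.

```shell
#!/bin/bash
# Sketch only: run_with_usr1_trap is a hypothetical helper, not part of Grid Engine.

run_with_usr1_trap() {
    local pid status
    # Log the signal instead of letting it terminate the wrapper.
    trap 'echo "USR1 signal trapped on [$(date)]"' USR1

    "$@" &        # start the job in the background
    pid=$!

    wait "$pid"
    status=$?
    # A trapped signal makes wait return early with a status above 128;
    # keep waiting for as long as the job itself is still alive.
    while [ "$status" -gt 128 ] && kill -0 "$pid" 2>/dev/null; do
        wait "$pid"
        status=$?
    done

    trap - USR1   # restore the default disposition
    return "$status"
}

# Example call: run_with_usr1_trap sleep 3600
```

A wrapper script built this way would end with a call such as run_with_usr1_trap "$@" in place of the bare command line.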

I'm submitting jobs using: qsub -l s_rt=<time> <path>trapUsr1.sh <user's job>

The "Notify Time" for all the queues has been set to 00:01:00 for testing.  The job is a simple script that sleeps for 1 hour.

When I submit a job using an "s_rt" value of 00:01:00, the job only runs for 00:01:01, and even though the signal is caught and logged in the job's output file, the notify time seems to be ignored.  Likewise, when I submit a job using an "s_rt" value of 00:03:00, the job ends after 00:03:01.

The qconf man page says:

   notify
       The  time  to  wait between delivery of SIGUSR1/SIGUSR2 notification signals and suspend/kill signals if
       job was submitted with the qsub(1) -notify option.
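
The excerpt ties the notify interval to jobs submitted with the qsub -notify option. As a rough sketch, the submission line used earlier with that option added might look like the following. The two paths are placeholders, not the real locations. Whether this changes the "s_rt" behaviour described here is not something the excerpt settles.

```shell
#!/bin/bash
# Placeholder paths (assumptions): substitute the real wrapper and job locations.
WRAPPER=/path/to/trapUsr1.sh
JOB=/path/to/users_job.sh

# Same submission as before, with -notify added.
submit_cmd=(qsub -notify -l s_rt=00:01:00 "$WRAPPER" "$JOB")

# Print the command for inspection; run "${submit_cmd[@]}" on a submit host to use it.
printf '%s\n' "${submit_cmd[*]}"
```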

So why are these jobs not killed 00:01:00 (1 minute) after the SIGUSR1 signal is received?

Any help would be very welcome as I've hit a brick wall.

Neil



