[GE users] Unexpected "s_rt" / "Notify time" behaviour.

futurity neil at futurity.co.uk
Wed Jun 24 17:51:33 BST 2009



Hi Joachim and other grid users,

Thank you Joachim for your reply.

Can anyone confirm that the problem I am experiencing is related to the bug Joachim so kindly pointed out?  The method I'm using has been recommended to me in the past on this mailing list, and I would be surprised if it had been recommended when in reality it doesn't work.

Is there any other way of specifying job run times without jobs being killed when the run time limit is reached?  It seems very restrictive that the only way to provide run time estimates is to allow the engine to kill all jobs that overrun.  I understand the logic behind enforcing limits on jobs, but in our case we need resource reservation and backfilling, without the hard limits and killing of jobs.  If a job overruns, in our case it needs to, and we still want the results.  We don't want a job killed after 3 days just because it has overrun by ½ a day.

If it's a bug, then we can wait, but if it's just the way it works and there is no way around it with "s_rt" and "Notify time", then I'm really stuck and will have to try and think of something else.

Neil

________________________________
From: Joachim.Gabler at Sun.COM [mailto:Joachim.Gabler at Sun.COM]
Sent: 24 June 2009 13:47
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Unexpected "s_rt" / "Notify time" behaviour.

Hi Neil,

this could be IZ 3017 (http://gridengine.sunsource.net/issues/show_bug.cgi?id=3017) which has been fixed recently for the 6.2u3 (to be available soon) and the 6.1u7 (probably coming in autumn).

Best regards,

   Joachim

On 06/24/09 13:46, futurity wrote:
Hi Fellow grid engine users,

I was wondering if someone could help me get to grips with "s_rt" and "notify time".  I think I'm almost there, but I'm experiencing some unexpected "Notify time" behaviour.

We're using: Grid Engine ge-6.1u3-bin-lx24-x86 / ge-6.1u3-common on openSuse10.3 (32bit).

I'm trying to configure our grid so that users can specify their estimated run time using "s_rt", so that the scheduler knows roughly how long jobs should run for and can therefore calculate which machines to reserve and which jobs to backfill.  I'm using "s_rt" because the users are strongly objecting to jobs being killed, so instead I will catch the resulting SIGUSR1 so that it won't kill the jobs.

The wrapper script "trapUsr1.sh" that catches the SIGUSR1 is as follows:
>>>>>>>>
#!/bin/bash

# Usage:
#        trapUsr1.sh my command line

# Report the USR1 notification instead of letting it terminate this wrapper.
trapFun() { echo "USR1 signal trapped on [$(date)]"; }

trap trapFun USR1
"$@"  # Command specified on the command line gets launched here.
>>>>>>>>
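
A wrapper like the one above only protects the wrapper shell itself. The following is a minimal sketch of a variant, under the assumption (not confirmed anywhere in this thread) that the USR1 notification may also reach the job's own process, where the default action would terminate it. The function name run_shielded is hypothetical. It logs the signal in the wrapper, starts the command with USR1 ignored, and waits for it in the background so the trap can run as soon as the signal arrives. It is not a workaround for IZ 3017 and does nothing about any later kill signal.

```shell
#!/bin/bash

# Hypothetical helper: run a command shielded from SIGUSR1.
run_shielded() {
    # Log the notification in the wrapper, as trapUsr1.sh does.
    trap 'echo "USR1 signal trapped on [$(date)]"' USR1

    # Start the command from a subshell that ignores USR1. An ignored
    # signal stays ignored across exec, so the command is not terminated
    # by USR1 unless it installs its own handler.
    ( trap '' USR1; exec "$@" ) &
    local pid=$!

    # wait returns early when a trapped signal arrives, so keep waiting
    # until the child process has really exited.
    while kill -0 "$pid" 2>/dev/null; do
        wait "$pid"
    done
}

# Example: the child sends USR1 to itself and still runs to completion.
run_shielded sh -c 'kill -s USR1 $$; echo "job still running"'
```

Running the command in the background matters because bash defers a trap handler until the current foreground command has finished, whereas the wait builtin is interrupted by a trapped signal.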

I'm submitting jobs using: qsub -l s_rt=<time> <path>trapUsr1.sh <user's job>

The "Notify Time" for all the queues has been set to 00:01:00 for testing.  The job is a simple script that sleeps for 1 hour.

When I submit a job using an "s_rt" value of 00:01:00, the job only runs for 00:01:01, and even though the signal is caught and recorded in the job's output file, it seems that the notify time is ignored.  Likewise, when I submit a job using an "s_rt" value of 00:03:00, the job ends after 00:03:01.

The qconf man page says:

   notify
       The  time  to  wait between delivery of SIGUSR1/SIGUSR2 notification signals and suspend/kill signals if
       job was submitted with the qsub(1) -notify option.

So why are these jobs not killed 00:01:00 (1 minute) after the SIGUSR1 signal is received?

Any help would be very welcome as I've hit a brick wall.

Neil




