[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Mon Jan 24 20:26:31 GMT 2005


It turns out I had a bug in my testing script.  Once I fixed it, I
found that I reliably get a USR2 signal on both rescheduling and
deleting when the job is submitted with "qsub -notify".  This is good
news for me.
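
For readers who want to see the mechanism in isolation, here is a minimal sketch of catching the warning signal. It assumes only what is described above (with -notify, USR2 arrives ahead of the kill); the function name onNotify and the self-sent signal are stand-ins for illustration and are not part of SGE.

```shell
#!/bin/bash
# Minimal sketch: catch the USR2 warning delivered ahead of the kill.
gotSignal=none

# Hypothetical handler name; a real job would save state or
# remove its temporary files here.
onNotify()
{
	gotSignal=$1
	echo "caught $1, cleaning up"
}

trap "onNotify Usr2" USR2

# Stand-in for SGE: send the signal to ourselves
kill -USR2 $$

echo "handler ran with: $gotSignal"
```

The handler has only the queue's notify time to finish before the job is killed, so it should stay short.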

However, there is indeed a problem with overlapping execution when a job
is rescheduled.  That is, the first job continues to run for some time
after the rescheduled job has started up.  I've tried to provide the
information needed to reproduce this.  Perhaps someone knows of some
parameters that govern this overlapping.  It happens both with the qmon
GUI and the qmod command line.

The worst thing, as I've said before, is that the overlapping jobs fight
over the stdout/stderr file, with one or the other getting its output
in, but not both.  In my test case, I actually have three jobs
overlapping, but it is just a test case.
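
A possible way to keep overlapping runs from clobbering each other's output, offered only as a sketch and not verified against SGE, is to have each execution write to a file whose name includes the execution host and the process ID. The helper name unique_log_path and the LOG_DIR variable are hypothetical. This does not address the overlap itself; it only separates the output.

```shell
#!/bin/bash
# Hypothetical helper: build a log path that is unique per execution
# attempt by including the short host name and the shell's PID.
unique_log_path()
{
	local dir=$1 name=$2
	local host
	host=$(uname -n | sed 's/\..*//')
	# $$ is still the PID of the main shell inside $(...)
	echo "$dir/$name.$host.$$.out"
}

# LOG_DIR is an assumed setting; fall back to a temporary directory
logDir=${LOG_DIR:-${TMPDIR:-/tmp}}
logFile=$(unique_log_path "$logDir" "${JOB_NAME:-trial}")

echo "`date`: started" >>"$logFile"
echo "logging to $logFile"
```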

Here is a testing script I have been using:

#! /bin/bash

#$ -o $HOME/gridoutput/$JOB_NAME.out -j y
#$ -S /bin/bash

set -u

# Set up restart status
: ${RESTARTED=0}

# Get our original execution search path if being run by SGE
: ${SGE_O_PATH=$PATH}
PATH=$SGE_O_PATH

# Get our host name if not being run by the SGE
: ${HOSTNAME=`uname -n`}
xeqHost=`echo $HOSTNAME | sed 's/\..*//'`

# Get the name of the host that originally submitted the job
: ${SGE_O_HOST=`uname -n`}
submitHost=`echo $SGE_O_HOST | sed 's/\..*//'`

# Get the name of the original user
: ${SGE_O_LOGNAME=$USER}
USER=$SGE_O_LOGNAME

# If SGE was not given a rep number
: ${SGE_TASK_ID=1}
if [ "$SGE_TASK_ID" = "undefined" ]
then
	SGE_TASK_ID=1
fi

# Get our command name if not being run by the SGE
: ${REQUEST=$0}
myName=$REQUEST
cmdRoot=`basename $myName`
myPath=`dirname $myName`

# Get to the default directory that we will use
: ${SGE_O_WORKDIR=`pwd`}
cd $SGE_O_WORKDIR


trap "cleanupGo Usr1" USR1
trap "cleanupGo Usr2" USR2

# KILL and STOP cannot be caught, so no traps are set for them
trap "cleanupExit Term" TERM
trap "cleanupExit Quit" QUIT
trap "cleanupExit Hup" HUP
trap "cleanupExit Int" INT

outputFile="output.$$"

touch $outputFile
mkdir temp$$

cleanupGo()
{
	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" >>$outputFile
	mv $outputFile output.done.$$
	outputFile="output.done.$$"
	rm -rf temp$$
}
cleanupExit()
{
	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" >>$outputFile
	exit
}

counter=0
while [ $counter -lt 100 ]
do
	echo "`date`: Host: $xeqHost, Restarted=$RESTARTED, counter=$counter" >>$outputFile
	sleep 1

	let ++counter
done

I have run this with another script to have qmod reschedule the job
periodically:

> date
Mon Jan 24 13:04:52 EST 2005
> qsub -notify -q high.q trial
Your job 514 ("trial") has been submitted.
> sleep 10
> date
Mon Jan 24 13:05:02 EST 2005
> qmod -rq high.q
Pushed rescheduling of job 514 on host jczisny-lx.group-w-inc.com
> sleep 10
> date
Mon Jan 24 13:05:12 EST 2005
> qmod -rq high.q
Pushed rescheduling of job 514 on host dgruhn-lx.group-w-inc.com


Running this, and finally killing the job via the qmon GUI, gives me
three output files.  The notify time on the queue is 5 seconds.

Job 1 output:
Mon Jan 24 13:04:54 EST 2005: Host: jczisny-lx, Restarted=0, counter=0
Mon Jan 24 13:04:55 EST 2005: Host: jczisny-lx, Restarted=0, counter=1
Mon Jan 24 13:04:56 EST 2005: Host: jczisny-lx, Restarted=0, counter=2
Mon Jan 24 13:04:57 EST 2005: Host: jczisny-lx, Restarted=0, counter=3
Mon Jan 24 13:04:58 EST 2005: Host: jczisny-lx, Restarted=0, counter=4
Mon Jan 24 13:04:59 EST 2005: Host: jczisny-lx, Restarted=0, counter=5
Mon Jan 24 13:05:00 EST 2005: Host: jczisny-lx, Restarted=0, counter=6
Mon Jan 24 13:05:01 EST 2005: Host: jczisny-lx, Restarted=0, counter=7
Mon Jan 24 13:05:02 EST 2005: Host: jczisny-lx, Restarted=0, counter=8
<----- qmod executed
Mon Jan 24 13:05:03 EST 2005: Host: jczisny-lx, Restarted=0, counter=9
Mon Jan 24 13:05:04 EST 2005: Host: jczisny-lx, Restarted=0, counter=10
<------ Job 2 starts
Mon Jan 24 13:05:05 EST 2005: Host: jczisny-lx, Restarted=0, counter=11
Mon Jan 24 13:05:06 EST 2005: Host: jczisny-lx, Restarted=0, counter=12
Mon Jan 24 13:05:07 EST 2005: Host: jczisny-lx, Restarted=0, counter=13
Mon Jan 24 13:05:08 EST 2005: Host: jczisny-lx, Restarted=0, counter=14
Mon Jan 24 13:05:09 EST 2005: Host: jczisny-lx, Restarted=0, counter=15
Mon Jan 24 13:05:10 EST 2005: Host: jczisny-lx, Restarted=0, counter=16
Mon Jan 24 13:05:11 EST 2005: Host: jczisny-lx, Restarted=0, counter=17
Mon Jan 24 13:05:12 EST 2005: Host: jczisny-lx, Restarted=0, counter=18
Mon Jan 24 13:05:13 EST 2005: Host: jczisny-lx, Restarted=0, counter=19
Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted: 0, counter: 19, Signal: Usr2
<------ User signal 2 received
Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted=0, counter=20
Mon Jan 24 13:05:15 EST 2005: Host: jczisny-lx, Restarted=0, counter=21
Mon Jan 24 13:05:16 EST 2005: Host: jczisny-lx, Restarted=0, counter=22
Mon Jan 24 13:05:17 EST 2005: Host: jczisny-lx, Restarted=0, counter=23
Mon Jan 24 13:05:18 EST 2005: Host: jczisny-lx, Restarted=0, counter=24
<----- Job dies 5 seconds after USR2


Job 2 output:
Mon Jan 24 13:05:04 EST 2005: Host: dgruhn-lx, Restarted=1, counter=0
Mon Jan 24 13:05:05 EST 2005: Host: dgruhn-lx, Restarted=1, counter=1
Mon Jan 24 13:05:06 EST 2005: Host: dgruhn-lx, Restarted=1, counter=2
Mon Jan 24 13:05:07 EST 2005: Host: dgruhn-lx, Restarted=1, counter=3
Mon Jan 24 13:05:08 EST 2005: Host: dgruhn-lx, Restarted=1, counter=4
Mon Jan 24 13:05:09 EST 2005: Host: dgruhn-lx, Restarted=1, counter=5
Mon Jan 24 13:05:10 EST 2005: Host: dgruhn-lx, Restarted=1, counter=6
Mon Jan 24 13:05:11 EST 2005: Host: dgruhn-lx, Restarted=1, counter=7
Mon Jan 24 13:05:12 EST 2005: Host: dgruhn-lx, Restarted=1, counter=8
<----- qmod executed
Mon Jan 24 13:05:13 EST 2005: Host: dgruhn-lx, Restarted=1, counter=9
Mon Jan 24 13:05:14 EST 2005: Host: dgruhn-lx, Restarted=1, counter=10
<------ Job 3 starts
Mon Jan 24 13:05:15 EST 2005: Host: dgruhn-lx, Restarted=1, counter=11
Mon Jan 24 13:05:16 EST 2005: Host: dgruhn-lx, Restarted=1, counter=12
Mon Jan 24 13:05:17 EST 2005: Host: dgruhn-lx, Restarted=1, counter=13
Mon Jan 24 13:05:18 EST 2005: Host: dgruhn-lx, Restarted=1, counter=14
<------ Job 1 ends
Mon Jan 24 13:05:19 EST 2005: Host: dgruhn-lx, Restarted=1, counter=15
Mon Jan 24 13:05:20 EST 2005: Host: dgruhn-lx, Restarted=1, counter=16
Mon Jan 24 13:05:21 EST 2005: Host: dgruhn-lx, Restarted=1, counter=17
Mon Jan 24 13:05:22 EST 2005: Host: dgruhn-lx, Restarted=1, counter=18
Mon Jan 24 13:05:23 EST 2005: Host: dgruhn-lx, Restarted=1, counter=19
Mon Jan 24 13:05:24 EST 2005: Host: dgruhn-lx, Restarted=1, counter=20
Mon Jan 24 13:05:25 EST 2005: Host: dgruhn-lx, Restarted=1, counter=21
Mon Jan 24 13:05:26 EST 2005: Host: dgruhn-lx, Restarted=1, counter=22
Mon Jan 24 13:05:27 EST 2005: Host: dgruhn-lx, Restarted=1, counter=23
Mon Jan 24 13:05:28 EST 2005: Host: dgruhn-lx, Restarted=1, counter=24
Mon Jan 24 13:05:29 EST 2005: Host: dgruhn-lx, Restarted=1, counter=25
Mon Jan 24 13:05:30 EST 2005: Host: dgruhn-lx, Restarted=1, counter=26
Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted: 1, counter: 26, Signal: Usr2
<------ User signal 2 received
Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted=1, counter=27
Mon Jan 24 13:05:32 EST 2005: Host: dgruhn-lx, Restarted=1, counter=28
Mon Jan 24 13:05:33 EST 2005: Host: dgruhn-lx, Restarted=1, counter=29
Mon Jan 24 13:05:34 EST 2005: Host: dgruhn-lx, Restarted=1, counter=30
Mon Jan 24 13:05:35 EST 2005: Host: dgruhn-lx, Restarted=1, counter=31
<----- Job dies 5 seconds after USR2


Job 3 output:
Mon Jan 24 13:05:14 EST 2005: Host: dwarme-lx, Restarted=1, counter=0
Mon Jan 24 13:05:15 EST 2005: Host: dwarme-lx, Restarted=1, counter=1
Mon Jan 24 13:05:16 EST 2005: Host: dwarme-lx, Restarted=1, counter=2
Mon Jan 24 13:05:17 EST 2005: Host: dwarme-lx, Restarted=1, counter=3
Mon Jan 24 13:05:18 EST 2005: Host: dwarme-lx, Restarted=1, counter=4
Mon Jan 24 13:05:19 EST 2005: Host: dwarme-lx, Restarted=1, counter=5
Mon Jan 24 13:05:20 EST 2005: Host: dwarme-lx, Restarted=1, counter=6
Mon Jan 24 13:05:21 EST 2005: Host: dwarme-lx, Restarted=1, counter=7
Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=8
Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted: 1, counter: 8, Signal: Usr2
<------ User signal 2 received
Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=9
Mon Jan 24 13:05:23 EST 2005: Host: dwarme-lx, Restarted=1, counter=10
Mon Jan 24 13:05:24 EST 2005: Host: dwarme-lx, Restarted=1, counter=11
Mon Jan 24 13:05:25 EST 2005: Host: dwarme-lx, Restarted=1, counter=12
Mon Jan 24 13:05:26 EST 2005: Host: dwarme-lx, Restarted=1, counter=13
<----- Job dies 5 seconds after USR2
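
To put numbers on the overlap, the first and last timestamps of each log above can be compared. The sketch below does that arithmetic; the function names to_secs and overlap are made up for illustration, and the first and last log lines are taken as an approximation (to within about a second) of each job's lifetime.

```shell
#!/bin/bash
# Convert HH:MM:SS (two-digit fields assumed) to seconds since midnight.
to_secs()
{
	h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
	# Strip one leading zero so 08 and 09 are not read as octal
	echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# overlap START1 END1 START2 END2: seconds during which both ran
overlap()
{
	s1=$(to_secs "$1"); e1=$(to_secs "$2")
	s2=$(to_secs "$3"); e2=$(to_secs "$4")
	start=$s1; [ "$s2" -gt "$start" ] && start=$s2
	end=$e1;   [ "$e2" -lt "$end" ] && end=$e2
	if [ "$end" -gt "$start" ]; then echo $(( end - start )); else echo 0; fi
}

# First and last timestamps taken from the three logs above
echo "Job 1 / Job 2: $(overlap 13:04:54 13:05:18 13:05:04 13:05:35) s"  # 14 s
echo "Job 2 / Job 3: $(overlap 13:05:04 13:05:35 13:05:14 13:05:26) s"  # 12 s
echo "Job 1 / Job 3: $(overlap 13:04:54 13:05:18 13:05:14 13:05:26) s"  # 4 s
```

The last line is the window in which all three copies were running at once.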



On Fri, 2005-01-21 at 18:37, Ron Chen wrote: 

> Can you reproduce the problem with the command line
> tool "qmod"?
> 
> And, you can use mailer to clean up the aborted jobs,
> since the "mailer" can be anything, so you can point
> it to a script, and in the script, do whatever you
> want (since it runs as root).
> 
>  -Ron
> 
> -- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > Just by clicking the Reschedule button in the job
> > display dialog box.