[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Tue Jan 25 15:11:27 GMT 2005


I am using the latest 6.0u3 binary download on lx24_x86 and I also only
get $RESTARTED as 0 or 1.  Using $RESTARTED to optimize behaviour would
be a good improvement, but I'm thinking I need to submit this as an
issue.  Ron, please see if you can duplicate my results on a similar
system.
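
A minimal sketch of the $RESTARTED optimization discussed in this thread (wait and clear stale output only when the job has been restarted). The helper name prepare_output, its parameters, and the 60-second default delay are assumptions for illustration, not part of SGE:

```shell
#!/bin/bash
# Hypothetical helper, not part of SGE: wait and clear stale output only
# when SGE reports that this instance is a restart.
# RESTARTED is set by SGE; default it to 0 when run outside SGE.
: "${RESTARTED:=0}"

prepare_output()
{
	local file=$1
	local delay=${2:-60}    # assumed delay in seconds, tune to your notify time
	if [ "$RESTARTED" != "0" ]
	then
		sleep "$delay"      # give the previous instance time to be killed
		: >"$file"          # discard output left by the previous instance
	else
		touch "$file"       # first run: nothing to wait for or clear
	fi
}
```

Calling `prepare_output "$outputFile"` near the top of a job script would cost nothing on the first run and only delay restarted instances.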

I've refined my test script a bit, so here is the latest.  First, this
is the job script that gets submitted:
------------------------------------
#! /bin/bash

#$ -o $HOME/gridoutput/$JOB_NAME.out -j y
#$ -S /bin/bash

set -u

# Set up restart status
: ${RESTARTED=0}

# Get our original execution search path if being run by SGE
: ${SGE_O_PATH=$PATH}
PATH=$SGE_O_PATH

# Get our host name if not being run by the SGE
: ${HOSTNAME=`uname -n`}
xeqHost=`echo $HOSTNAME | sed 's/\..*//'`

# If SGE was not given a rep number
: ${SGE_TASK_ID=1}
if [ "$SGE_TASK_ID" = "undefined" ]
then
	SGE_TASK_ID=1
fi

# Get to the default directory that we will use
: ${SGE_O_WORKDIR=`pwd`}
cd $SGE_O_WORKDIR


trap "cleanup Usr2" USR2

trap "cleanup Term" TERM
trap "cleanup Quit" QUIT
trap "cleanup Hup" HUP
trap "cleanup Int" INT

outputFile="output.$$"

touch $outputFile
mkdir temp$$

cleanup()
{
	echo "`date`: Host=$xeqHost, Restarted=$RESTARTED, counter=$counter, Signal: $1" >>$outputFile
	mv $outputFile output.done.$$
	outputFile="output.done.$$"
	rm -rf temp$$
}

counter=0
while [ $counter -lt 100 ]
do
	echo "`date`: Host=$xeqHost, Restarted=$RESTARTED, counter=$counter" >>$outputFile
	sleep 1

	let ++counter
done
------------------------------
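
Note that cleanup() in the script above returns without exiting, so the loop keeps counting after USR2 until SGE kills the job. For comparison, here is a sketch of a variant whose handler exits once it has cleaned up. The names run_job and cleanup_and_exit and the ".done" suffix are hypothetical; run_job is meant to be called in a subshell or as the body of a job script, because it calls exit:

```shell
#!/bin/bash
# Sketch only: the handler exits after cleaning up, unlike cleanup() above.
# run_job and cleanup_and_exit are hypothetical names.

cleanup_and_exit()
{
	echo "$(date): counter=$counter, Signal: $1" >>"$outputFile"
	mv "$outputFile" "$outputFile.done"
	exit 0
}

run_job()
{
	outputFile=$1          # file the loop appends to
	limit=${2:-100}        # loop bound, 100 as in the posted script
	counter=0
	# With qsub -notify, USR2 is delivered ahead of the kill.
	trap 'cleanup_and_exit Usr2' USR2
	while [ "$counter" -lt "$limit" ]
	do
		echo "$(date): counter=$counter" >>"$outputFile"
		sleep 1
		counter=$((counter + 1))
	done
}
```

Bash runs the trap only after the current foreground command (here, sleep 1) returns, so the exit can lag the signal by up to a second.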

Next, here is the script I use to submit, reschedule, and finally delete
the job:

----------------------------------
echo "> date"
date
echo "> qsub -notify -q high.q trial"
qsub -notify -q high.q trial

echo "> sleep 10"
sleep 10

echo "> date"
date
echo "> qmod -rq high.q"
qmod -rq high.q

echo "> sleep 10"
sleep 10

echo "> date"
date
echo "> qmod -rq high.q"
qmod -rq high.q

echo "> sleep 10"
sleep 10

echo "> date"
date
echo "> qdel -u dgruhn"
qdel -u dgruhn
----------------------------------
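
The echo-then-run pattern in the script above could be factored into a small helper. This is only a sketch; the function name run is an assumption:

```shell
#!/bin/bash
# Hypothetical helper: print a command the way the script above does
# ("> command"), then execute it with its arguments intact.
run()
{
	echo "> $*"
	"$@"
}
```

With it, each step becomes one line, for example `run date`, `run qsub -notify -q high.q trial`, or `run sleep 10`.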

Here is the output from the controlling script as it runs:

----------------------------------
> date
Tue Jan 25 09:52:41 EST 2005
> qsub -notify -q high.q trial
Your job 674 ("trial") has been submitted.
> sleep 10
> date
Tue Jan 25 09:52:51 EST 2005
> qmod -rq high.q
Pushed rescheduling of job 674 on host dgruhn-lx.group-w-inc.com
> sleep 10
> date
Tue Jan 25 09:53:01 EST 2005
> qmod -rq high.q
Pushed rescheduling of job 674 on host dwarme-lx.group-w-inc.com
> sleep 10
> date
Tue Jan 25 09:53:11 EST 2005
> qdel -u dgruhn
dgruhn has registered the job 674 for deletion
---------------------------------------

Here is the output from INSTANCE-1, the first submission:

----------------------------------
Tue Jan 25 09:52:42 EST 2005: Host=dgruhn-lx, Restarted=0, counter=0
Tue Jan 25 09:52:43 EST 2005: Host=dgruhn-lx, Restarted=0, counter=1
Tue Jan 25 09:52:44 EST 2005: Host=dgruhn-lx, Restarted=0, counter=2
Tue Jan 25 09:52:45 EST 2005: Host=dgruhn-lx, Restarted=0, counter=3
Tue Jan 25 09:52:46 EST 2005: Host=dgruhn-lx, Restarted=0, counter=4
Tue Jan 25 09:52:47 EST 2005: Host=dgruhn-lx, Restarted=0, counter=5
Tue Jan 25 09:52:48 EST 2005: Host=dgruhn-lx, Restarted=0, counter=6
Tue Jan 25 09:52:49 EST 2005: Host=dgruhn-lx, Restarted=0, counter=7
Tue Jan 25 09:52:50 EST 2005: Host=dgruhn-lx, Restarted=0, counter=8
Tue Jan 25 09:52:51 EST 2005: Host=dgruhn-lx, Restarted=0, counter=9
Tue Jan 25 09:52:52 EST 2005: Host=dgruhn-lx, Restarted=0, counter=10
<----- Reschedule is requested
Tue Jan 25 09:52:53 EST 2005: Host=dgruhn-lx, Restarted=0, counter=11
Tue Jan 25 09:52:54 EST 2005: Host=dgruhn-lx, Restarted=0, counter=12
Tue Jan 25 09:52:55 EST 2005: Host=dgruhn-lx, Restarted=0, counter=13
Tue Jan 25 09:52:56 EST 2005: Host=dgruhn-lx, Restarted=0, counter=14
Tue Jan 25 09:52:57 EST 2005: Host=dgruhn-lx, Restarted=0, counter=15
Tue Jan 25 09:52:58 EST 2005: Host=dgruhn-lx, Restarted=0, counter=16
Tue Jan 25 09:52:59 EST 2005: Host=dgruhn-lx, Restarted=0, counter=17
Tue Jan 25 09:53:00 EST 2005: Host=dgruhn-lx, Restarted=0, counter=18
Tue Jan 25 09:53:01 EST 2005: Host=dgruhn-lx, Restarted=0, counter=19
Tue Jan 25 09:53:02 EST 2005: Host=dgruhn-lx, Restarted=0, counter=20
Tue Jan 25 09:53:03 EST 2005: Host=dgruhn-lx, Restarted=0, counter=21
Tue Jan 25 09:53:04 EST 2005: Host=dgruhn-lx, Restarted=0, counter=22
Tue Jan 25 09:53:05 EST 2005: Host=dgruhn-lx, Restarted=0, counter=23
Tue Jan 25 09:53:06 EST 2005: Host=dgruhn-lx, Restarted=0, counter=24
Tue Jan 25 09:53:07 EST 2005: Host=dgruhn-lx, Restarted=0, counter=25
Tue Jan 25 09:53:08 EST 2005: Host=dgruhn-lx, Restarted=0, counter=25,
Signal: Usr2
Tue Jan 25 09:53:08 EST 2005: Host=dgruhn-lx, Restarted=0, counter=26
Tue Jan 25 09:53:09 EST 2005: Host=dgruhn-lx, Restarted=0, counter=27
Tue Jan 25 09:53:10 EST 2005: Host=dgruhn-lx, Restarted=0, counter=28
Tue Jan 25 09:53:11 EST 2005: Host=dgruhn-lx, Restarted=0, counter=29
Tue Jan 25 09:53:12 EST 2005: Host=dgruhn-lx, Restarted=0, counter=30
----------------------------------

Here is the output from INSTANCE-2, the job after it gets rescheduled
the first time:

----------------------------------
Tue Jan 25 09:52:52 EST 2005: Host=dwarme-lx, Restarted=1, counter=0
Tue Jan 25 09:52:53 EST 2005: Host=dwarme-lx, Restarted=1, counter=1
Tue Jan 25 09:52:54 EST 2005: Host=dwarme-lx, Restarted=1, counter=2
Tue Jan 25 09:52:55 EST 2005: Host=dwarme-lx, Restarted=1, counter=3
Tue Jan 25 09:52:56 EST 2005: Host=dwarme-lx, Restarted=1, counter=4
Tue Jan 25 09:52:57 EST 2005: Host=dwarme-lx, Restarted=1, counter=5
Tue Jan 25 09:52:58 EST 2005: Host=dwarme-lx, Restarted=1, counter=6
Tue Jan 25 09:52:59 EST 2005: Host=dwarme-lx, Restarted=1, counter=7
Tue Jan 25 09:53:00 EST 2005: Host=dwarme-lx, Restarted=1, counter=8
Tue Jan 25 09:53:01 EST 2005: Host=dwarme-lx, Restarted=1, counter=9
<----- Reschedule is requested
Tue Jan 25 09:53:02 EST 2005: Host=dwarme-lx, Restarted=1, counter=10
Tue Jan 25 09:53:03 EST 2005: Host=dwarme-lx, Restarted=1, counter=11
Tue Jan 25 09:53:04 EST 2005: Host=dwarme-lx, Restarted=1, counter=12
Tue Jan 25 09:53:05 EST 2005: Host=dwarme-lx, Restarted=1, counter=13
Tue Jan 25 09:53:06 EST 2005: Host=dwarme-lx, Restarted=1, counter=14
Tue Jan 25 09:53:07 EST 2005: Host=dwarme-lx, Restarted=1, counter=15
Tue Jan 25 09:53:08 EST 2005: Host=dwarme-lx, Restarted=1, counter=16
Tue Jan 25 09:53:09 EST 2005: Host=dwarme-lx, Restarted=1, counter=17
Tue Jan 25 09:53:10 EST 2005: Host=dwarme-lx, Restarted=1, counter=18
Tue Jan 25 09:53:11 EST 2005: Host=dwarme-lx, Restarted=1, counter=19
Tue Jan 25 09:53:12 EST 2005: Host=dwarme-lx, Restarted=1, counter=20
<------ INSTANCE-1 dies
Tue Jan 25 09:53:13 EST 2005: Host=dwarme-lx, Restarted=1, counter=21
Tue Jan 25 09:53:14 EST 2005: Host=dwarme-lx, Restarted=1, counter=22
Tue Jan 25 09:53:15 EST 2005: Host=dwarme-lx, Restarted=1, counter=23
Tue Jan 25 09:53:16 EST 2005: Host=dwarme-lx, Restarted=1, counter=24
Tue Jan 25 09:53:17 EST 2005: Host=dwarme-lx, Restarted=1, counter=25
Tue Jan 25 09:53:18 EST 2005: Host=dwarme-lx, Restarted=1, counter=26
Tue Jan 25 09:53:19 EST 2005: Host=dwarme-lx, Restarted=1, counter=27
Tue Jan 25 09:53:20 EST 2005: Host=dwarme-lx, Restarted=1, counter=28
Tue Jan 25 09:53:21 EST 2005: Host=dwarme-lx, Restarted=1, counter=28,
Signal: Usr2
Tue Jan 25 09:53:21 EST 2005: Host=dwarme-lx, Restarted=1, counter=29
Tue Jan 25 09:53:22 EST 2005: Host=dwarme-lx, Restarted=1, counter=30
Tue Jan 25 09:53:23 EST 2005: Host=dwarme-lx, Restarted=1, counter=31
Tue Jan 25 09:53:24 EST 2005: Host=dwarme-lx, Restarted=1, counter=32
Tue Jan 25 09:53:25 EST 2005: Host=dwarme-lx, Restarted=1, counter=33
----------------------------------

Finally, here is the output of INSTANCE-3, the job after it is
rescheduled the second time and then finally killed:

----------------------------------
Tue Jan 25 09:53:02 EST 2005: Host=jczisny-lx, Restarted=1, counter=0
Tue Jan 25 09:53:03 EST 2005: Host=jczisny-lx, Restarted=1, counter=1
Tue Jan 25 09:53:04 EST 2005: Host=jczisny-lx, Restarted=1, counter=2
Tue Jan 25 09:53:05 EST 2005: Host=jczisny-lx, Restarted=1, counter=3
Tue Jan 25 09:53:06 EST 2005: Host=jczisny-lx, Restarted=1, counter=4
Tue Jan 25 09:53:07 EST 2005: Host=jczisny-lx, Restarted=1, counter=5
Tue Jan 25 09:53:08 EST 2005: Host=jczisny-lx, Restarted=1, counter=6
Tue Jan 25 09:53:09 EST 2005: Host=jczisny-lx, Restarted=1, counter=7
Tue Jan 25 09:53:10 EST 2005: Host=jczisny-lx, Restarted=1, counter=8
Tue Jan 25 09:53:11 EST 2005: Host=jczisny-lx, Restarted=1, counter=8,
Signal: Usr2 <----- Delete is requested
Tue Jan 25 09:53:11 EST 2005: Host=jczisny-lx, Restarted=1, counter=9
Tue Jan 25 09:53:12 EST 2005: Host=jczisny-lx, Restarted=1, counter=10
<------ INSTANCE-1 dies
Tue Jan 25 09:53:13 EST 2005: Host=jczisny-lx, Restarted=1, counter=11
Tue Jan 25 09:53:14 EST 2005: Host=jczisny-lx, Restarted=1, counter=12
Tue Jan 25 09:53:15 EST 2005: Host=jczisny-lx, Restarted=1, counter=13
<------ INSTANCE-3 dies
...
Tue Jan 25 09:53:25 EST 2005: <------ INSTANCE-2 dies
----------------------------------

On Tue, 2005-01-25 at 04:34, Reuti wrote:

> Quoting Ron Chen <ron_chen_123 at yahoo.com>:
> 
> > Can your job scripts check whether the environment var
> > $RESTARTED is set to the number of times SGE has restarted
> > it?
> 
> For me, $RESTARTED is only 0 or 1, unless you are using application-level
> checkpointing, in which case it is always 2 when the job is restarted. It
> would be nice if it counted the number of restarts.
>  
> > And as an optimization, when $RESTARTED is 0, then
> > don't sleep or clear the job output file.
> > 
> > BTW, I am not getting the behaviour you are getting.
> > SGE always waits for the rescheduled jobs. Can you
> > post  a sample job script?
> 
> I can reproduce the behavior on 6.0u1 on lx24_amd64 and 5.3p6 on x86. I checked
> the clocks on the master and slaves, and in both cases it took around 30
> seconds until the old job was really killed.
> 
> > --- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > Hi Reuti,
> > > 
> > > Yes, delaying my script for a minute would be a work
> > > around for now. 
> > > However, I am trying to squeeze as much out of my
> > > machines as I can and
> > > I am thinking that SGE's behavior in this case is
> > > wrong.  It should not
> > > be running the same job at the same time on
> > > different CPUs under these
> > > or any other circumstances.
> > > 
> > > I think the proper sequence of events should be:
> > > 
> > > 1) Reschedule is requested
> > > 2) Job 1 gets the USR2 signal
> > > 3) After the notify time, job 1 exits
> > > 4) Job 2 is now scheduled to be run.
> > > 
> > > Does this seem right to you?
> 
> Yes, agreed. The interesting thing is that the job is immediately removed
> from the qstat output for the old node. After a qdel, by contrast, you can
> sometimes see the job staying there for a few more seconds until it
> disappears.
> 
> Cheers - Reuti
> 


