[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Tue Jan 25 01:58:12 GMT 2005


Hi Reuti,

Yes, delaying my script for a minute would be a workaround for now.
However, I am trying to squeeze as much out of my machines as I can and
I am thinking that SGE's behavior in this case is wrong.  It should not
be running the same job at the same time on different CPUs under these
or any other circumstances.
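
For reference, the workaround Reuti describes (quoted below) might look roughly like the following at the top of a job script. This is only a sketch: RESTARTED and SGE_STDOUT_PATH are the variables already mentioned in this thread, while RESTART_DELAY and delay_if_restarted are hypothetical names made up for the illustration, and the 60 second default is just the "minute" from the suggestion.

```shell
#!/bin/bash
# Sketch only. RESTARTED and SGE_STDOUT_PATH come from SGE; the defaults
# below and the RESTART_DELAY knob are assumptions for illustration.
: "${RESTARTED:=0}"
: "${RESTART_DELAY:=60}"   # hypothetical; should exceed the queue's notify time

delay_if_restarted() {
    if [ "$RESTARTED" != "0" ]; then
        # Give the previous instance time to be killed before we start.
        sleep "$RESTART_DELAY"
        # Empty the shared stdout file so old and new output do not mix.
        if [ -n "${SGE_STDOUT_PATH:-}" ] && [ -f "$SGE_STDOUT_PATH" ]; then
            cat /dev/null > "$SGE_STDOUT_PATH"
        fi
    fi
}

delay_if_restarted
# ... the real work of the job would start here ...
```

The delay only hides the overlap; it does not stop SGE from starting the second instance while the first is still running.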

I think the proper sequence of events should be:

1) Reschedule is requested
2) Job 1 gets the USR2 signal
3) After the notify time, job 1 exits
4) Job 2 is now scheduled to be run.

Does this seem right to you?

(I suppose it is okay to schedule Job 2 earlier, but keep it suspended
until Job 1 exits.)

I came across this problem in an unusual way.  I had some jobs to run and,
although some of my lower speed machines were not ideal, I allowed
queuing to them.  Later, when some higher speed machines became
available, I realized it would be faster to reschedule some jobs running
on lower speed machines to the high speed machines even though they
would have to start over.  After having done this, I saw the temporary
directories being left around and started exploring to see just what was
going on.

So, now I know about USR2, which will help with my temporary directories
if a job is killed.  The overlap of execution is still a problem when a
reschedule actually happens, though.  I have some bookkeeping files that
track pieces of my overall job, and if job 1 dies after job 2 starts, job
2 will see that the work it is being asked to do is already in progress
and will exit with an error message telling the user he/she is trying to
duplicate work already in progress.
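
One way a script could tolerate a short overlap is to wait a bounded time for the previous instance to give up its claim before reporting duplicate work. The sketch below is not my real bookkeeping code: LOCK_DIR, WAIT_LIMIT, acquire_claim and release_claim are hypothetical names, and it assumes the lock directory sits on a filesystem shared by all execution hosts.

```shell
#!/bin/bash
# Sketch only: all names here are hypothetical, and LOCK_DIR must be on a
# filesystem that every execution host can see.
: "${LOCK_DIR:=${TMPDIR:-/tmp}/piece.lock}"
: "${WAIT_LIMIT:=30}"      # seconds to wait for the previous instance to leave

acquire_claim() {
    local waited=0
    # mkdir either creates the directory or fails, so only one instance wins.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        if [ "$waited" -ge "$WAIT_LIMIT" ]; then
            return 1       # still claimed: treat it as real duplicate work
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

release_claim() {
    # Call this from the USR2 handler and on normal completion.
    rmdir "$LOCK_DIR" 2>/dev/null
}
```

With a wait limit longer than the queue's notify time, the restarted job would normally get the claim once the first instance has handled USR2 and released it.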

My real script does indeed exit when the USR2 is seen; this was just a
test script, and I was trying to test everything to see what was going
on.  Thanks for the reminder about STOP and KILL in any case.
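
A handler that exits on USR2 can be as small as the sketch below. WORK_DIR and on_usr2 are hypothetical names used only for this illustration, and the exit status of 1 is an arbitrary nonzero choice.

```shell
#!/bin/bash
# Sketch only: WORK_DIR and on_usr2 are hypothetical names.
WORK_DIR=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")

on_usr2() {
    # With "qsub -notify", USR2 arrives shortly before the KILL, so clean
    # up now and leave instead of waiting to be killed.
    rm -rf "$WORK_DIR"
    exit 1
}
trap on_usr2 USR2
# KILL and STOP cannot be trapped, so no handlers are installed for them.
```

The main work of the job would follow the trap line; bash runs the handler once the command in progress when the signal arrives has returned.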

Dan 

On Mon, 2005-01-24 at 19:11, Reuti wrote:

> Hi,
> 
> I see the problem, of course. Although it does not fix the start delay inside 
> SGE itself, a workaround may be to sleep for a minute at the beginning of your 
> script when it has been restarted, and to copy /dev/null onto the output file 
> $SGE_STDOUT_PATH to empty it.
> 
> But what is the reason to reschedule the job in your workflow, when there are 
> enough free slots to get it rescheduled immediately?
> 
> Cheers - Reuti
> 
> BTW: kill and stop can't be caught. And as you already catch usr2 (which alerts 
> your script of the coming kill), you could exit already in this subroutine - 
> the script will be killed anyway.
> 
> 
> Quoting Dan Gruhn <Dan.Gruhn at Group-W-Inc.com>:
> 
> > It turns out I had a bug in my testing script.  Once I fixed it, I have
> > found that I can reliably get a USR2 signal for rescheduling and
> > deleting when I use "qsub -notify".  This is good news for me.
> > 
> > However, there is indeed a problem with overlapping execution when a job
> > is rescheduled.  That is, the first job continues to run for some time
> > after the rescheduled job has started up.  I've tried to provide the
> > information needed to reproduce this.  Perhaps someone knows of some
> > parameters that govern this overlapping.  It happens both with the qmon
> > GUI and the qmod command line.
> > 
> > The worst thing, as I've said before, is that overlapping jobs fight
> > for use of the stdout/stderr files with one or the other
> > getting in, but not both.  In my test case, I actually have three jobs
> > overlapping, but it is just a test case.
> > 
> > Here is a testing script I have been using:
> > 
> > #! /bin/bash
> > 
> > #$ -o $HOME/gridoutput/$JOB_NAME.out -j y
> > #$ -S /bin/bash
> > 
> > set -u
> > 
> > # Set up restart status
> > : ${RESTARTED=0}
> > 
> > # Get our original execution search path if being run by SGE
> > : ${SGE_O_PATH=$PATH}
> > PATH=$SGE_O_PATH
> > 
> > # Get our host name if not being run by the SGE
> > : ${HOSTNAME=`uname -n`}
> > xeqHost=`echo $HOSTNAME | sed 's/\..*//'`
> > 
> > # Get the name of the host that originally submitted the job
> > : ${SGE_O_HOST=`uname -n`}
> > submitHost=`echo $SGE_O_HOST | sed 's/\..*//'`
> > 
> > # Get the name of the original user
> > : ${SGE_O_LOGNAME=$USER}
> > USER=$SGE_O_LOGNAME
> > 
> > # If SGE was not given a rep number
> > : ${SGE_TASK_ID=1}
> > if [ "$SGE_TASK_ID" = "undefined" ]
> > then
> > 	SGE_TASK_ID=1
> > fi
> > 
> > # Get our command name if not being run by the SGE
> > : ${REQUEST=$0}
> > myName=$REQUEST
> > cmdRoot=`basename $myName`
> > myPath=`dirname $myName`
> > 
> > # Get to the default directory that we will use
> > : ${SGE_O_WORKDIR=`pwd`}
> > cd $SGE_O_WORKDIR
> > 
> > 
> > trap "cleanupGo Usr1" USR1
> > trap "cleanupGo Usr2" USR2
> > 
> > trap "cleanupExit Kill" KILL
> > trap "cleanupExit Term" TERM
> > trap "cleanupExit Quit" QUIT
> > trap "cleanupExit Hup" HUP
> > trap "cleanupExit Int" INT
> > trap "cleanupExit Stop" STOP
> > 
> > outputFile="output.$$"
> > 
> > touch $outputFile
> > mkdir temp$$
> > 
> > cleanupGo()
> > {
> > 	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" \
> > 		>>$outputFile
> > 	mv $outputFile output.done.$$
> > 	outputFile="output.done.$$"
> > 	rm -rf temp$$
> > }
> > cleanupExit()
> > {
> > 	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" \
> > 		>>$outputFile
> > 	exit
> > }
> > 
> > counter=0
> > while [ $counter -lt 100 ]
> > do
> > 	echo "`date`: Host: $xeqHost, Restarted=$RESTARTED, counter=$counter" \
> > 		>>$outputFile
> > 	sleep 1
> > 
> > 	let ++counter
> > done
> > 
> > I have run this with another script to have qmod reschedule the job
> > periodically:
> > 
> > > date
> > Mon Jan 24 13:04:52 EST 2005
> > > qsub -notify -q high.q trial
> > Your job 514 ("trial") has been submitted.
> > > sleep 10
> > > date
> > Mon Jan 24 13:05:02 EST 2005
> > > qmod -rq high.q
> > Pushed rescheduling of job 514 on host jczisny-lx.group-w-inc.com
> > > sleep 10
> > > date
> > Mon Jan 24 13:05:12 EST 2005
> > > qmod -rq high.q
> > Pushed rescheduling of job 514 on host dgruhn-lx.group-w-inc.com
> > 
> > 
> > I get three output files from this running and finally killing the job
> > via the qmon GUI.  My notify time on the queue is 5 seconds.
> > 
> > Job 1 output:
> > Mon Jan 24 13:04:54 EST 2005: Host: jczisny-lx, Restarted=0, counter=0
> > Mon Jan 24 13:04:55 EST 2005: Host: jczisny-lx, Restarted=0, counter=1
> > Mon Jan 24 13:04:56 EST 2005: Host: jczisny-lx, Restarted=0, counter=2
> > Mon Jan 24 13:04:57 EST 2005: Host: jczisny-lx, Restarted=0, counter=3
> > Mon Jan 24 13:04:58 EST 2005: Host: jczisny-lx, Restarted=0, counter=4
> > Mon Jan 24 13:04:59 EST 2005: Host: jczisny-lx, Restarted=0, counter=5
> > Mon Jan 24 13:05:00 EST 2005: Host: jczisny-lx, Restarted=0, counter=6
> > Mon Jan 24 13:05:01 EST 2005: Host: jczisny-lx, Restarted=0, counter=7
> > Mon Jan 24 13:05:02 EST 2005: Host: jczisny-lx, Restarted=0, counter=8
> > <----- qmod executed
> > Mon Jan 24 13:05:03 EST 2005: Host: jczisny-lx, Restarted=0, counter=9
> > Mon Jan 24 13:05:04 EST 2005: Host: jczisny-lx, Restarted=0, counter=10
> > <------ Job 2 starts
> > Mon Jan 24 13:05:05 EST 2005: Host: jczisny-lx, Restarted=0, counter=11
> > Mon Jan 24 13:05:06 EST 2005: Host: jczisny-lx, Restarted=0, counter=12
> > Mon Jan 24 13:05:07 EST 2005: Host: jczisny-lx, Restarted=0, counter=13
> > Mon Jan 24 13:05:08 EST 2005: Host: jczisny-lx, Restarted=0, counter=14
> > Mon Jan 24 13:05:09 EST 2005: Host: jczisny-lx, Restarted=0, counter=15
> > Mon Jan 24 13:05:10 EST 2005: Host: jczisny-lx, Restarted=0, counter=16
> > Mon Jan 24 13:05:11 EST 2005: Host: jczisny-lx, Restarted=0, counter=17
> > Mon Jan 24 13:05:12 EST 2005: Host: jczisny-lx, Restarted=0, counter=18
> > Mon Jan 24 13:05:13 EST 2005: Host: jczisny-lx, Restarted=0, counter=19
> > Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted: 0, counter:
> > 19, Signal: Usr2 <------ User signal 2 received
> > Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted=0, counter=20
> > Mon Jan 24 13:05:15 EST 2005: Host: jczisny-lx, Restarted=0, counter=21
> > Mon Jan 24 13:05:16 EST 2005: Host: jczisny-lx, Restarted=0, counter=22
> > Mon Jan 24 13:05:17 EST 2005: Host: jczisny-lx, Restarted=0, counter=23
> > Mon Jan 24 13:05:18 EST 2005: Host: jczisny-lx, Restarted=0, counter=24
> > <----- Job dies 5 seconds after USR2
> > 
> > 
> > Job 2 output:
> > Mon Jan 24 13:05:04 EST 2005: Host: dgruhn-lx, Restarted=1, counter=0
> > Mon Jan 24 13:05:05 EST 2005: Host: dgruhn-lx, Restarted=1, counter=1
> > Mon Jan 24 13:05:06 EST 2005: Host: dgruhn-lx, Restarted=1, counter=2
> > Mon Jan 24 13:05:07 EST 2005: Host: dgruhn-lx, Restarted=1, counter=3
> > Mon Jan 24 13:05:08 EST 2005: Host: dgruhn-lx, Restarted=1, counter=4
> > Mon Jan 24 13:05:09 EST 2005: Host: dgruhn-lx, Restarted=1, counter=5
> > Mon Jan 24 13:05:10 EST 2005: Host: dgruhn-lx, Restarted=1, counter=6
> > Mon Jan 24 13:05:11 EST 2005: Host: dgruhn-lx, Restarted=1, counter=7
> > Mon Jan 24 13:05:12 EST 2005: Host: dgruhn-lx, Restarted=1, counter=8
> > <----- qmod executed
> > Mon Jan 24 13:05:13 EST 2005: Host: dgruhn-lx, Restarted=1, counter=9
> > Mon Jan 24 13:05:14 EST 2005: Host: dgruhn-lx, Restarted=1, counter=10
> > <------ Job 3 starts
> > Mon Jan 24 13:05:15 EST 2005: Host: dgruhn-lx, Restarted=1, counter=11
> > Mon Jan 24 13:05:16 EST 2005: Host: dgruhn-lx, Restarted=1, counter=12
> > Mon Jan 24 13:05:17 EST 2005: Host: dgruhn-lx, Restarted=1, counter=13
> > Mon Jan 24 13:05:18 EST 2005: Host: dgruhn-lx, Restarted=1, counter=14
> > <------ Job 1 ends
> > Mon Jan 24 13:05:19 EST 2005: Host: dgruhn-lx, Restarted=1, counter=15
> > Mon Jan 24 13:05:20 EST 2005: Host: dgruhn-lx, Restarted=1, counter=16
> > Mon Jan 24 13:05:21 EST 2005: Host: dgruhn-lx, Restarted=1, counter=17
> > Mon Jan 24 13:05:22 EST 2005: Host: dgruhn-lx, Restarted=1, counter=18
> > Mon Jan 24 13:05:23 EST 2005: Host: dgruhn-lx, Restarted=1, counter=19
> > Mon Jan 24 13:05:24 EST 2005: Host: dgruhn-lx, Restarted=1, counter=20
> > Mon Jan 24 13:05:25 EST 2005: Host: dgruhn-lx, Restarted=1, counter=21
> > Mon Jan 24 13:05:26 EST 2005: Host: dgruhn-lx, Restarted=1, counter=22
> > Mon Jan 24 13:05:27 EST 2005: Host: dgruhn-lx, Restarted=1, counter=23
> > Mon Jan 24 13:05:28 EST 2005: Host: dgruhn-lx, Restarted=1, counter=24
> > Mon Jan 24 13:05:29 EST 2005: Host: dgruhn-lx, Restarted=1, counter=25
> > Mon Jan 24 13:05:30 EST 2005: Host: dgruhn-lx, Restarted=1, counter=26
> > Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted: 1, counter:
> > 26, Signal: Usr2 <------ User signal 2 received
> > Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted=1, counter=27
> > Mon Jan 24 13:05:32 EST 2005: Host: dgruhn-lx, Restarted=1, counter=28
> > Mon Jan 24 13:05:33 EST 2005: Host: dgruhn-lx, Restarted=1, counter=29
> > Mon Jan 24 13:05:34 EST 2005: Host: dgruhn-lx, Restarted=1, counter=30
> > Mon Jan 24 13:05:35 EST 2005: Host: dgruhn-lx, Restarted=1, counter=31
> > <----- Job dies 5 seconds after USR2
> > 
> > 
> > Job 3 output:
> > Mon Jan 24 13:05:14 EST 2005: Host: dwarme-lx, Restarted=1, counter=0
> > Mon Jan 24 13:05:15 EST 2005: Host: dwarme-lx, Restarted=1, counter=1
> > Mon Jan 24 13:05:16 EST 2005: Host: dwarme-lx, Restarted=1, counter=2
> > Mon Jan 24 13:05:17 EST 2005: Host: dwarme-lx, Restarted=1, counter=3
> > Mon Jan 24 13:05:18 EST 2005: Host: dwarme-lx, Restarted=1, counter=4
> > Mon Jan 24 13:05:19 EST 2005: Host: dwarme-lx, Restarted=1, counter=5
> > Mon Jan 24 13:05:20 EST 2005: Host: dwarme-lx, Restarted=1, counter=6
> > Mon Jan 24 13:05:21 EST 2005: Host: dwarme-lx, Restarted=1, counter=7
> > Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=8
> > Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted: 1, counter: 8,
> > Signal: Usr2 <------ User signal 2 received
> > Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=9
> > Mon Jan 24 13:05:23 EST 2005: Host: dwarme-lx, Restarted=1, counter=10
> > Mon Jan 24 13:05:24 EST 2005: Host: dwarme-lx, Restarted=1, counter=11
> > Mon Jan 24 13:05:25 EST 2005: Host: dwarme-lx, Restarted=1, counter=12
> > Mon Jan 24 13:05:26 EST 2005: Host: dwarme-lx, Restarted=1, counter=13
> > <----- Job dies 5 seconds after USR2
> > 
> > 
> > 
> > On Fri, 2005-01-21 at 18:37, Ron Chen wrote: 
> > 
> > > Can you reproduce the problem with the command line
> > > tool "qmod"?
> > > 
> > > And, you can use mailer to clean up the aborted jobs,
> > > since the "mailer" can be anything, so you can point
> > > it to a script, and in the script, do whatever you
> > > want (since it runs as root).
> > > 
> > >  -Ron
> > > 
> > > -- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > > Just by clicking the Reschedule button in the job
> > > > display dialog box.
> > > 
> > > 
> > > 
> > > 
> > > --
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 
> > 
> 


