[GE users] Cleanup on Rescheduling and Deleting

Reuti reuti at staff.uni-marburg.de
Tue Jan 25 00:11:06 GMT 2005


Hi,

I see the problem, of course. It will not solve it inside SGE by delaying the 
start, but a workaround may be to sleep for a minute at the beginning of your 
script when it has been restarted, and then to copy /dev/null onto the output 
file $SGE_STDOUT_PATH to empty it.
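
A rough sketch of this workaround, not tested under SGE: it assumes that 
RESTARTED is 1 for a restarted job (as the quoted test output suggests) and 
that SGE_STDOUT_PATH is set in the job's environment. RESTART_DELAY and 
delay_if_restarted are made-up names, and the 60 seconds is only the "minute" 
mentioned above; it should probably be longer than the notify time of the queue.

```shell
#!/bin/bash
# Sketch only: RESTART_DELAY and delay_if_restarted are hypothetical names,
# they are not SGE settings.
: "${RESTARTED:=0}"
: "${RESTART_DELAY:=60}"

delay_if_restarted()
{
	if [ "$RESTARTED" = "1" ]
	then
		# Give the previous instance time to be killed
		sleep "$RESTART_DELAY"
		# Empty the output file left behind by the previous instance
		if [ -n "${SGE_STDOUT_PATH:-}" ]
		then
			cp /dev/null "$SGE_STDOUT_PATH"
		fi
	fi
}

# Call this before the job does any real work
delay_if_restarted
```

If stderr is not joined into stdout with "-j y", the stderr file would need 
the same treatment.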

But what is the reason to reschedule the job in your workflow, when there are 
enough free slots to get it rescheduled immediately?

Cheers - Reuti

BTW: KILL and STOP can't be caught. And as you already catch USR2 (which warns 
your script of the coming kill), you could exit right away in that subroutine - 
the script will be killed anyway.
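
A minimal sketch of such a handler, assuming the job was submitted with 
"qsub -notify" so that USR2 arrives before the kill. The names workDir and 
cleanup_and_exit are only for illustration, and the exit status 0 is an 
assumption as well:

```shell
#!/bin/bash
# Sketch only: workDir and cleanup_and_exit are hypothetical names.
workDir=$(mktemp -d)

cleanup_and_exit()
{
	# Remove the scratch data, then leave at once instead of
	# waiting for the KILL that follows the notify time
	rm -rf "$workDir"
	exit 0
}

# USR2 is the warning sent ahead of the kill
trap cleanup_and_exit USR2

# No traps for KILL or STOP: these two signals cannot be caught

# ... the real work of the job would go here ...
```

This way the old instance stops writing as soon as it is warned, which should 
shorten the overlap with the rescheduled one.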


Quoting Dan Gruhn <Dan.Gruhn at Group-W-Inc.com>:

> It turns out I had a bug in my testing script.  Once I fixed it, I have
> found that I can reliably get a USR2 signal for rescheduling and
> deleting if you use "qsub -notify".  This is good news for me.
> 
> However, there is indeed a problem with overlapping execution when a job
> is rescheduled.  That is, the first job continues to run for some time
> after the rescheduled job has started up.  I've tried to provide the
> information needed to reproduce this.  Perhaps someone knows of some
> parameters that govern this overlapping.  It happens both with the qmon
> GUI and the qmod command line.
> 
> The worst thing, as I've said before, is that overlapping jobs fight
> for use of the stdout/stderr files with one or the other
> getting in, but not both.  In my test case, I actually have three jobs
> overlapping, but it is just a test case.
> 
> Here is a testing script I have been using:
> 
> #! /bin/bash
> 
> #$ -o $HOME/gridoutput/$JOB_NAME.out -j y
> #$ -S /bin/bash
> 
> set -u
> 
> # Set up restart status
> : ${RESTARTED=0}
> 
> # Get our original execution search path if being run by SGE
> : ${SGE_O_PATH=$PATH}
> PATH=$SGE_O_PATH
> 
> # Get our host name if not being run by the SGE
> : ${HOSTNAME=`uname -n`}
> xeqHost=`echo $HOSTNAME | sed 's/\..*//'`
> 
> # Get the name of the host that originally submitted the job
> : ${SGE_O_HOST=`uname -n`}
> submitHost=`echo $SGE_O_HOST | sed 's/\..*//'`
> 
> # Get the name of the original user
> : ${SGE_O_LOGNAME=$USER}
> USER=$SGE_O_LOGNAME
> 
> # If SGE was not given a rep number
> : ${SGE_TASK_ID=1}
> if [ "$SGE_TASK_ID" = "undefined" ]
> then
> 	SGE_TASK_ID=1
> fi
> 
> # Get our command name if not being run by the SGE
> : ${REQUEST=$0}
> myName=$REQUEST
> cmdRoot=`basename $myName`
> myPath=`dirname $myName`
> 
> # Get to the default directory that we will use
> : ${SGE_O_WORKDIR=`pwd`}
> cd $SGE_O_WORKDIR
> 
> 
> trap "cleanupGo Usr1" USR1
> trap "cleanupGo Usr2" USR2
> 
> trap "cleanupExit Kill" KILL
> trap "cleanupExit Term" TERM
> trap "cleanupExit Quit" QUIT
> trap "cleanupExit Hup" HUP
> trap "cleanupExit Int" INT
> trap "cleanupExit Stop" STOP
> 
> outputFile="output.$$"
> 
> touch $outputFile
> mkdir temp$$
> 
> cleanupGo()
> {
> 	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" >>$outputFile
> 	mv $outputFile output.done.$$
> 	outputFile="output.done.$$"
> 	rm -rf temp$$
> }
> cleanupExit()
> {
> 	echo "`date`: Host: $xeqHost, Restarted: $RESTARTED, counter: $counter, Signal: $1" >>$outputFile
> 	exit
> }
> 
> counter=0
> while [ $counter -lt 100 ]
> do
> 	echo "`date`: Host: $xeqHost, Restarted=$RESTARTED, counter=$counter" >>$outputFile
> 	sleep 1
> 
> 	let ++counter
> done
> 
> I have run this with another script to have qmod reschedule the job
> periodically:
> 
> > date
> Mon Jan 24 13:04:52 EST 2005
> > qsub -notify -q high.q trial
> Your job 514 ("trial") has been submitted.
> > sleep 10
> > date
> Mon Jan 24 13:05:02 EST 2005
> > qmod -rq high.q
> Pushed rescheduling of job 514 on host jczisny-lx.group-w-inc.com
> > sleep 10
> > date
> Mon Jan 24 13:05:12 EST 2005
> > qmod -rq high.q
> Pushed rescheduling of job 514 on host dgruhn-lx.group-w-inc.com
> 
> 
> I get three output files from this running and finally killing the job
> via the qmon GUI.  My notify time on the queue is 5 seconds.
> 
> Job 1 output:
> Mon Jan 24 13:04:54 EST 2005: Host: jczisny-lx, Restarted=0, counter=0
> Mon Jan 24 13:04:55 EST 2005: Host: jczisny-lx, Restarted=0, counter=1
> Mon Jan 24 13:04:56 EST 2005: Host: jczisny-lx, Restarted=0, counter=2
> Mon Jan 24 13:04:57 EST 2005: Host: jczisny-lx, Restarted=0, counter=3
> Mon Jan 24 13:04:58 EST 2005: Host: jczisny-lx, Restarted=0, counter=4
> Mon Jan 24 13:04:59 EST 2005: Host: jczisny-lx, Restarted=0, counter=5
> Mon Jan 24 13:05:00 EST 2005: Host: jczisny-lx, Restarted=0, counter=6
> Mon Jan 24 13:05:01 EST 2005: Host: jczisny-lx, Restarted=0, counter=7
> Mon Jan 24 13:05:02 EST 2005: Host: jczisny-lx, Restarted=0, counter=8
> <----- qmod executed
> Mon Jan 24 13:05:03 EST 2005: Host: jczisny-lx, Restarted=0, counter=9
> Mon Jan 24 13:05:04 EST 2005: Host: jczisny-lx, Restarted=0, counter=10
> <------ Job 2 starts
> Mon Jan 24 13:05:05 EST 2005: Host: jczisny-lx, Restarted=0, counter=11
> Mon Jan 24 13:05:06 EST 2005: Host: jczisny-lx, Restarted=0, counter=12
> Mon Jan 24 13:05:07 EST 2005: Host: jczisny-lx, Restarted=0, counter=13
> Mon Jan 24 13:05:08 EST 2005: Host: jczisny-lx, Restarted=0, counter=14
> Mon Jan 24 13:05:09 EST 2005: Host: jczisny-lx, Restarted=0, counter=15
> Mon Jan 24 13:05:10 EST 2005: Host: jczisny-lx, Restarted=0, counter=16
> Mon Jan 24 13:05:11 EST 2005: Host: jczisny-lx, Restarted=0, counter=17
> Mon Jan 24 13:05:12 EST 2005: Host: jczisny-lx, Restarted=0, counter=18
> Mon Jan 24 13:05:13 EST 2005: Host: jczisny-lx, Restarted=0, counter=19
> Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted: 0, counter:
> 19, Signal: Usr2 <------ User signal 2 received
> Mon Jan 24 13:05:14 EST 2005: Host: jczisny-lx, Restarted=0, counter=20
> Mon Jan 24 13:05:15 EST 2005: Host: jczisny-lx, Restarted=0, counter=21
> Mon Jan 24 13:05:16 EST 2005: Host: jczisny-lx, Restarted=0, counter=22
> Mon Jan 24 13:05:17 EST 2005: Host: jczisny-lx, Restarted=0, counter=23
> Mon Jan 24 13:05:18 EST 2005: Host: jczisny-lx, Restarted=0, counter=24
> <----- Job dies 5 seconds after USR2
> 
> 
> Job 2 output:
> Mon Jan 24 13:05:04 EST 2005: Host: dgruhn-lx, Restarted=1, counter=0
> Mon Jan 24 13:05:05 EST 2005: Host: dgruhn-lx, Restarted=1, counter=1
> Mon Jan 24 13:05:06 EST 2005: Host: dgruhn-lx, Restarted=1, counter=2
> Mon Jan 24 13:05:07 EST 2005: Host: dgruhn-lx, Restarted=1, counter=3
> Mon Jan 24 13:05:08 EST 2005: Host: dgruhn-lx, Restarted=1, counter=4
> Mon Jan 24 13:05:09 EST 2005: Host: dgruhn-lx, Restarted=1, counter=5
> Mon Jan 24 13:05:10 EST 2005: Host: dgruhn-lx, Restarted=1, counter=6
> Mon Jan 24 13:05:11 EST 2005: Host: dgruhn-lx, Restarted=1, counter=7
> Mon Jan 24 13:05:12 EST 2005: Host: dgruhn-lx, Restarted=1, counter=8
> <----- qmod executed
> Mon Jan 24 13:05:13 EST 2005: Host: dgruhn-lx, Restarted=1, counter=9
> Mon Jan 24 13:05:14 EST 2005: Host: dgruhn-lx, Restarted=1, counter=10
> <------ Job 3 starts
> Mon Jan 24 13:05:15 EST 2005: Host: dgruhn-lx, Restarted=1, counter=11
> Mon Jan 24 13:05:16 EST 2005: Host: dgruhn-lx, Restarted=1, counter=12
> Mon Jan 24 13:05:17 EST 2005: Host: dgruhn-lx, Restarted=1, counter=13
> Mon Jan 24 13:05:18 EST 2005: Host: dgruhn-lx, Restarted=1, counter=14
> <------ Job 1 ends
> Mon Jan 24 13:05:19 EST 2005: Host: dgruhn-lx, Restarted=1, counter=15
> Mon Jan 24 13:05:20 EST 2005: Host: dgruhn-lx, Restarted=1, counter=16
> Mon Jan 24 13:05:21 EST 2005: Host: dgruhn-lx, Restarted=1, counter=17
> Mon Jan 24 13:05:22 EST 2005: Host: dgruhn-lx, Restarted=1, counter=18
> Mon Jan 24 13:05:23 EST 2005: Host: dgruhn-lx, Restarted=1, counter=19
> Mon Jan 24 13:05:24 EST 2005: Host: dgruhn-lx, Restarted=1, counter=20
> Mon Jan 24 13:05:25 EST 2005: Host: dgruhn-lx, Restarted=1, counter=21
> Mon Jan 24 13:05:26 EST 2005: Host: dgruhn-lx, Restarted=1, counter=22
> Mon Jan 24 13:05:27 EST 2005: Host: dgruhn-lx, Restarted=1, counter=23
> Mon Jan 24 13:05:28 EST 2005: Host: dgruhn-lx, Restarted=1, counter=24
> Mon Jan 24 13:05:29 EST 2005: Host: dgruhn-lx, Restarted=1, counter=25
> Mon Jan 24 13:05:30 EST 2005: Host: dgruhn-lx, Restarted=1, counter=26
> Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted: 1, counter:
> 26, Signal: Usr2 <------ User signal 2 received
> Mon Jan 24 13:05:31 EST 2005: Host: dgruhn-lx, Restarted=1, counter=27
> Mon Jan 24 13:05:32 EST 2005: Host: dgruhn-lx, Restarted=1, counter=28
> Mon Jan 24 13:05:33 EST 2005: Host: dgruhn-lx, Restarted=1, counter=29
> Mon Jan 24 13:05:34 EST 2005: Host: dgruhn-lx, Restarted=1, counter=30
> Mon Jan 24 13:05:35 EST 2005: Host: dgruhn-lx, Restarted=1, counter=31
> <----- Job dies 5 seconds after USR2
> 
> 
> Job 3 output:
> Mon Jan 24 13:05:14 EST 2005: Host: dwarme-lx, Restarted=1, counter=0
> Mon Jan 24 13:05:15 EST 2005: Host: dwarme-lx, Restarted=1, counter=1
> Mon Jan 24 13:05:16 EST 2005: Host: dwarme-lx, Restarted=1, counter=2
> Mon Jan 24 13:05:17 EST 2005: Host: dwarme-lx, Restarted=1, counter=3
> Mon Jan 24 13:05:18 EST 2005: Host: dwarme-lx, Restarted=1, counter=4
> Mon Jan 24 13:05:19 EST 2005: Host: dwarme-lx, Restarted=1, counter=5
> Mon Jan 24 13:05:20 EST 2005: Host: dwarme-lx, Restarted=1, counter=6
> Mon Jan 24 13:05:21 EST 2005: Host: dwarme-lx, Restarted=1, counter=7
> Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=8
> Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted: 1, counter: 8,
> Signal: Usr2 <------ User signal 2 received
> Mon Jan 24 13:05:22 EST 2005: Host: dwarme-lx, Restarted=1, counter=9
> Mon Jan 24 13:05:23 EST 2005: Host: dwarme-lx, Restarted=1, counter=10
> Mon Jan 24 13:05:24 EST 2005: Host: dwarme-lx, Restarted=1, counter=11
> Mon Jan 24 13:05:25 EST 2005: Host: dwarme-lx, Restarted=1, counter=12
> Mon Jan 24 13:05:26 EST 2005: Host: dwarme-lx, Restarted=1, counter=13
> <----- Job dies 5 seconds after USR2
> 
> 
> 
> On Fri, 2005-01-21 at 18:37, Ron Chen wrote: 
> 
> > Can you reproduce the problem with the command line
> > tool "qmod"?
> > 
> > And, you can use mailer to clean up the aborted jobs,
> > since the "mailer" can be anything, so you can point
> > it to a script, and in the script, do whatever you
> > want (since it runs as root).
> > 
> >  -Ron
> > 
> > -- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > Just by clicking the Reschedule button in the job
> > > display dialog box.
> > 
> > 
> > --
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > 
> 


