[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Thu Jan 27 15:18:59 GMT 2005


Reuti,

"Abusing" the checkpointing interface in this way works quite nicely,
thank you.  However, it seems that one ought to be able to reschedule a
job from a suspended queue.
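
For anyone following along, here is a minimal sketch of the kind of
checkpointing environment Reuti describes below. It only writes the
definition to a file. The environment name "resched_only", the file path
and the ckpt_dir value are assumptions made for illustration, not taken
from the thread. The field names follow checkpoint(5) and should be
checked against the installed SGE version.

```shell
#!/bin/sh
# Sketch only: "resched_only" and the file location are hypothetical.
ckpt_file="${TMPDIR:-/tmp}/resched_only.ckpt"

# All procedure entries are NONE, so nothing is actually checkpointed;
# "when xr" is the setting quoted in the thread.
cat > "$ckpt_file" <<'EOF'
ckpt_name          resched_only
interface          userdefined
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               xr
EOF

# On a host with SGE manager rights, the file could then be registered and
# used at submit time (job.sh is a placeholder):
#   qconf -Ackpt "$ckpt_file"
#   qsub -ckpt resched_only job.sh
echo "wrote $ckpt_file"
```

Depending on the SGE version, the environment may also have to be attached
to the queues that should offer it (in 6.0, through the queue's ckpt_list)
before qsub -ckpt accepts it.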

I will now collect all of this together and write up an issue report.

Dan

On Thu, 2005-01-27 at 09:47, Reuti wrote:

> > 2) If a queue is suspended, doing a reschedule on the job will cause
> > the job to restart, but the original job still stays.  If the queue is
> > unsuspended, then the job continues on as if nothing happened.  Deleting
> > a job on a suspended queue seems to work okay.
> 
> What about "abusing" the checkpointing interface in this case? When the queue 
> is suspended, the job will be migrated, i.e. rescheduled and deleted on the node 
> it was running on. You don't have to set up a complete checkpointing interface 
> at all. Just leave all the procedure entries empty ("NONE"), select "xr" for 
> the "when" in the checkpointing interface (userdefined), and specify that 
> interface when you submit the job. - Reuti
> 
> > 
> > Could this explain some of the disappearing jobs that I've noticed being
> > mentioned?
> > 
> > Dan
> > 
> > On Tue, 2005-01-25 at 04:34, Reuti wrote:
> > 
> > > Quoting Ron Chen <ron_chen_123 at yahoo.com>:
> > > 
> > > > Can your job scripts check whether the environment var
> > > > $RESTARTED is set to the number of times SGE has restarted
> > > > it?
> > > 
> > > For me, $RESTARTED is only 0 or 1, unless you are using application-level
> > > checkpointing; then it's always 2 when the job is restarted. It would be
> > > nice if it counted the number of restarts.
> > >  
> > > > And as an optimization, when $RESTARTED is 0, then
> > > > don't sleep or clear the job output file.
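
A rough sketch of how a job script might apply this check. The function
name prepare_run, its arguments and the 60-second default are hypothetical,
chosen only to match the one-minute delay discussed in this thread. An
unset $RESTARTED is treated as a first run.

```shell
#!/bin/sh
# prepare_run OUTFILE [DELAY_SECONDS]
# Hypothetical helper: on a first run it does nothing; on a restart it
# waits for the old instance to go away and then clears the old output.
prepare_run() {
    outfile=$1
    delay=${2:-60}

    # RESTARTED is set by SGE; treat "unset" like 0 (not restarted).
    if [ "${RESTARTED:-0}" = "0" ]; then
        echo "first run"
        return 0
    fi

    # Restarted: give the previous instance time to be killed, then
    # truncate the output file it may still have been writing to.
    sleep "$delay"
    : > "$outfile"
    echo "restarted"
}
```

Inside a job script this would be called once before the real work starts,
for example as: prepare_run "$HOME/job.out" 60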
> > > > 
> > > > BTW, I am not getting the behaviour you are getting.
> > > > SGE always waits for the rescheduled jobs. Can you
> > > > post  a sample job script?
> > > 
> > > I can reproduce the behavior on 6.0u1 on lx24_amd64 and 5.3p6 on x86. I
> > > checked the clocks on the master and slaves and got around 30 seconds in
> > > both cases, until the old job really is killed.
> > > 
> > > > --- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > > > Hi Reuti,
> > > > > 
> > > > > > Yes, delaying my script for a minute would be a
> > > > > > workaround for now. 
> > > > > However, I am trying to squeeze as much out of my
> > > > > machines as I can and
> > > > > I am thinking that SGE's behavior in this case is
> > > > > wrong.  It should not
> > > > > be running the same job at the same time on
> > > > > different CPUs under these
> > > > > or any other circumstances.
> > > > > 
> > > > > I think the proper sequence of events should be:
> > > > > 
> > > > > 1) Reschedule is requested
> > > > > 2) Job 1 gets the USR2 signal
> > > > > 3) After the notify time, job 1 exits
> > > > > 4) Job 2 is now scheduled to be run.
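
Steps 2 and 3 depend on the job script reacting to USR2. A minimal sketch
of such a handler follows, assuming the job is submitted with qsub -notify
so that SGE sends SIGUSR2 ahead of SIGKILL. The scratch file name and the
cleanup function are hypothetical placeholders for whatever the real job
needs to tidy up.

```shell
#!/bin/sh
# Hypothetical scratch file standing in for the job's temporary state.
SCRATCH_FILE="${TMPDIR:-/tmp}/job_scratch.$$"

# Remove temporary state so a rescheduled copy starts from a clean slate.
cleanup() {
    rm -f "$SCRATCH_FILE"
    echo "cleaned up"
}

# With "qsub -notify", USR2 arrives some seconds before the KILL;
# exit promptly so the old instance is gone before the new one runs.
trap 'cleanup; exit 0' USR2

: > "$SCRATCH_FILE"

# The real work would follow here. Running it in the background and
# waiting lets the shell handle the signal right away instead of only
# after the program returns, e.g.:
#   ./my_program &      # my_program is a placeholder
#   wait $!
```

The time between USR2 and the final kill is the queue's notify setting, so
the cleanup has to fit inside that window.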
> > > > > 
> > > > > Does this seem right to you?
> > > 
> > > Yes, agreed. The interesting thing is, that the job is immediately removed
> > > in the qstat output from the old node. I mean, in case of a qdel, you can
> > > sometimes see the job staying there for some additional seconds until it
> > > disappears.
> > > 
> > > Cheers - Reuti
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 
> > 
> 


