[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Fri Jan 28 18:26:44 GMT 2005


Reuti,

I've found an unintended consequence to "abusing" the checkpointing
interface like this.  In my case, I am running jobs that try to scavenge
extra CPU cycles from machines that are idle.  If a machine becomes busy
enough, then the suspend threshold kicks in and the job is suspended
until the machine is less busy.  With this workaround, the queued jobs
get restarted each time a machine crosses the suspend threshold.  I've
tried to pretty well fill up the slots on all of our machines, which
worked okay before.  Now, however, jobs are getting requeued a lot and
not much work is getting done.
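
For reference, the workaround amounts to a checkpointing environment
roughly like the one below. This is a sketch from memory of checkpoint(5),
not a dump of my actual setup: the name "suspend_migrate" and the ckpt_dir
value are placeholders I made up, and 5.3 may expect additional fields.
As I read the man page, the "x" in "when" asks for the job to be aborted
and migrated whenever it is suspended, which would cover the suspend
threshold as well as a manual suspend.

```text
ckpt_name          suspend_migrate
interface          userdefined
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               xr
```

The job is then submitted with "qsub -ckpt suspend_migrate" followed by
the job script.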

I was thinking of suspension as a manual operation on a queue, but it
seems the same thing happens when an individual job is suspended
automatically.

Dan

On Thu, 2005-01-27 at 10:18, Dan Gruhn wrote:

> Reuti,
> 
> "Abusing" the checkpointing interface in this way works quite nicely,
> thank you.  However, it seems that one ought to be able to reschedule
> a job from a suspended queue.
> 
> I will now collect all of this together and write up an issue report.
> 
> Dan
> 
> On Thu, 2005-01-27 at 09:47, Reuti wrote: 
> 
> > > > 2) If a queue is suspended, doing a reschedule on the job will cause
> > > > the job to restart, but the original job still stays.  If the queue is
> > > unsuspended, then the job continues on as if nothing happened.  Deleting
> > > a job on a suspended queue seems to work okay.
> > 
> > What about "abusing" the checkpointing interface in this case? When the queue 
> > is suspended, the job will be migrated, i.e. rescheduled and deleted on the node 
> > it was running on. You don't have to set up a complete checkpointing interface 
> > at all. Just leave all the procedure entries empty ("NONE"), select "xr" for 
> > the "when" in the checkpointing interface (userdefined), and specify that 
> > interface when you submit the job. - Reuti
> > 
> > > 
> > > Could this explain some of the disappearing jobs that I've noticed being
> > > mentioned?
> > > 
> > > Dan
> > > 
> > > On Tue, 2005-01-25 at 04:34, Reuti wrote:
> > > 
> > > > Quoting Ron Chen <ron_chen_123 at yahoo.com>:
> > > > 
> > > > > > Can your job scripts check whether the environment var
> > > > > > $RESTARTED is set to the number of times SGE has
> > > > > > restarted it?
> > > > 
> > > > > For me, $RESTARTED is only 0 or 1. Unless you are using application-level
> > > > > checkpointing, then it's always 2 in case it is restarted. But it would be
> > > > > nice, if it would count the number of restarts.
> > > >  
> > > > > And as an optimization, when $RESTARTED is 0, then
> > > > > don't sleep or clear the job output file.
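
The optimization suggested above could look roughly like this at the top
of a job script. It is only a sketch: RESTARTED is the variable SGE sets,
but prepare_run, CLEANUP_DELAY and the output file argument are
hypothetical names, and the 60 second default is simply the "minute"
mentioned elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only. RESTARTED comes from SGE; per this thread it is 0 on a
# first run and nonzero after a restart. Everything else is hypothetical.

prepare_run() {
    outfile=$1
    if [ "${RESTARTED:-0}" -eq 0 ]; then
        # First run: nothing to wait for and nothing to clear.
        return 0
    fi
    # Restarted: give the old instance time to be killed, then
    # truncate the output file so the rerun starts clean.
    sleep "${CLEANUP_DELAY:-60}"
    : > "$outfile"
}
```

A job script would call prepare_run with its output file before doing
any real work.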
> > > > > 
> > > > > BTW, I am not getting the behaviour you are getting.
> > > > > SGE always waits for the rescheduled jobs. Can you
> > > > > post  a sample job script?
> > > > 
> > > > > I can reproduce the behavior on 6.0u1 on lx24_amd64 and 5.3p6 on x86. I checked
> > > > > the clocks on the master and slaves and got around 30 seconds in both cases,
> > > > > until the old job really is killed.
> > > > 
> > > > > --- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > > > > Hi Reuti,
> > > > > > 
> > > > > > Yes, delaying my script for a minute would be a work
> > > > > > around for now. 
> > > > > > However, I am trying to squeeze as much out of my
> > > > > > machines as I can and
> > > > > > I am thinking that SGE's behavior in this case is
> > > > > > wrong.  It should not
> > > > > > be running the same job at the same time on
> > > > > > different CPUs under these
> > > > > > or any other circumstances.
> > > > > > 
> > > > > > I think the proper sequence of events should be:
> > > > > > 
> > > > > > 1) Reschedule is requested
> > > > > > 2) Job 1 gets the USR2 signal
> > > > > > 3) After the notify time, job 1 exits
> > > > > > 4) Job 2 is now scheduled to be run.
> > > > > > 
> > > > > > Does this seem right to you?
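
Seen from the job's side, steps 2 and 3 above might be handled as in the
sketch below. It assumes the job is submitted with "qsub -notify", so that
the job receives SIGUSR2 some seconds before the final SIGKILL. The names
on_usr2, run_units and run_one_unit are hypothetical, and run_one_unit is
an empty placeholder for the real work.

```shell
#!/bin/sh
# Sketch only: assumes submission with "qsub -notify" so that SIGUSR2
# arrives before the job is killed. All function names are hypothetical.

STOP_REQUESTED=0

# Handler: only record the request; the work loop decides when to stop.
on_usr2() {
    STOP_REQUESTED=1
}
trap on_usr2 USR2

# Placeholder for one unit of real work.
run_one_unit() {
    :
}

# Run up to $1 units, checking for a stop request between units.
# Returns 0 if every unit ran, 1 if a stop was requested first.
run_units() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        if [ "$STOP_REQUESTED" -eq 1 ]; then
            return 1
        fi
        run_one_unit "$i"
        i=$((i + 1))
    done
    return 0
}
```

Checking the flag between units lets the job finish within the notify
time instead of being cut off mid-write.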
> > > > 
> > > > > Yes, agreed. The interesting thing is, that the job is immediately removed in
> > > > > the qstat output from the old node. I mean, in case of a qdel, you can 
> > > > sometimes see the job staying there for some additional seconds until it 
> > > > disappears.
> > > > 
> > > > Cheers - Reuti
> > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > > 
> > > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net


