[GE users] Cleanup on Rescheduling and Deleting

Reuti reuti at staff.uni-marburg.de
Thu Jan 27 14:47:39 GMT 2005



> 2) If a queue is suspended, doing a reschedule on the job will cause
> the job to restart, but the original job still stays.  If the queue is
> unsuspended, then the job continues on as if nothing happened.  Deleting
> a job on a suspended queue seems to work okay.

What about "abusing" the checkpointing interface in this case? When the queue
is suspended, the job will then be migrated, i.e. rescheduled and deleted on
the node it was running on. You don't have to set up a complete checkpointing
interface at all. Just leave all the procedure entries empty ("NONE"), select
"xr" for the "when" in the checkpointing interface (userdefined), and specify
that interface when you submit the job. - Reuti

> 
> Could this explain some of the disappearing jobs that I've noticed being
> mentioned?
> 
> Dan
> 
> On Tue, 2005-01-25 at 04:34, Reuti wrote:
> 
> > Quoting Ron Chen <ron_chen_123 at yahoo.com>:
> > 
> > > Can your job scripts check the environment var
> > > $RESTARTED for the number of times SGE has restarted
> > > it?
> > 
> > For me, $RESTARTED is only 0 or 1. Unless you are using application-level
> > checkpointing, then it's always 2 in case it is restarted. But it would be
> > nice, if it would count the number of restarts.
> >  
> > > And as an optimization, when $RESTARTED is 0, then
> > > don't sleep or clear the job output file.
> > > 
> > > BTW, I am not getting the behaviour you are getting.
> > > SGE always waits for the rescheduled jobs. Can you
> > > post  a sample job script?
> > 
> > I can reproduce the behavior on 6.0u1 on lx24_amd64 and 5.3p6 on x86. I
> > checked the clocks on the master and slaves and got around 30 seconds in
> > both cases, until the old job really is killed.
> > 
> > > --- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > > Hi Reuti,
> > > > 
> > > > Yes, delaying my script for a minute would be a work
> > > > around for now. 
> > > > However, I am trying to squeeze as much out of my
> > > > machines as I can and
> > > > I am thinking that SGE's behavior in this case is
> > > > wrong.  It should not
> > > > be running the same job at the same time on
> > > > different CPUs under these
> > > > or any other circumstances.
> > > > 
> > > > I think the proper sequence of events should be:
> > > > 
> > > > 1) Reschedule is requested
> > > > 2) Job 1 gets the USR2 signal
> > > > 3) After the notify time, job 1 exits
> > > > 4) Job 2 is now scheduled to be run.
> > > > 
> > > > Does this seem right to you?
> > 
> > Yes, agreed. The interesting thing is, that the job is immediately removed
> > in the qstat output from the old node. I mean, in case of a qdel, you can
> > sometimes see the job staying there for some additional seconds until it
> > disappears.
> > 
> > Cheers - Reuti
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > 
> 
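
For reference, the workaround discussed in the quoted thread (wait a while and
clear the output only when the job has been restarted) could be sketched in a
job script roughly as below. OUTPUT_FILE, RESTART_DELAY and the function names
are made up for this example, and the 60 second default just mirrors the
"delay for a minute" idea. The USR2 trap assumes the job was submitted with
"qsub -notify", in which case SGE sends SIGUSR2 some seconds before SIGKILL.

```shell
#!/bin/sh
# Sketch only: OUTPUT_FILE, RESTART_DELAY and the function names are
# hypothetical and not part of SGE.
OUTPUT_FILE="${OUTPUT_FILE:-job_output.txt}"
RESTART_DELAY="${RESTART_DELAY:-60}"   # seconds to let the old instance die

# With -notify the job receives SIGUSR2 before the final SIGKILL,
# so use it to stop writing and exit on our own.
on_usr2() {
    echo "received USR2, exiting before SIGKILL" >&2
    exit 1
}
trap on_usr2 USR2

# SGE sets RESTARTED in the job environment; treat "unset" like 0.
# Any non-zero value (1, or 2 with application-level checkpointing)
# means this is a restarted run.
prepare_run() {
    if [ "${RESTARTED:-0}" -ne 0 ]; then
        sleep "$RESTART_DELAY"     # give the old instance time to be killed
        : > "$OUTPUT_FILE"         # then clear its partial output
    fi
}

prepare_run
echo "work starts here" >> "$OUTPUT_FILE"
```

On a first run (RESTARTED is 0) nothing is delayed or cleared, which is the
optimization Ron suggested.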






