[GE users] Cleanup on Rescheduling and Deleting

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Thu Jan 27 14:15:04 GMT 2005


I've continued to experiment with restarting, the -notify switch, and
the USR1 and USR2 signals.  Some additional odd behaviors that I've
noticed:

1) If I use the same load parameter for the Load Threshold, the Suspend
Threshold, and the Load Adjustment, and I use -notify in qsub, it looks
like the scheduler can start a job, raise the load by the Load
Adjustment value, see that this now passes the Suspend Threshold, and
send the USR1 signal immediately.  However, my script has not yet gotten
far enough to ignore the USR1 signal, so it terminates (the default
action for USR1 on Linux).  What I see is that my job leaves the
Pending Jobs list and immediately shows up on the Finished Jobs list.
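
To narrow that window, the job can replace the default USR1/USR2 action
as the very first thing it does.  Below is a minimal sketch in Python,
assuming the job is a Python script; the names received and
install_notify_handlers are made up for illustration and are not part of
SGE.  It cannot help if the signal arrives before the interpreter has
started, so it only shrinks the race, it does not remove it.

```python
import signal

# Hypothetical bookkeeping (not an SGE feature): which warnings have arrived.
received = {"USR1": False, "USR2": False}


def _on_usr1(signum, frame):
    # With qsub -notify, USR1 warns that a suspend (SIGSTOP) is pending.
    received["USR1"] = True


def _on_usr2(signum, frame):
    # With qsub -notify, USR2 warns that a kill (SIGKILL) is pending.
    received["USR2"] = True


def install_notify_handlers():
    """Replace the default action (terminate) for USR1 and USR2."""
    signal.signal(signal.SIGUSR1, _on_usr1)
    signal.signal(signal.SIGUSR2, _on_usr2)


# Do this before any real work so an early USR1 no longer ends the job.
install_notify_handlers()
```

The rest of the job can then poll received["USR2"] at convenient points
and wind down when it becomes true.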

2) If a queue is suspended, doing a reschedule on the job will cause
the job to restart, but the original job still stays.  If the queue is
unsuspended, then the job continues on as if nothing happened.  Deleting
a job on a suspended queue seems to work okay.

Could this explain some of the disappearing jobs that I've seen
mentioned?

Dan

On Tue, 2005-01-25 at 04:34, Reuti wrote:

> Quoting Ron Chen <ron_chen_123 at yahoo.com>:
> 
> > Can your job scripts check whether the environment var
> > $RESTARTED is set to the number of times SGE has restarted
> > it?
> 
> For me, $RESTARTED is only ever 0 or 1, unless you are using
> application-level checkpointing, in which case it's always 2 when the job
> is restarted. But it would be nice if it counted the number of restarts.
>  
> > And as an optimization, when $RESTARTED is 0, then
> > don't sleep or clear the job output file.
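
A minimal sketch of that check, assuming a Python job script.  The
helper names restart_count and needs_cleanup are hypothetical, and the
cleanup itself is left as a placeholder.  Since $RESTARTED is reported
in this thread as 0, 1 or 2 rather than a true counter, the sketch only
distinguishes "0" from "anything else".

```python
import os


def restart_count(env=os.environ):
    """Return $RESTARTED as an int, or 0 if it is unset or malformed."""
    try:
        return int(env.get("RESTARTED", "0"))
    except ValueError:
        return 0


def needs_cleanup(env=os.environ):
    # Only a restarted job can have stale output or an old copy still running.
    return restart_count(env) > 0


if needs_cleanup():
    # Placeholder for the sleep / output-file clearing discussed here.
    print("restarted job: running cleanup first")
```

Passing the environment in as an argument keeps the check easy to try
outside of SGE, e.g. needs_cleanup({"RESTARTED": "1"}).
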
> > 
> > BTW, I am not getting the behaviour you are getting.
> > SGE always waits for the rescheduled jobs. Can you
> > post  a sample job script?
> 
> I can reproduce the behavior on 6.0u1 on lx24_amd64 and 5.3p6 on x86. I
> checked the clocks on the master and slaves, and in both cases it took
> around 30 seconds until the old job was really killed.
> 
> > --- Dan Gruhn <Dan.Gruhn at Group-W-Inc.com> wrote:
> > > Hi Reuti,
> > > 
> > > Yes, delaying my script for a minute would be a work
> > > around for now. 
> > > However, I am trying to squeeze as much out of my
> > > machines as I can and
> > > I am thinking that SGE's behavior in this case is
> > > wrong.  It should not
> > > be running the same job at the same time on
> > > different CPUs under these
> > > or any other circumstances.
> > > 
> > > I think the proper sequence of events should be:
> > > 
> > > 1) Reschedule is requested
> > > 2) Job 1 gets the USR2 signal
> > > 3) After the notify time, job 1 exits
> > > 4) Job 2 is now scheduled to be run.
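
On the job side, steps 2 and 3 could look roughly like the sketch below,
again assuming a Python job submitted with -notify.  JobInterrupted,
run_job, work_steps and cleanup are all hypothetical names; nothing here
changes when SGE schedules the second copy, it only shows the first copy
exiting cleanly once USR2 arrives.

```python
import signal


class JobInterrupted(Exception):
    """Raised in the main flow when USR2 warns that a kill is coming."""


def _on_usr2(signum, frame):
    raise JobInterrupted()


def run_job(work_steps, cleanup):
    """Run work_steps in order; on USR2, call cleanup and stop early.

    work_steps is a list of callables and cleanup is a callable, both
    supplied by the job.  Returns True if every step finished, False if
    the job was interrupted by USR2.
    """
    signal.signal(signal.SIGUSR2, _on_usr2)
    try:
        for step in work_steps:
            step()
        return True
    except JobInterrupted:
        cleanup()  # has to finish within the queue's notify time
        return False
```

The job would exit right after run_job returns False, so that the old
copy is gone before the notify time runs out.
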
> > > 
> > > Does this seem right to you?
> 
> Yes, agreed. The interesting thing is that the job is immediately removed
> from the old node in the qstat output. By contrast, in the case of a qdel,
> you can sometimes see the job staying there for a few more seconds until it
> disappears.
> 
> Cheers - Reuti
> 