[GE users] Pb w/ suspending job ...

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Fri May 20 14:28:37 BST 2005


Chanh,

When the -notify option is used, your job gets sent the USR1 signal
when a suspend is about to occur.  If your job is not set up to catch
this signal, the default action is to kill the job.  There has been some
talk on this list about getting the SGE job manager to set USR1 and USR2
(the latter is sent before a kill is about to occur) to ignore, so that
a job could then change to catch them if it wanted to, but that has not
yet happened.

You could continue to use the -notify option and just have your jobs set
USR1 to ignore when they start up.  We use -notify around here to let a
job clean up when it is about to be rescheduled.
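
Something along these lines at the top of the job script should do it.
This is only a rough sketch, assuming a Bourne-style shell job script;
the cleanup function and the survived/cleaned_up variables are made-up
names for illustration, and the self-sent signal is there only to show
that the shell survives USR1 once it is ignored.

```shell
#!/bin/sh
# Hypothetical job script, assumed to be submitted with: qsub -notify job.sh

# Ignore USR1 so the warning sent before a suspend does not kill the job.
trap '' USR1

# Optionally catch USR2 (the warning sent before a kill) to tidy up.
cleanup() {
    # Placeholder: replace with whatever the job needs to clean up.
    cleaned_up=yes
}
trap cleanup USR2

# Demonstration only: send ourselves USR1 to show it is now ignored.
kill -s USR1 "$$"
survived=yes
echo "still running after USR1"

# ... the real work of the job would go here ...
```

Ignored signals stay ignored in any commands the script starts, whereas
a trap that runs a handler is reset to the default in child processes,
so a program launched by the script would still need its own USR2
handling if it is to survive that signal.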

Dan

On Fri, 2005-05-20 at 09:20, TRAN Chanh wrote:

> Reuti wrote:
> 
> > Were these just plain serial jobs? There is indeed the possibility to 
> > change the suspend/resume method, but the built-in:
> 
> Currently, all the jobs are plain serial ones.
> BTW, I just discovered that I had the '-notify' option in my 'qsub', and
> by eliminating it my 'suspend' problem is gone.
> I must say I'm happy with this, but I'd still be interested in an
> explanation of why it had this effect ...
> 
> Actually, the next step for me is to suspend 'multi-proc' jobs &
> 'multi-node' jobs & hope everything will work out fine.
> 
> Thanks again,
> Chanh
> 
> >
> > kill -STOP -- -<pid>
> >
> > should stop the whole process group. Did you define any procedures on 
> > your own? Are some forks/threads of your application jumping out of 
> > the process group? - Reuti
> >
> Otherwise, I don't have any specific procedure of my own ...
> 
> >
> > TRAN Chanh wrote:
> >
> >> Hi Reuti,
> >>
> >> Actually, I tried both of these:
> >> 1. via 'qmon->jobs->suspend ....'
> >> 2. qmod -s job_id
> >>
> >> Both give the same result.
> >>
> >> Chanh
> >>
> >> Reuti wrote:
> >>
> >>> Chanh,
> >>>
> >>> which SGE commands did you use in detail to suspend and unsuspend 
> >>> your jobs? - Reuti
> >>>
> >>> TRAN Chanh wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I'm using SGE 5.3p6 and trying to suspend my running jobs and
> >>>> then put them 'back to work' via 'resume'.
> >>>> What happens is that these jobs, instead of being suspended as with
> >>>> 'kill -SIGSTOP', are all aborted as with 'kill -9'.
> >>>> Is there any way to change this behavior?
> >>>>
> >>>> Thanks a lot for any help,
> >>>> Chanh
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>> For additional commands, e-mail: users-help at gridengine.sunsource.net


