[GE users] Cleaning processes that are in different process group @ job end

Andy Schwierskott andy.schwierskott at sun.com
Thu Apr 15 11:09:55 BST 2004


Jeff,

> I'll work on a terminate_method based approach.
>
> Does SGE still send the USR2 signals before calling terminate_method?

Only if the job was submitted with the -notify option and the "notify" time
in the queue configuration is set to a non-zero value.
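
For a job submitted with "qsub -notify", the advance warning can be caught
inside the job itself. Below is a minimal Python sketch, assuming the wrapper
runs as part of the job; the names on_notify and pending are hypothetical, and
the actual cleanup is left to the job's main loop:

```python
import signal

# Names of the warning signals seen so far; the job's main loop can poll this.
pending = set()

def on_notify(signum, frame):
    """Record the warning signal; do the real cleanup outside the handler."""
    pending.add(signal.Signals(signum).name)

# With -notify, SIGUSR1 announces a pending suspend (SIGSTOP) and
# SIGUSR2 announces a pending kill (SIGKILL).
signal.signal(signal.SIGUSR1, on_notify)
signal.signal(signal.SIGUSR2, on_notify)
```

The handler only records the signal, so the time-consuming work (locating and
stopping helper processes) happens in normal program flow and must finish
within the queue's "notify" interval.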

And

> On Apr 13, 2004, at 11:45 PM, Joachim Gabler wrote:
>
> > Hi Jeff,
> >
> > the terminate_method should be suited for killing such jobs.
> > It is called by sge_shepherd instead of killing the job's processes
> > directly (via process group).
> >
> > suspend_method and resume_method provide the same mechanism for
> > suspending/resuming jobs.
> >
> >    Joachim
> >
> > Jeff Putsch wrote:
> >
> >> Howdy,
> >>
> >> We've got gridengine jobs that occasionally leave processes behind.
> >> The basic problem is our jobs start related processes that end up in
> >> a different process group and are not always cleaned up at the end of
> >> the job. I need to get these reliably terminated when a job is
> >> terminated.
> >>
> >> I'm running SGE 5.3 Beta 1. I know it's an old version and if the
> >> behavior described below is fixed in newer versions please let me
> >> know. If not, any suggestions for solving my problem will be greatly
> >> appreciated.
> >>
> >> Specifically, we're having problems with the way Cadence's ocean
> >> scripting/simulator control environment launches the simulators (e.g.
> >> spectre).
> >>
> >> Here, in a nutshell, is what happens:
> >>
> >>   sge_shepherd (pid A)
> >>    job_script  (pid B, ppid A, pgid G)
> >>      ocean (pid C, ppid B, pgid G)
> >>
> >>   cdsServIpc (pid D, ppid 1, pgid G)
> >>    spectre (pid E, ppid D, pgid T)
> >>
> >> Basically, the job_script starts "ocean". Ocean starts an IPC daemon,
> >> "cdsServIpc", and then requests that the simulator, "spectre", be
> >> launched. The cdsServIpc process launches the simulator. I've left out
> >> a few intermediate shells that get launched in the process, but the
> >> relationship is maintained.
> >>
> >> As illustrated above, most of the processes are in the same
> >> process group "G". Unfortunately, the simulators started via
> >> "cdsServIpc" end up in a different process group "T". This process
> >> group is not known to GridEngine. When a job is terminated, or
> >> suspended (e.g. via "qdel"), the simulator (spectre) is not reliably
> >> terminated or suspended.
> >>
> >> I've tried replacing "ocean" with a script that catches the USR1 and
> >> USR2 signals, then submitting jobs with "qsub -notify". My script
> >> tries to kill the extra spectre jobs, BUT (and it's a big BUT) all of
> >> the other processes in process group "G" are already gone by the time
> >> my "ocean" script receives the USR2 signal. Because of this, I have
> >> no way to locate "spectre" processes that need to be cleaned up (they
> >> are indeed still running).
> >>
> >> Basically, it seems like GridEngine is killing all the members of the
> >> process group "G" except the "ocean" script BEFORE it sends the USR2
> >> signal to the "ocean" script. That behavior seems backwards.
> >>
> >> I would like to solve this problem. I'm not sure the
> >> "terminate_method" parameter provides me a solution. If it is called
> >> before (or instead of) the general killing of members of the process
> >> group "G", then it may give me a solution. If not, then
> >> I still have a problem.
> >>
> >> Any insight, feedback, or assistance is greatly appreciated.
> >>
> >> Thanks, in advance,
> >>
> >> Jeff.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
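
For the terminate_method approach, a cleanup script has to find the processes
that left the job's process group. Below is a minimal Python sketch, assuming
the script is handed the job's process group id as its only argument (how the
queue configuration passes it is not shown here and is an assumption). The
function names are hypothetical, and the process table is read with ps(1):

```python
import os
import signal
import subprocess
import sys

def read_process_table():
    """Return {pid: (ppid, pgid)} for all processes, parsed from ps(1)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "pgid="],
        check=True, capture_output=True, text=True).stdout
    table = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 3:
            pid, ppid, pgid = map(int, fields)
            table[pid] = (ppid, pgid)
    return table

def collect_targets(table, job_pgid):
    """Pids in the job's process group plus all of their descendants."""
    children = {}
    for pid, (ppid, _pgid) in table.items():
        children.setdefault(ppid, []).append(pid)
    targets = {pid for pid, (_ppid, pgid) in table.items() if pgid == job_pgid}
    stack = list(targets)
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in targets:
                targets.add(child)
                stack.append(child)
    return targets

def terminate(job_pgid, sig=signal.SIGTERM):
    """Signal every collected process, skipping this script itself."""
    for pid in sorted(collect_targets(read_process_table(), job_pgid)):
        if pid == os.getpid():
            continue
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited after the table was read

if __name__ == "__main__" and len(sys.argv) == 2:
    terminate(int(sys.argv[1]))
```

In the layout described in the quoted message, cdsServIpc is still in group G,
so spectre is found as its child even though it sits in group T. The sketch
works from a single snapshot of the process table and misses a stray process
whose ancestors in group G have already exited.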


Regards,
Andy Schwierskott

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel:     +49 941 3075-200  (x60200)
Sun Grid Engine Engineering Support: +49 941 3075-250  (x60250)
Sun Microsystems GmbH       Fax:     +49 941 3075-222  (x60222)
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
