[GE users] Cleaning processes that are in different process group @ job end
putsch at mxim.com
Tue Apr 13 21:50:37 BST 2004
We've got gridengine jobs that occasionally leave processes behind. The
basic problem is our jobs start related processes that end up in a
different process group and are not always cleaned up at the end of the
job. I need to get these reliably terminated when a job is terminated.
I'm running SGE 5.3 Beta 1. I know it's an old version and if the
behavior described below is fixed in newer versions please let me know.
If not, any suggestions for solving my problem will be greatly appreciated.
Specifically, we're having problems with the way Cadence's ocean
scripting/simulator control environment launches the simulators (e.g.
Here, in a nutshell is what happens:
sge_shepherd (pid A)
job_script (pid B, ppid A, pgid G)
ocean (pid C, ppid B, pgid G)
cdsServIpc (pid D, ppid 1, pgid G)
spectre (pid E, ppid D, pgid T)
Basically, the job_script starts "ocean". Ocean starts an IPC daemon,
"cdsServIpc". Ocean then requests that the simulator, "spectre", be launched.
The cdsServIpc process launches the simulator. I've left out a few
intermediate shells that get launched in the process, but the
relationship is maintained.
As illustrated above, most of the processes are in the same process
group "G". Unfortunately, the simulators started via "cdsServIpc" end
up in a different process group "T". This process group is not known to
GridEngine. When a job is terminated, or suspended (e.g. via "qdel"),
the simulator (spectre) is not reliably terminated or suspended.
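The way a child can end up outside its parent's process group can be reproduced in a few lines. The following is only a sketch: it uses Python's `start_new_session=True` (which makes the child call `setsid()`) as a stand-in for whatever cdsServIpc actually does when it launches spectre, which is not known here. The function name `spawn_in_new_group` is made up for illustration.

```python
import os
import subprocess
import sys


def spawn_in_new_group():
    """Start a sleeping child in its own session and process group.

    Returns (parent_pgid, child_pgid) and cleans the child up afterwards.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,  # child calls setsid() before exec
    )
    try:
        # The child now leads a new group, so a signal sent to the
        # parent's process group would never reach it.
        return os.getpgrp(), os.getpgid(child.pid)
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    parent_pgid, child_pgid = spawn_in_new_group()
    print("parent pgid:", parent_pgid, "child pgid:", child_pgid)
```

Anything that signals only the parent's group, as a group-wide kill of "G" does, will miss such a child.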
I've tried replacing "ocean" with a script that catches the USR1 and
USR2 signals, then submitting jobs with "qsub -notify". My script tries
to kill the extra spectre jobs, BUT (and it's a big BUT) all of the
other processes in process group "G" are already gone by the time my
"ocean" script receives the USR2 signal. Because of this, I have no way
to locate "spectre" processes that need to be cleaned up (they are
indeed still running).
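To make the timing problem concrete, here is a small sketch of how the stray processes could be located from a process-table snapshot: start with everything in process group "G", then keep adding children of anything already found. The `Proc` tuple and `job_processes` function are hypothetical names, and the snapshot uses the symbolic pids from the listing above rather than real `ps` output.

```python
from collections import namedtuple

# One row of a process-table snapshot (hypothetical structure).
Proc = namedtuple("Proc", "pid ppid pgid")


def job_processes(snapshot, job_pgid):
    """Return pids in job_pgid plus all their descendants, from one snapshot."""
    found = {p.pid for p in snapshot if p.pgid == job_pgid}
    changed = True
    while changed:
        changed = False
        for p in snapshot:
            if p.ppid in found and p.pid not in found:
                found.add(p.pid)
                changed = True
    return found


# Snapshot taken while the job is still intact.
snapshot = [
    Proc("B", "A", "G"),  # job_script
    Proc("C", "B", "G"),  # ocean
    Proc("D", 1, "G"),    # cdsServIpc, reparented to init
    Proc("E", "D", "T"),  # spectre, different process group
]
print(sorted(job_processes(snapshot, "G")))  # ['B', 'C', 'D', 'E']

# Snapshot taken after the members of group G are already gone.
late = [Proc("E", "D", "T")]
print(sorted(job_processes(late, "G")))  # []
```

The second call shows the situation described above: once cdsServIpc has been killed, nothing in the snapshot ties spectre back to the job, so the search has to run (or its result has to be saved) before the group is torn down.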
Basically, it seems like GridEngine is killing all the members of the
process group "G" except the "ocean" script BEFORE it sends the USR2
signal to the "ocean" script. That behavior seems backwards.
I would like to solve this problem. I'm not sure the "terminate_method"
parameter provides me a solution. If it is called before (or instead
of) the general killing of members of the process group "G", then it
may give me a solution. If not, then I still have a problem.
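If a custom termination script does get control early enough, its core could look something like the sketch below: send SIGTERM to every pid that was collected, wait a little, then SIGKILL whatever is still alive. This is an illustration only. The name `terminate_pids` is hypothetical, and how the list of pids reaches the script, and whether GridEngine runs such a script before killing process group "G", are exactly the open questions and should be checked against the queue configuration documentation.

```python
import os
import signal
import time


def terminate_pids(pids, grace_seconds=5.0, kill=os.kill, sleep=time.sleep):
    """Send SIGTERM to each pid, wait, then SIGKILL whatever is still alive.

    Returns the list of pids that needed SIGKILL. The kill and sleep
    parameters exist only so the logic can be exercised without real
    processes.
    """
    def send(pid, sig):
        try:
            kill(pid, sig)
            return True
        except ProcessLookupError:
            return False  # process is already gone

    alive = [pid for pid in pids if send(pid, signal.SIGTERM)]
    if alive:
        sleep(grace_seconds)
    # Signal 0 delivers nothing; it only checks that the pid still exists.
    stubborn = [pid for pid in alive if send(pid, 0)]
    for pid in stubborn:
        send(pid, signal.SIGKILL)
    return stubborn
```

The sketch does not handle permission errors, and it assumes pids are not reused during the grace period.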
Any insight, feedback, or assistance is greatly appreciated.
Thanks, in advance,