[GE users] Cleaning processes that are in different process group @ job end

Jeff Putsch putsch at mxim.com
Tue Apr 13 21:50:37 BST 2004


We've got gridengine jobs that occasionally leave processes behind. The 
basic problem is our jobs start related processes that end up in a 
different process group and are not always cleaned up at the end of the 
job. I need to get these reliably terminated when a job is terminated.

I'm running SGE 5.3 Beta 1. I know it's an old version, and if the 
behavior described below is fixed in newer versions, please let me know. 
If not, any suggestions for solving my problem will be greatly 
appreciated.

Specifically, we're having problems with the way Cadence's ocean 
scripting/simulator control environment launches the simulators (e.g. 

Here, in a nutshell, is what happens:

   sge_shepherd (pid A)
    job_script  (pid B, ppid A, pgid G)
      ocean (pid C, ppid B, pgid G)

   cdsServIpc (pid D, ppid 1, pgid G)
    spectre (pid E, ppid D, pgid T)

Basically, the job_script starts "ocean". Ocean starts an IPC daemon, 
"cdsServIpc". Ocean then requests that the simulator, "spectre", be 
launched, and the cdsServIpc process launches it. I've left out a few 
intermediate shells that get launched along the way, but the 
parent/child and process group relationships are as shown.

As illustrated above, most of the processes are in the same process 
group "G". Unfortunately, the simulators started via "cdsServIpc" end 
up in a different process group "T". This process group is not known to 
GridEngine. When a job is terminated (e.g. via "qdel") or suspended, 
the simulator (spectre) is not reliably terminated or suspended.
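
To make the process group split concrete, here is a minimal Python sketch of a parent starting a child that ends up leading its own process group. It is only an illustration of the general POSIX mechanism; how cdsServIpc actually starts spectre is not known, and the helper name spawn_in_new_group is made up for this example.

```python
import os
import subprocess
import sys


def spawn_in_new_group():
    """Start a short-lived child that leads its own session and process group."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        # The child calls setsid() before exec, so its pgid differs from ours.
        start_new_session=True,
    )


if __name__ == "__main__":
    child = spawn_in_new_group()
    print("parent pgid:", os.getpgid(0))
    print("child  pgid:", os.getpgid(child.pid))
    # A signal sent only to the parent's process group would not reach the child.
    child.kill()
    child.wait()
```

A signal delivered to the parent's process group never reaches such a child, which matches the symptom described above.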

I've tried replacing "ocean" with a script that catches the USR1 and 
USR2 signals, then submitting jobs with "qsub -notify". My script tries 
to kill the extra spectre jobs, BUT (and it's a big BUT) all of the 
other processes in process group "G" are already gone by the time my 
"ocean" script receives the USR2 signal. Because of this, I have no way 
to locate "spectre" processes that need to be cleaned up (they are 
indeed still running).
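
One possible workaround, sketched here under assumptions, is to record the stray process groups while the members of group "G" are still alive, instead of searching for them after the signal arrives. The Python below walks a process table of (pid, ppid, pgid) rows and returns every process group, other than the job's own, that holds a descendant of a process in the job's group. The function names are hypothetical, and the "ps" invocation assumes a ps that accepts the POSIX "-o pid= -o ppid= -o pgid=" form.

```python
import subprocess


def read_process_table():
    """Return [(pid, ppid, pgid), ...] parsed from ps (assumes POSIX-style -o)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "pgid="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        tuple(int(field) for field in line.split())
        for line in out.splitlines()
        if line.strip()
    ]


def stray_groups(table, job_pgid):
    """Process groups other than job_pgid that hold descendants of its members."""
    # Start from every process that is in the job's own process group.
    known = {pid for pid, _ppid, pgid in table if pgid == job_pgid}
    strays = set()
    changed = True
    while changed:  # repeat until no new descendants are found
        changed = False
        for pid, ppid, pgid in table:
            if ppid in known and pid not in known:
                known.add(pid)
                changed = True
                if pgid != job_pgid:
                    strays.add(pgid)
    return strays


if __name__ == "__main__":
    # The tree from this message, with made-up numeric ids (G=101, T=104).
    table = [
        (100, 50, 100),   # sge_shepherd (pgid here is an assumption)
        (101, 100, 101),  # job_script
        (102, 101, 101),  # ocean
        (103, 1, 101),    # cdsServIpc, reparented to init but still in G
        (104, 103, 104),  # spectre, in its own group T
    ]
    print(stray_groups(table, 101))
```

A wrapper script could call stray_groups(read_process_table(), job_pgid) periodically and save the result, so the list of extra groups is still available once USR2 is delivered.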

Basically, it seems like GridEngine is killing all the members of the 
process group "G" except the "ocean" script BEFORE it sends the USR2 
signal to the "ocean" script. That behavior seems backwards.

I would like to solve this problem. I'm not sure the "terminate_method" 
parameter provides me a solution. If it is called before (or instead 
of) the general killing of members of the process group "G", then it 
may give me a solution. If not, then I still have a problem.
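
For reference, this is roughly what a cleanup helper would need to do if it were given the stray process group ids. It is a sketch only: it says nothing about when GridEngine would invoke a "terminate_method" script, which is exactly the open question, and the names signal_groups and demo are invented for the example.

```python
import os
import signal
import subprocess
import sys


def signal_groups(pgids, sig=signal.SIGTERM):
    """Send sig to each process group; return the groups that no longer exist."""
    missing = []
    for pgid in pgids:
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            missing.append(pgid)
    return missing


def demo():
    """Start a child in its own process group, then terminate that group."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,  # stand-in for a simulator in group "T"
    )
    missing = signal_groups([os.getpgid(child.pid)])
    child.wait()
    return missing, child.returncode


if __name__ == "__main__":
    print(demo())
```

The same function could send SIGSTOP and SIGCONT to handle suspend and resume, assuming the group ids were recorded beforehand.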

Any insight, feedback, or assistance is greatly appreciated.

Thanks, in advance,

