[GE users] Cleaning processes that are in different process group @ job end

Joachim Gabler joga at sun.com
Wed Apr 14 07:45:46 BST 2004



Hi Jeff,

the terminate_method queue parameter should be suitable for killing such jobs.
When it is configured, sge_shepherd calls it instead of killing the job's
processes directly (via the process group).

suspend_method and resume_method provide the same mechanism for 
suspending/resuming jobs.
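
A minimal sketch of such a method script follows, in Python. It is an illustration under stated assumptions, not something shipped with Grid Engine. It assumes the queue configuration passes the job's pid to the script through the $job_pid pseudo variable (as described in queue_conf(5); please check this for your version), that the job's process group can be read from that pid, and that the stray simulators are still children of some process in that group, as spectre is a child of cdsServIpc in the listing below. The script name, the path, and the function names are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical helper for terminate_method / suspend_method / resume_method.
# Not part of Grid Engine; names and behaviour are assumptions for illustration.
import os
import signal
import subprocess
import sys


def collect_job_pids(procs, job_pgid):
    """Return the pids in job_pgid plus every descendant of those pids.

    procs is an iterable of (pid, ppid, pgid) tuples.
    """
    procs = list(procs)

    # Map each parent pid to the pids of its children.
    children = {}
    for pid, ppid, _pgid in procs:
        children.setdefault(ppid, []).append(pid)

    # Start from the members of the job's process group, then walk down
    # the parent/child links so that children in other groups are found too.
    found = set()
    stack = [pid for pid, _ppid, pgid in procs if pgid == job_pgid]
    while stack:
        pid = stack.pop()
        if pid in found:
            continue
        found.add(pid)
        stack.extend(children.get(pid, []))
    return found


def read_process_table():
    """Read (pid, ppid, pgid) for all processes using POSIX ps options."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "pgid="],
        check=True, capture_output=True, text=True,
    ).stdout
    table = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 3:
            table.append(tuple(int(f) for f in fields))
    return table


def signal_job(job_pid, sig):
    """Send sig to the job's process group members and their descendants."""
    try:
        job_pgid = os.getpgid(job_pid)
    except ProcessLookupError:
        # Assumption: the job script is its group leader, so pgid == pid.
        job_pgid = job_pid
    targets = collect_job_pids(read_process_table(), job_pgid)
    targets.discard(os.getpid())  # never signal this helper itself
    for pid in sorted(targets):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited in the meantime


# Usage: kill_job_tree.py <job_pid> [SIGNAL_NAME]   (default SIGKILL)
if __name__ == "__main__" and len(sys.argv) >= 2:
    name = sys.argv[2] if len(sys.argv) > 2 else "SIGKILL"
    signal_job(int(sys.argv[1]), signal.Signals[name])
```

Assuming the script is installed at a hypothetical path, the queue configuration entries might look like this:

    terminate_method  /path/to/kill_job_tree.py $job_pid SIGKILL
    suspend_method    /path/to/kill_job_tree.py $job_pid SIGSTOP
    resume_method     /path/to/kill_job_tree.py $job_pid SIGCONT

A process that has both left the job's process group and been reparented to init would not be found this way.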

    Joachim

Jeff Putsch wrote:

> Howdy,
>
> We've got gridengine jobs that occasionally leave processes behind. 
> The basic problem is our jobs start related processes that end up in a 
> different process group and are not always cleaned up at the end of 
> the job. I need to get these reliably terminated when a job is 
> terminated.
>
> I'm running SGE 5.3 Beta 1. I know it's an old version and if the 
> behavior described below is fixed in newer versions please let me 
> know. If not, any suggestions for solving my problem will be greatly 
> appreciated.
>
> Specifically, we're having problems with the way Cadence's ocean 
> scripting/simulator control environment launches the simulators (e.g. 
> spectre).
>
> Here, in a nutshell, is what happens:
>
>   sge_shepherd (pid A)
>    job_script  (pid B, ppid A, pgid G)
>      ocean (pid C, ppid B, pgid G)
>
>   cdsServIpc (pid D, ppid 1, pgid G)
>    spectre (pid E, ppid D, pgid T)
>
> Basically, the job_script starts "ocean". Ocean starts an IPC daemon, 
> "cdsServIpc". Ocean then requests that the simulator, "spectre", be 
> launched, and the cdsServIpc process launches it. I've left out a few 
> intermediate shells that get launched in the process, but the 
> relationship is maintained.
>
> As illustrated above, most of the processes are in the same 
> process group "G". Unfortunately, the simulators started via 
> "cdsServIpc" end up in a different process group "T". This process 
> group is not known to GridEngine. When a job is terminated (e.g. via 
> "qdel") or suspended, the simulator (spectre) is not reliably 
> terminated or suspended.
>
> I've tried replacing "ocean" with a script that catches the USR1 and 
> USR2 signals, then submitting jobs with "qsub -notify". My script 
> tries to kill the extra spectre jobs, BUT (and it's a big BUT) all of 
> the other processes in process group "G" are already gone by the time 
> my "ocean" script receives the USR2 signal. Because of this, I have no 
> way to locate "spectre" processes that need to be cleaned up (they are 
> indeed still running).
>
> Basically, it seems like GridEngine is killing all the members of the 
> process group "G" except the "ocean" script BEFORE it sends the USR2 
> signal to the "ocean" script. That behavior seems backwards.
>
> I would like to solve this problem. I'm not sure the 
> "terminate_method" parameter provides me a solution. If it is called 
> before (or instead of) the general killing of members of the process 
> group "G", then it may give me a solution. If not, then
> I still have a problem.
>
> Any insight, feedback, or assistance is greatly appreciated.
>
> Thanks, in advance,
>
> Jeff.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>





