[GE users] Cleaning processes that are in different process group @ job end

Jeff Putsch putsch at mxim.com
Wed Apr 14 15:30:34 BST 2004


Thanks,

I'll work on a terminate_method based approach.

Does SGE still send the USR2 signals before calling terminate_method?

Jeff.
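
For illustration only, here is a minimal Python sketch of the kind of
helper a terminate_method script could use. It assumes the script is told
the job's process group id (for example via the $job_pid pseudo-argument
described in queue_conf(5), assuming the job script is the group leader)
and that a process table of (pid, ppid, pgid) tuples is available, e.g.
parsed from "ps -e -o pid=,ppid=,pgid=". The names collect_targets,
terminate, and example_table, and the numeric pids, are hypothetical. The
idea is to start from every member of group "G" and follow parent/child
links, so that "spectre" in group "T" is still found through its parent
"cdsServIpc". This only works while the parent is alive, which should be
the case if terminate_method runs instead of the process-group kill.

```python
import os
import signal
from collections import defaultdict


def collect_targets(procs, job_pgid):
    """Return the pids in job_pgid plus all of their descendants.

    procs is an iterable of (pid, ppid, pgid) tuples.
    """
    # Map each parent pid to the list of its direct children.
    children = defaultdict(list)
    for pid, ppid, _pgid in procs:
        children[ppid].append(pid)

    # Seed with every member of the job's process group.
    targets = {pid for pid, _ppid, pgid in procs if pgid == job_pgid}

    # Walk down the tree so children in other process groups are included.
    stack = list(targets)
    while stack:
        for child in children[stack.pop()]:
            if child not in targets:
                targets.add(child)
                stack.append(child)
    return targets


def terminate(pids, sig=signal.SIGTERM):
    """Send sig to each pid, ignoring processes that have already exited."""
    for pid in sorted(pids):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass


# Hypothetical table mirroring the layout in the original report:
# A=100 (shepherd), B=200 (job_script), C=300 (ocean),
# D=400 (cdsServIpc), E=500 (spectre); group G=200, group T=500.
example_table = [
    (100, 1, 100),
    (200, 100, 200),
    (300, 200, 200),
    (400, 1, 200),
    (500, 400, 500),
]
```

With the example table, collect_targets(example_table, 200) selects B, C,
D, and E but not the shepherd. A real script would then call terminate()
on that set, possibly following up with SIGKILL after a grace period.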

On Apr 13, 2004, at 11:45 PM, Joachim Gabler wrote:

> Hi Jeff,
>
> the terminate_method should be suited for killing such jobs.
> It is called by sge_shepherd instead of killing the job's processes 
> directly (via process group).
>
> suspend_method and resume_method provide the same mechanism for 
> suspending/resuming jobs.
>
>    Joachim
>
> Jeff Putsch wrote:
>
>> Howdy,
>>
>> We've got gridengine jobs that occasionally leave processes behind. 
>> The basic problem is our jobs start related processes that end up in 
>> a different process group and are not always cleaned up at the end of 
>> the job. I need to get these reliably terminated when a job is 
>> terminated.
>>
>> I'm running SGE 5.3 Beta 1. I know it's an old version and if the 
>> behavior described below is fixed in newer versions please let me 
>> know. If not, any suggestions for solving my problem will be greatly 
>> appreciated.
>>
>> Specifically, we're having problems with the way Cadence's ocean 
>> scripting/simulator control environment launches the simulators (e.g. 
>> spectre).
>>
>> Here, in a nutshell, is what happens:
>>
>>   sge_shepherd (pid A)
>>    job_script  (pid B, ppid A, pgid G)
>>      ocean (pid C, ppid B, pgid G)
>>
>>   cdsServIpc (pid D, ppid 1, pgid G)
>>    spectre (pid E, ppid D, pgid T)
>>
>> Basically, the job_script starts "ocean". Ocean starts an IPC daemon, 
>> "cdsServIpc". Ocean then requests that the simulator, "spectre", be 
>> launched, and the cdsServIpc process launches it. I've left out a few 
>> intermediate shells that get launched in the process, but the 
>> relationship is maintained.
>>
>> As illustrated above, most of the processes are in the same 
>> process group "G". Unfortunately, the simulators started via 
>> "cdsServIpc" end up in a different process group "T". This process 
>> group is not known to GridEngine. When a job is terminated (e.g. via 
>> "qdel") or suspended, the simulator (spectre) is not reliably 
>> terminated or suspended.
>>
>> I've tried replacing "ocean" with a script that catches the USR1 and 
>> USR2 signals, then submitting jobs with "qsub -notify". My script 
>> tries to kill the extra spectre jobs, BUT (and it's a big BUT) all of 
>> the other processes in process group "G" are already gone by the time 
>> my "ocean" script receives the USR2 signal. Because of this, I have 
>> no way to locate "spectre" processes that need to be cleaned up (they 
>> are indeed still running).
>>
>> Basically, it seems like GridEngine is killing all the members of the 
>> process group "G" except the "ocean" script BEFORE it sends the USR2 
>> signal to the "ocean" script. That behavior seems backwards.
>>
>> I would like to solve this problem. I'm not sure the 
>> "terminate_method" parameter provides me a solution. If it is called 
>> before (or instead of) the general killing of members of the process 
>> group "G", then it may give me a solution. If not, then I still have 
>> a problem.
>>
>> Any insight, feedback, or assistance is greatly appreciated.
>>
>> Thanks, in advance,
>>
>> Jeff.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>