[GE users] Cleaning processes that are in different process group @ job end

Jeff Putsch putsch at mxim.com
Fri Apr 16 15:31:29 BST 2004


Thanks for the help and feedback.

The terminate_method approach is working for me (along with a suspend_method
and resume_method, for similar reasons).
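
For anyone reading this in the archive, here is a minimal sketch of the kind of
helper a terminate_method could call. It is not necessarily the script used
here. It assumes the queue config passes the job's pid via the $job_pid
pseudo-variable (check queue_conf(5) for your version), e.g.
"terminate_method /hypothetical/path/kill_job_tree.py $job_pid", and that
"ps" accepts POSIX-style -o fields. The script name, path, and function names
are made up for illustration. The idea: take every process in the job's
process group "G", then add all of their descendants, so that a simulator
which moved to another process group "T" is still found through its parent.

```python
#!/usr/bin/env python3
"""Hypothetical terminate_method helper (illustrative sketch only)."""
import os
import signal
import subprocess
import sys


def collect_targets(table, job_pid):
    """Return pids in job_pid's process group plus all of their descendants.

    table is an iterable of (pid, ppid, pgid) tuples, one per process.
    """
    rows = list(table)
    pgid_of = {pid: pgid for pid, _ppid, pgid in rows}
    if job_pid not in pgid_of:
        return set()  # job process already gone; nothing to anchor on
    job_pgid = pgid_of[job_pid]

    # Map each parent pid to its direct children.
    children = {}
    for pid, ppid, _pgid in rows:
        children.setdefault(ppid, []).append(pid)

    # Start with every member of the job's process group ...
    targets = {pid for pid, _ppid, pgid in rows if pgid == job_pgid}
    # ... then walk down to descendants, whatever process group they are in.
    stack = list(targets)
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in targets:
                targets.add(child)
                stack.append(child)
    return targets


def read_process_table():
    """Snapshot (pid, ppid, pgid) for all processes using ps."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "pgid="],
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(int(f) for f in line.split())
            for line in out.splitlines() if line.strip()]


def main(argv):
    job_pid = int(argv[1])
    # Optional second argument: a signal name such as SIGSTOP or SIGCONT.
    sig = getattr(signal, argv[2]) if len(argv) > 2 else signal.SIGTERM
    for pid in sorted(collect_targets(read_process_table(), job_pid)):
        if pid == os.getpid():
            continue  # never signal ourselves
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            pass  # process exited meanwhile, or is not ours to signal


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

The same script could serve as a suspend_method or resume_method by passing
SIGSTOP or SIGCONT as the second argument. Note that it works on a single
snapshot of the process table, so a process forked after the snapshot would
be missed; running it twice, or following SIGTERM with SIGKILL, is left out
of the sketch.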

Jeff.

On Apr 15, 2004, at 3:09 AM, Andy Schwierskott wrote:

> Jeff,
>
>> I'll work on a terminate_method based approach.
>>
>> Does SGE still send the USR2 signals before calling terminate_method?
>
> Only if the job was submitted with the notify option and a "notify" time is
> set to a non-zero value in the queue config.
>
> And
>
>> On Apr 13, 2004, at 11:45 PM, Joachim Gabler wrote:
>>
>>> Hi Jeff,
>>>
>>> the terminate_method should be suited for killing such jobs.
>>> It is called by sge_shepherd instead of killing the job's processes
>>> directly (via process group).
>>>
>>> suspend_method and resume_method provide the same mechanism for
>>> suspending/resuming jobs.
>>>
>>>    Joachim
>>>
>>> Jeff Putsch wrote:
>>>
>>>> Howdy,
>>>>
>>>> We've got gridengine jobs that occasionally leave processes behind.
>>>> The basic problem is our jobs start related processes that end up in
>>>> a different process group and are not always cleaned up at the end of
>>>> the job. I need to get these reliably terminated when a job is
>>>> terminated.
>>>>
>>>> I'm running SGE 5.3 Beta 1. I know it's an old version and if the
>>>> behavior described below is fixed in newer versions please let me
>>>> know. If not, any suggestions for solving my problem will be greatly
>>>> appreciated.
>>>>
>>>> Specifically, we're having problems with the way Cadence's ocean
>>>> scripting/simulator control environment launches the simulators (e.g.
>>>> spectre).
>>>>
>>>> Here, in a nutshell, is what happens:
>>>>
>>>>   sge_shepherd (pid A)
>>>>    job_script  (pid B, ppid A, pgid G)
>>>>      ocean (pid C, ppid B, pgid G)
>>>>
>>>>   cdsServIpc (pid D, ppid 1, pgid G)
>>>>    spectre (pid E, ppid D, pgid T)
>>>>
>>>> Basically, the job_script starts "ocean". Ocean starts an IPC daemon,
>>>> "cdsServIpc". Ocean requests that the simulator, "spectre", be launched.
>>>> The cdsServIpc process launches the simulator. I've left out a few
>>>> intermediate shells that get launched in the process, but the
>>>> relationship is maintained.
>>>>
>>>> As illustrated above, most of the processes are in the same
>>>> process group "G". Unfortunately, the simulators started via
>>>> "cdsServIpc" end up in a different process group "T". This process
>>>> group is not known to GridEngine. When a job is terminated, or
>>>> suspended (e.g. via "qdel"), the simulator (spectre) is not reliably
>>>> terminated or suspended.
>>>>
>>>> I've tried replacing "ocean" with a script that catches the USR1 and
>>>> USR2 signals, then submitting jobs with "qsub -notify". My script
>>>> tries to kill the extra spectre jobs, BUT (and it's a big BUT) all of
>>>> the other processes in process group "G" are already gone by the time
>>>> my "ocean" script receives the USR2 signal. Because of this, I have
>>>> no way to locate "spectre" processes that need to be cleaned up (they
>>>> are indeed still running).
>>>>
>>>> Basically, it seems like GridEngine is killing all the members of the
>>>> process group "G" except the "ocean" script BEFORE it sends the USR2
>>>> signal to the "ocean" script. That behavior seems backwards.
>>>>
>>>> I would like to solve this problem. I'm not sure the
>>>> "terminate_method" parameter provides me a solution. If it is called
>>>> before (or instead of) the general killing of members of the process
>>>> group "G", then it may give me a solution. If not, then
>>>> I still have a problem.
>>>>
>>>> Any insight, feedback, or assistance is greatly appreciated.
>>>>
>>>> Thanks, in advance,
>>>>
>>>> Jeff.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>
> Regards,
> Andy Schwierskott
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Andy Schwierskott           Tel:     +49 941 3075-200  (x60200)
> Sun Grid Engine Engineering Support: +49 941 3075-250  (x60250)
> Sun Microsystems GmbH       Fax:     +49 941 3075-222  (x60222)
> Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
> D-93049 Regensburg          http://www.sun.com/gridware
>

