Opened 10 years ago

Last modified 4 years ago

#674 new defect

IZ3035: ENABLE_ADDGRP_KILL does not work for qrsh with command

Reported by: joga Owned by:
Priority: low Milestone:
Component: sge Version: 6.0u7
Severity: minor Keywords: execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3035]

Issue #: 3035
Summary: ENABLE_ADDGRP_KILL does not work for qrsh with command
Component: gridengine
Subcomponent: execution
Version: 6.0u7
Platform: All
OS: All
Status: NEW
Resolution:
Priority: P4
Issue type: DEFECT
Target milestone: ---
Reporter: joga (joga)
Assigned to: joga (joga)
QA Contact: pollinger
CC: None defined
URL:
Status whiteboard:
Attachments:
Issue 3035 blocks:
Votes for issue 3035:


   Opened: Tue May 19 01:56:00 -0700 2009 
------------------------


Setting ENABLE_ADDGRP_KILL in the execd_params should kill child processes of a job by additional group id, to catch processes that leave the job's process group.

This works fine for qsub, qlogin, qrsh without command,
but not for qrsh with command.

   ------- Additional comments from joga Tue May 19 01:57:15 -0700 2009 -------
Evaluation

Can be reproduced with the following script:
#!/bin/sh
#$ -S /bin/sh

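# Require exactly one argument: the sleep time in seconds.
if [ $# -ne 1 ]; then
   echo "usage: $0 <sleep time>" >&2
   exit 1
fi

SLEEP=$1

# Run the sleep in its own process group so that it leaves the job's process group.
setpgrp /bin/sh -c "id -a ; /usr/bin/sleep $SLEEP" &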
wait
exit 0

The cause is this condition in daemons/shepherd/shepherd.c:
        if (first_kill == 0 || sig != SIGKILL || is_qrsh == false) {

Killing by additional group id is explicitly disabled for qrsh jobs on the first kill.
A second kill is never attempted when the job script exits within a reasonable time after SIGKILL, which is the expected case.

The problem (and probably the reason this has not been done) is that the qrsh_starter (and rshd, if the builtin rsh_daemon is not used) would be killed as well.
If the qrsh_starter is killed instead of exiting, the exit code of the command is lost.

Change History (2)

comment:1 Changed 4 years ago by markdixon

  • Severity set to minor

Confirmed that this is still an issue with ENABLE_ADDGRP_KILL in SoGE 8.1.1, and the relevant code still appears intact in 8.1.9.

The alternative method of controlling jobs with cgroups avoids this problem.

At our site, this mainly shows up as the slave tasks of tightly integrated parallel jobs not being cleaned up properly.

This won't be seen with openmpi jobs, presumably because I think its ranks self-destruct when the master processes are killed. It has been seen with mvapich2 2.1 / hydra launcher and some versions of intelmpi. I don't know whether they can be persuaded to self-destruct in this situation.

comment:2 Changed 4 years ago by dlove

SGE <sge-bugs@…> writes:

> This won't be seen with openmpi jobs, presumably because I think its ranks
> self-destruct when the master processes are killed.

For what it's worth, I think it does happen with openmpi here -- at
least with jobs which appear to be using openmpi.
