Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#1483 closed defect (fixed)

Prevent cgroup/cpuset code from killing shepherd at job end

Reported by: markdixon
Owned by: Mark Dixon <m.c.dixon@…>
Priority: normal
Milestone:
Component: sge
Version: 8.1.5
Severity: minor
Keywords:
Cc:

Description

Prevent cgroup/cpuset code from killing shepherd at job end

When the execd_params option USE_CGROUPS is enabled, the cgroup/cpuset
cleanup code checks for and kills processes related to the job. This
includes the shepherd, whose job cleanup signal handler is triggered as a
result. However, because the execd also kills the shepherd elsewhere, the job
cleanup code can end up being run twice as often as it should.

This has been seen to be a problem when the node running the job master
qrsh's back into itself. In that case, the most obvious symptoms are:

  • Messages of the following form in the execd logs:

10/14/2013 12:15:23| main|comp1|W|rogue process(es) found for task 1353.1
10/14/2013 12:15:23| main|comp1|E|shepherd of job 1353.1 died through signal = 9
10/14/2013 12:15:23| main|comp1|E|abnormal termination of shepherd for job 1353.1: "exit_status" file is empty
10/14/2013 12:15:23| main|comp1|E|can't open usage file "active_jobs/1353.1/usage" for job 1353.1: No such file or directory
10/14/2013 12:15:23| main|comp1|E|shepherd exited with exit status 19: before writing exit_status

  • A job failure email sent to adminmail
  • The job start_time / end_time entries in the accounting file are 0 (interpreted as -/- in qacct)
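To check whether a node is affected, the execd messages file can be scanned for the "died through signal = 9" entry shown above. The following is only a rough sketch: the field layout (date, time, thread, host, level, message separated by `|`) is inferred from the sample lines in this ticket, and the function name is made up for illustration.

```python
import re

# Assumed layout, inferred from the sample lines in this ticket:
#   date time| thread|host|level|message
LOG_LINE = re.compile(
    r"^(?P<date>\S+) (?P<time>[^|\s]+)\|\s*(?P<thread>[^|]*)\|"
    r"(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<msg>.*)$"
)

# Message text copied from the execd log excerpt above.
SHEPHERD_KILLED = re.compile(r"shepherd of job (\S+) died through signal = 9")


def jobs_with_killed_shepherd(lines):
    """Return job IDs, in order of first appearance, whose shepherd died by signal 9."""
    seen = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m or m.group("level") != "E":
            continue  # not a log line in the assumed layout, or not an error entry
        hit = SHEPHERD_KILLED.search(m.group("msg"))
        if hit and hit.group(1) not in seen:
            seen.append(hit.group(1))
    return seen
```

Passing the lines of the execd messages file to `jobs_with_killed_shepherd` gives a list of candidate jobs to cross-check against the accounting file.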

Suggested patch to skip the shepherd is attached.

All the best,

Mark
--


Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


0001-Prevent-cgroup-cpuset-code-from-killing-shepherd-at-.patch

Attachments (3)

Change History (9)

Changed 6 years ago by markdixon

Added by email2trac

comment:1 Changed 6 years ago by Mark Dixon <m.c.dixon@…>

  • Owner set to Mark Dixon <m.c.dixon@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4651/sge:

Fix #1483: Prevent cgroup/cpuset code from killing shepherd at job end

comment:2 Changed 6 years ago by markdixon

Unfortunately, the first patch above (0001-Prevent-cgroup-cpuset-code-from-killing-shepherd-at-.patch) is only an incomplete solution to this problem. It turns out that the shepherd is a multithreaded program, and the patch only prevents its first thread from being killed.

The two new attached patches (prepared against 8.1.5 + the original patch + the patch attached to #1490) try to resolve this.

Mark
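For illustration only (this is not taken from the attached patches, which are written in C): on Linux, the process a thread belongs to can be found from the `Tgid:` field of `/proc/<tid>/status`. A minimal Python sketch, with hypothetical helper names, assuming that lookup is how a thread ID is mapped back to its process:

```python
def tgid_from_status(status_text):
    """Extract the Tgid value from the text of a /proc/<tid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("Tgid:"):
            return int(line.split(":", 1)[1].strip())
    return None  # no Tgid field found


def read_tgid(tid, proc_root="/proc"):
    """Return the Tgid for a thread ID, or None if it cannot be read
    (for example because the thread has already exited)."""
    try:
        with open(f"{proc_root}/{tid}/status") as f:
            return tgid_from_status(f.read())
    except OSError:
        return None
```

Every thread of the shepherd reports the same Tgid, equal to the shepherd's pid, whereas only the first thread has a thread ID equal to that pid. That is why a check on thread IDs alone protects just the first thread.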

comment:3 Changed 6 years ago by markdixon

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:4 Changed 5 years ago by Mark Dixon <m.c.dixon@…>

  • Resolution set to fixed
  • Status changed from reopened to closed

In 4695/sge:

Fix #1483: Really prevent cgroup/cpuset code from killing shepherd at job end
The execd previously went through a cgroup task list to find out what
to kill at the end of a job. This lists threads.

This commit causes the execd:

  • To kill processes (Tgid's) rather than threads (tid's) - hopefully a nicer way to get rid of multithreaded processes.
  • To compare Tgid's against the shepherd pid, to do a better job of avoiding killing the shepherd.
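The selection logic described in this commit message can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the committed C code: `select_kill_targets` and the `tgid_of` lookup callback are hypothetical names, and the lookup is assumed to return `None` for a thread that has already gone away.

```python
def select_kill_targets(tids, shepherd_pid, tgid_of):
    """Turn a list of thread IDs from a cgroup task list into a list of
    process IDs (Tgids) to kill, leaving out the shepherd.

    tgid_of is a callable mapping a thread ID to its Tgid, or to None
    if the thread no longer exists.
    """
    targets = []
    for tid in tids:
        tgid = tgid_of(tid)
        if tgid is None:
            continue  # thread vanished between listing and lookup
        if tgid == shepherd_pid:
            continue  # any thread of the shepherd, not just its first one
        if tgid not in targets:
            targets.append(tgid)  # one entry per process, not per thread
    return targets


# Example: the shepherd (pid 100) has two threads; process 200 has three.
example_tgids = {100: 100, 101: 100, 200: 200, 201: 200, 202: 200, 300: 300}
example_targets = select_kill_targets(
    [100, 101, 200, 201, 202, 300], 100, example_tgids.get
)
```

Here `example_targets` contains only 200 and 300. Comparing raw thread IDs against the shepherd pid instead would have left thread 101 of the shepherd in the kill list, which is the gap the first patch left open.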

comment:5 Changed 5 years ago by Dave Love <d.love@…>

In 4696/sge:

Resolve conflict with Fix #1483: Really prevent ...

comment:6 Changed 5 years ago by Dave Love <d.love@…>

In 4698/sge:

In remove_shepherd_cpuset, only kill processes, and clean up #1483 changes
Refs #1483
