#1483 closed defect (fixed)
Prevent cgroup/cpuset code from killing shepherd at job end
| Reported by: | markdixon | Owned by: | Mark Dixon <m.c.dixon@…> |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | sge | Version: | 8.1.5 |
| Severity: | minor | Keywords: | |
| Cc: | | | |
Description
Prevent cgroup/cpuset code from killing shepherd at job end
When the execd_params option USE_CGROUPS is enabled, the cgroup/cpuset
cleanup code checks for and kills processes related to the job. This
includes the shepherd, which triggers the job cleanup signal handler. However,
because the execd also kills the shepherd elsewhere, the job cleanup code can
be traversed twice.
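To illustrate the idea of skipping the shepherd during cleanup, here is a minimal sketch in C. It is not the attached patch and is not taken from the SGE source: the function name `collect_kill_targets` and its parameters are hypothetical, and it assumes the job's cpuset exposes a `tasks`-style list with one decimal ID per line.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper, not from the SGE source: read the IDs listed in an
 * open cpuset "tasks"-style stream (assumed: one decimal ID per line) and
 * collect those that should be signalled, leaving out the shepherd's PID.
 * Returns the number of IDs stored in out. */
size_t collect_kill_targets(FILE *tasks, pid_t shepherd_pid,
                            pid_t *out, size_t max_out)
{
    size_t n = 0;
    long id;

    assert(tasks != NULL && out != NULL);
    while (n < max_out && fscanf(tasks, "%ld", &id) == 1) {
        if ((pid_t)id == shepherd_pid)
            continue;   /* leave the shepherd for the execd to handle */
        out[n++] = (pid_t)id;
    }
    return n;
}
```

A caller would then signal only the collected IDs, so the shepherd's cleanup handler is triggered once, by the execd, rather than a second time by the cpuset cleanup.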
This has been seen to be a problem when the node running the job master
qrsh's back into itself. In that case, the most obvious symptoms are:
- Messages of the following form in the execd logs:
    10/14/2013 12:15:23| main|comp1|W|rogue process(es) found for task 1353.1
    10/14/2013 12:15:23| main|comp1|E|shepherd of job 1353.1 died through signal = 9
    10/14/2013 12:15:23| main|comp1|E|abnormal termination of shepherd for job 1353.1: "exit_status" file is empty
    10/14/2013 12:15:23| main|comp1|E|can't open usage file "active_jobs/1353.1/usage" for job 1353.1: No such file or directory
    10/14/2013 12:15:23| main|comp1|E|shepherd exited with exit status 19: before writing exit_status
- A job failure email sent to adminmail
- The job start_time / end_time entries in the accounting file are 0
(interpreted as -/- in qacct)
Suggested patch to skip the shepherd is attached.
All the best,
Mark
--
Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
0001-Prevent-cgroup-cpuset-code-from-killing-shepherd-at-.patch
Attachments (3)
Change History (9)
Changed 7 years ago by markdixon
comment:1 Changed 7 years ago by Mark Dixon <m.c.dixon@…>
- Owner set to Mark Dixon <m.c.dixon@…>
- Resolution set to fixed
- Status changed from new to closed
In 4651/sge:
Changed 7 years ago by markdixon
Changed 7 years ago by markdixon
comment:2 Changed 7 years ago by markdixon
Unfortunately, the first patch above (0001-Prevent-cgroup-cpuset-code-from-killing-shepherd-at-.patch) is only an incomplete solution to this problem: it turns out that the shepherd is a multithreaded program, and the patch only prevents the first thread from being killed.
The two new attached patches (prepared against 8.1.5 + the original patch + the patch attached to #1490) try to resolve this.
Mark
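One possible way to recognise every thread of a multithreaded shepherd is to compare thread group IDs rather than raw IDs. The sketch below is only an illustration under that assumption and may differ from what the attached patches do: `parse_tgid` and `is_shepherd_thread` are hypothetical names, and the code relies on the `Tgid:` field of the Linux `/proc/<id>/status` file.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: extract the thread group ID from a stream holding
 * the contents of a Linux /proc/<id>/status file.
 * Returns -1 if no "Tgid:" line is found. */
pid_t parse_tgid(FILE *status)
{
    char line[256];
    long tgid;

    assert(status != NULL);
    while (fgets(line, sizeof line, status) != NULL) {
        if (sscanf(line, "Tgid: %ld", &tgid) == 1)
            return (pid_t)tgid;
    }
    return -1;
}

/* A task is treated as part of the shepherd when its thread group ID
 * equals the shepherd's PID, which covers all of the shepherd's threads. */
int is_shepherd_thread(FILE *status, pid_t shepherd_pid)
{
    pid_t tgid = parse_tgid(status);

    return tgid != -1 && tgid == shepherd_pid;
}
```

With a check like this, the cleanup loop would skip any listed ID whose thread group is the shepherd, not just the ID equal to the shepherd's PID.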
comment:3 Changed 7 years ago by markdixon
- Resolution fixed deleted
- Status changed from closed to reopened
comment:4 Changed 7 years ago by Mark Dixon <m.c.dixon@…>
- Resolution set to fixed
- Status changed from reopened to closed
In 4695/sge:
comment:5 Changed 7 years ago by Dave Love <d.love@…>
In 4696/sge:
comment:6 Changed 7 years ago by Dave Love <d.love@…>
In 4698/sge: