Custom Query (431 matches)


Results (121 - 123 of 431)

Ticket Resolution Summary Owner Reporter
#1583 fixed Project object usage in spool should only be updated if it has changed Mark Dixon <m.c.dixon@…> markdixon

From the commit:

Fix xxx only update project usage in spool if it has changed

Like user objects, project objects used to be updated in the spool only if they had changed. Monster change AA-2007-08-20-0 split user and project usage into separate data structures but neglected to filter project objects on scheduling sequence number.

This commit causes project objects in the spool to be updated only when they have changed, rather than at every scheduling interval.
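The filtering idea can be illustrated with a minimal sketch. It is not the code from the commit: the `SpoolObject` class, its `usage_seq_no` field, and the `objects_to_spool` helper are hypothetical names, and the sketch simply assumes each object records the scheduling sequence number at which its usage last changed.

```python
from dataclasses import dataclass


@dataclass
class SpoolObject:
    """A user or project object; all field names here are hypothetical."""
    name: str
    usage: float
    usage_seq_no: int  # scheduling sequence number of the last usage change


def objects_to_spool(objects, last_spooled_seq_no):
    """Return only the objects whose usage changed after the last spool write."""
    return [obj for obj in objects if obj.usage_seq_no > last_spooled_seq_no]


projects = [
    SpoolObject("proj_a", usage=10.0, usage_seq_no=41),
    SpoolObject("proj_b", usage=12.5, usage_seq_no=42),
]

# Only proj_b changed since sequence number 41, so only it would be rewritten.
changed = objects_to_spool(projects, last_spooled_seq_no=41)
print([p.name for p in changed])
```

Without such a filter, every project object would be rewritten at each scheduling interval, which is the behaviour the ticket describes.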

Commit prepared against 8.1.9

Note that usage stored in the spool can still end up considerably out of date due to #1554.

#1480 fixed Prevent root-owned files in execd active_job spool area markdixon

The new cgroup/cpuset code uses a couple of routines for switching effective uid/gid which appear to be causing some problems.

Some of the side symptoms include the following files in the execd spool sometimes being owned by root:

  • active_jobs/<JID>.<TASK>/config
  • active_jobs/<JID>.<TASK>/environment
  • active_jobs/<JID>.<TASK>/pe_hostfile
  • active_jobs/<JID>.<TASK>/<NUM>.<HOST>/

That last entry is a directory created for a SLAVE task. It being root-owned can cause jobs to fail with a "can't open pid file" error message.
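To check a spool area for this symptom, something like the following sketch could be used. It is only a diagnostic aid, not part of the attached patch; the helper name `entries_owned_by` is made up for illustration, and the demo runs against a throwaway directory that merely mimics the layout listed above.

```python
import os
import tempfile


def entries_owned_by(spool_dir, uid):
    """Return sorted paths under spool_dir whose owner is uid."""
    matches = []
    for root, dirs, files in os.walk(spool_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            # lstat reports a symlink's own owner rather than its target's
            if os.lstat(path).st_uid == uid:
                matches.append(path)
    return sorted(matches)


# Demo on a temporary directory; the job ID 1353.1 is just an example value.
with tempfile.TemporaryDirectory() as spool:
    job_dir = os.path.join(spool, "active_jobs", "1353.1")
    os.makedirs(job_dir)
    for name in ("config", "environment", "pe_hostfile"):
        open(os.path.join(job_dir, name), "w").close()
    # Everything was just created by this process, so it is owned by our euid.
    mine = entries_owned_by(spool, os.geteuid())
    print(len(mine))
```

On a real execution host, calling `entries_owned_by` with the execd spool directory and uid 0 would list any root-owned entries.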

The execd appears to have the correct euid/egid when entering the cgroup code, so I have removed the offending function calls. I don't know if there's a good reason for them that I've not noticed in limited testing.

Potential patch attached.


Mark

--
Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


#1483 fixed Prevent cgroup/cpuset code from killing shepherd at job end Mark Dixon <m.c.dixon@…> markdixon

Prevent cgroup/cpuset code from killing shepherd at job end

When the execd_params option USE_CGROUPS is enabled, the cgroup/cpuset cleanup code checks for and kills processes related to the job. This includes the shepherd, which triggers the job cleanup signal handler. However, because the execd also kills the shepherd elsewhere, the job cleanup code can end up running twice as many times as usual.

This has been seen to be a problem when the node running the job master qrsh's back into itself. In that case, the most obvious symptoms are:

  • Messages of the following form in the execd logs:

10/14/2013 12:15:23| main|comp1|W|rogue process(es) found for task 1353.1
10/14/2013 12:15:23| main|comp1|E|shepherd of job 1353.1 died through signal = 9
10/14/2013 12:15:23| main|comp1|E|abnormal termination of shepherd for job 1353.1: "exit_status" file is empty
10/14/2013 12:15:23| main|comp1|E|can't open usage file "active_jobs/1353.1/usage" for job 1353.1: No such file or directory
10/14/2013 12:15:23| main|comp1|E|shepherd exited with exit status 19: before writing exit_status

  • A job failure email sent to adminmail
  • The job start_time / end_time entries in the accounting file are 0 (interpreted as -/- in qacct)
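When looking for these symptoms in the execd logs, it may help to split the messages into fields. The sketch below assumes the pipe-separated layout seen in the sample messages above (timestamp, thread, host, level, message); that layout is inferred from the sample only, and `ExecdLogLine` and `parse_line` are illustrative names.

```python
from typing import NamedTuple, Optional


class ExecdLogLine(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[ExecdLogLine]:
    """Split one log line into five fields; return None if it does not fit."""
    # maxsplit=4 keeps any further '|' characters inside the message field
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = parts
    return ExecdLogLine(
        timestamp.strip(), thread.strip(), host.strip(), level.strip(), message
    )


sample = "10/14/2013 12:15:23| main|comp1|E|shepherd of job 1353.1 died through signal = 9"
entry = parse_line(sample)
print(entry.level, entry.message)
```

Filtering the parsed entries on level `E` and the word "shepherd" in the message would then pick out the errors listed above.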

Suggested patch to skip the shepherd is attached.
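The idea of skipping the shepherd can be sketched as follows. This is not the attached patch: `pids_to_kill` is a hypothetical helper, and the PIDs are invented for the example. It only shows the filtering step, assuming the cleanup code already knows the job's process IDs and the shepherd's PID.

```python
def pids_to_kill(cgroup_pids, shepherd_pid):
    """Return the job's PIDs that cleanup may signal, leaving the shepherd alone."""
    return [pid for pid in cgroup_pids if pid != shepherd_pid]


# Hypothetical PIDs: 4100 is the shepherd, the rest are job processes.
remaining = pids_to_kill([4100, 4101, 4102], shepherd_pid=4100)
print(remaining)  # [4101, 4102]
```

Leaving the shepherd to the execd's existing termination path means the job cleanup signal handler is triggered from one place only.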

All the best,

Mark

--
Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

