Custom Query (431 matches)

Results (106 - 108 of 431)

Ticket Resolution Summary Owner Reporter
#1483 fixed Prevent cgroup/cpuset code from killing shepherd at job end Mark Dixon <m.c.dixon@…> markdixon
Description

Prevent cgroup/cpuset code from killing shepherd at job end

When the execd_params option USE_CGROUPS is enabled, the cgroup/cpuset cleanup code checks for and kills processes related to the job. This includes the shepherd, which triggers the job cleanup signal handler. However, because the execd also kills the shepherd elsewhere, the job cleanup code can end up running twice instead of once.

This has been seen to be a problem when the node running the job master qrsh's back into itself. In that case, the most obvious symptoms are:

  • Messages of the following form in the execd logs:

10/14/2013 12:15:23| main|comp1|W|rogue process(es) found for task 1353.1
10/14/2013 12:15:23| main|comp1|E|shepherd of job 1353.1 died through signal = 9
10/14/2013 12:15:23| main|comp1|E|abnormal termination of shepherd for job 1353.1: "exit_status" file is empty
10/14/2013 12:15:23| main|comp1|E|can't open usage file "active_jobs/1353.1/usage" for job 1353.1: No such file or directory
10/14/2013 12:15:23| main|comp1|E|shepherd exited with exit status 19: before writing exit_status

  • A job failure email sent to adminmail
  • The job start_time / end_time entries in the accounting file are 0 (interpreted as -/- in qacct)

Suggested patch to skip the shepherd is attached.
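
The skip logic described above can be sketched in shell. This is only an illustration and is not taken from the attached patch, which changes execd itself. The function name pids_to_kill and the idea of reading PIDs from a file with one PID per line (such as a cgroup "tasks" file) are assumptions made for the example.

```sh
# Hypothetical helper, not taken from the attached patch: print every PID
# listed in a tasks file (one PID per line) except the shepherd's PID.
pids_to_kill() {
    tasks_file=$1      # assumed: path to a list of PIDs, e.g. a cgroup "tasks" file
    shepherd_pid=$2    # PID that must be left alone
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while read -r pid || [ -n "$pid" ]; do
        [ -n "$pid" ] || continue                    # ignore blank lines
        [ "$pid" = "$shepherd_pid" ] && continue     # skip the shepherd
        printf '%s\n' "$pid"
    done < "$tasks_file"
}
```

A caller would then signal only the PIDs this prints, leaving the shepherd to be handled by the execd's usual path.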

All the best,

Mark
--
Mark Dixon                       Email    : m.c.dixon@…
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


0001-Prevent-cgroup-cpuset-code-from-killing-shepherd-at-.patch

#1490 fixed Gain privileges before execd kills rogue processes Mark Dixon <m.c.dixon@…> markdixon
Description

The rogue process detection enabled when USE_CGROUPS=1 attempts to kill processes as the sge admin user. As that user doesn't normally have the privileges to do so, this patch temporarily gains the privileges of the daemon's starting user (typically root) before killing processes.

Mark


0001-Gain-privileges-before-execd-kills-rogue-processes.patch

#1517 fixed sge_ca only does partial matches against GECOS data Mark Dixon <m.c.dixon@…> markdixon
Description

Hi,

When the sge_ca tool supplied with grid engine is used to create a new certificate, it uses the supplied GECOS data to check if a certificate with the same common name already exists.

That check only does a partial match on CN=<gecos info>, so if a new certificate's GECOS data matches the start of another certificate's GECOS data, it will refuse to create the certificate.

e.g. the cert created by the first command below prevents the second cert from being created:

$SGE_ROOT/util/sgeCA/sge_ca -user "user1:Bobby:bobby@somewhere"
$SGE_ROOT/util/sgeCA/sge_ca -user "user2:Bob:bob@somewhere"

The attached patch fixes this, prepared against 8.1.8.
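
The difference between a partial and an exact match can be sketched as follows. This is not the sge_ca code or the attached patch. The function names, and the assumption that existing common names are stored one per line as CN=<name>, are hypothetical and chosen only to keep the example small.

```sh
# Hypothetical illustration, not the sge_ca code: the file named in $2 is
# assumed to hold one existing common name per line, in the form CN=<name>.

# Partial match: succeeds if any line merely contains "CN=<name>",
# so "Bob" is reported as existing when only "CN=Bobby" is present.
cn_exists_partial() {
    grep -qF "CN=$1" "$2"
}

# Exact match: succeeds only if a whole line equals "CN=<name>".
cn_exists_exact() {
    grep -qxF "CN=$1" "$2"
}
```

With a file containing only the line CN=Bobby, cn_exists_partial "Bob" succeeds (the behaviour reported here), while cn_exists_exact "Bob" fails and would let the second certificate be created.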

Cheers,

Mark
