Opened 6 years ago

Closed 6 years ago

#1459 closed defect (duplicate)

USE_CGROUPS sets host in error state

Reported by: mikaelb
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.3
Severity: major
Keywords:
Cc:

Description

I have been testing the USE_CGROUPS option that is available to execd. With USE_CGROUPS enabled, submitting a single job to a queue instance on an execution node works fine. However, if a second job is submitted to the same queue instance, it fails and puts the queue instance into an error state because the shepherd exited with return code 7. The shepherd trace shows the following:

Shepherd trace:
03/13/2013 22:39:47 [0:17310]: shepherd called with uid = 0, euid = 0
03/13/2013 22:39:47 [400:17310]: starting up 8.1.3
03/13/2013 22:39:47 [400:17310]: can't open file pid: Permission denied

Jobs that successfully start have job spool directories owned by the gridadmin administrative user (the user SGE runs as), while the spool directories of the failed jobs are still owned by root.
If I turn off USE_CGROUPS everything works ok.
Initially I thought this was some race condition triggered when jobs are started in rapid succession, but further testing showed that it happens whenever a second job is started on the same execution host.
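
The ownership difference described above can be checked with a small script. This is only a minimal sketch, not part of SGE: the helper name `dirs_not_owned_by` is hypothetical, and the location of the job spool directories is an assumption that depends on the local execd spool configuration, so it is passed in as a parameter.

```python
import os
from pathlib import Path


def dirs_not_owned_by(spool_root, expected_uid):
    """Return sorted names of subdirectories of spool_root whose owner
    is not expected_uid.

    spool_root is assumed to be the directory that holds the per-job
    spool directories; the actual path is site-specific.
    """
    mismatched = []
    for entry in Path(spool_root).iterdir():
        # Only job spool directories are of interest, not plain files.
        if entry.is_dir() and entry.stat().st_uid != expected_uid:
            mismatched.append(entry.name)
    return sorted(mismatched)
```

For example, with the admin user from this report, `dirs_not_owned_by(spool_root, pwd.getpwnam("gridadmin").pw_uid)` (after `import pwd`, with `spool_root` set to the site's job spool location) would list the job directories that are not owned by gridadmin, which in the failing case would be the ones still owned by root.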

Change History (3)

comment:1 Changed 6 years ago by dlove

Sorry for the delay in investigating this; I was working with different
code to the released version. I'll look at it soon.

comment:2 Changed 6 years ago by markdixon

Hi,

I think I've opened a duplicate ticket for this bug (sorry). You might find that the suggested patch attached to #1480 helps.

Mark

comment:3 Changed 6 years ago by dlove

  • Resolution set to duplicate
  • Status changed from new to closed