Opened 8 years ago
Closed 7 years ago
#1459 closed defect (duplicate)
USE_CGROUPS sets host in error state
Reported by: | mikaelb | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 8.1.3 |
Severity: | major | Keywords: | |
Cc: |
Description
I have been testing the USE_CGROUPS option that is available to execd. With USE_CGROUPS enabled, submitting a single job to a queue instance on an execution node works fine. However, if a second job is submitted to the same queue instance, it fails and puts the queue instance into an error state because the shepherd exited with return code 7. The shepherd trace shows the following:
Shepherd trace:
03/13/2013 22:39:47 [0:17310]: shepherd called with uid = 0, euid = 0
03/13/2013 22:39:47 [400:17310]: starting up 8.1.3
03/13/2013 22:39:47 [400:17310]: can't open file pid: Permission denied
Jobs that successfully start have job spool directories owned by the gridadmin administrative user (the user SGE runs as), while the spool directories of the failed jobs are still owned by root.
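For anyone trying to reproduce this, the ownership difference can be checked with a small script. The following is only a minimal sketch: the function name `dirs_not_owned_by` is made up for illustration, and the location of the job spool directories is site-specific, so it has to be passed in by the caller.

```python
import os


def dirs_not_owned_by(spool_root, expected_uid):
    """Return the sorted names of subdirectories of spool_root
    whose owner is not expected_uid.

    spool_root is assumed to be the directory that holds the
    per-job spool directories; its location depends on the site.
    """
    mismatched = []
    with os.scandir(spool_root) as entries:
        for entry in entries:
            # Only consider real directories; do not follow symlinks.
            if not entry.is_dir(follow_symlinks=False):
                continue
            # Compare the directory owner with the expected admin uid.
            if entry.stat(follow_symlinks=False).st_uid != expected_uid:
                mismatched.append(entry.name)
    return sorted(mismatched)
```

Called with the uid of the administrative user (for example `pwd.getpwnam("gridadmin").pw_uid`), it should list the spool directories of the failed jobs, assuming they are still owned by root as described above.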
If I turn off USE_CGROUPS everything works ok.
Initially I thought this was some race condition triggered when jobs are started in rapid succession, but further testing showed that it happens whenever a second job is started on the same execution host.
Change History (3)
comment:1 Changed 8 years ago by dlove
comment:2 Changed 7 years ago by markdixon
Hi,
I think I've opened a duplicate ticket for this bug (sorry). You might find that the suggested patch attached to #1480 helps.
Mark
comment:3 Changed 7 years ago by dlove
- Resolution set to duplicate
- Status changed from new to closed
Sorry for the delay in investigating this; I was working with different code to the released version. I'll look at it soon.