Opened 7 years ago
#1496 new defect
Multiple qrsh's to the same slave node from same job cause cgroup problems
Reported by: | markdixon | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 8.1.5 |
Severity: | minor | Keywords: | |
Cc: |
Description
I'm running 8.1.5 plus some local patches that help make the cgroup functionality work (cf. #1477, #1480, #1483 and, I think, #1490). Hopefully those patches don't make this report irrelevant.
We have users whose jobs call Open MPI's mpirun several times in a row. The second and later calls fail unless we put a short (1 second) sleep between them.
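For illustration, the workaround looks roughly like the sketch below. It assumes a POSIX shell job script. The helper name `run_with_pause`, the `PAUSE` variable and the `./step1`-style program names are made up for this example and are not part of gridengine or of our users' actual scripts.

```shell
#!/bin/sh
# Hypothetical helper: run each command in turn, sleeping between them.
# PAUSE defaults to 1 second, the delay that has been enough for us.
PAUSE="${PAUSE:-1}"

run_with_pause() {
    first=1
    for cmd in "$@"; do
        # Sleep before every command except the first one.
        if [ "$first" -eq 0 ]; then
            sleep "$PAUSE"
        fi
        first=0
        # Stop at the first failing command and pass on its exit status.
        sh -c "$cmd" || return $?
    done
}

# Example use inside a job script (program names are placeholders):
# run_with_pause "mpirun ./step1" "mpirun ./step2" "mpirun ./step3"
```

This only papers over the problem: the sleep presumably gives the previous mpirun's cleanup time to finish before the next one starts.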
I presume the cgroup create/cleanup code is being called out of order here. I haven't really done any debugging, but I couldn't reproduce the problem with a simple script containing a number of sequential qrsh calls. I wonder whether mpirun returns to the shell while it is still cleaning up its qrsh sessions.
Mark