Opened 6 years ago

#1496 new defect

Multiple qrsh calls to the same slave node from the same job cause cgroup problems

Reported by: markdixon Owned by:
Priority: normal Milestone:
Component: sge Version: 8.1.5
Severity: minor Keywords:
Cc:

Description

I'm running 8.1.5 plus some local patches to help make the cgroup functionality work (cf. #1477, #1480, #1483 and, I think, #1490). Hopefully that doesn't make this report irrelevant.

We have users whose jobs call Open MPI's mpirun several times in a row. We find that the second and later calls fail unless we put a short (1 second) sleep between them.
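For reference, the workaround could be sketched as a small helper in the job script. This is only an illustration, assuming a POSIX shell: `run_with_pause` is a made-up name (not part of SGE or Open MPI), and the mpirun command line in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: run the given command COUNT times in a row,
# sleeping PAUSE seconds between consecutive runs (not after the last).
# Usage: run_with_pause PAUSE COUNT command [args...]
run_with_pause() {
    pause=$1
    count=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        # Stop at the first failing run and pass its exit status on
        "$@" || return $?
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}

# Example (placeholder command line): three mpirun calls, 1 second apart
# run_with_pause 1 3 mpirun ./my_app
```

The pause is only inserted between runs, which matches what we do by hand; whether 1 second is always long enough for the cleanup to finish is not something I've verified.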

I presume that the cgroup create/cleanup code is being called out of order here. I've not really done any debugging, but I couldn't reproduce it with a simple script containing a number of sequential qrsh calls. I wonder if mpirun returns to the shell while it's still cleaning up its qrsh processes.

Mark

Change History (0)
