Opened 5 years ago

#1512 new defect

cpusets and multiple tasks of parallel job on same node

Reported by: markdixon
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.6
Severity: minor
Keywords:
Cc:

Description

Hi,

I've noticed a problem with the new cgroup/cpuset feature and parallel jobs that use qrsh to launch SLAVE tasks on the same node as the MASTER task (Intel MPI, for example, does this to launch ranks running on the same node as the mpirun command).

When the SLAVE task launched by qrsh exits, the execd cleans up the entire job on that node - killing the job script running under the MASTER task.

If a job script runs several mpirun commands in a row, or does some post-processing at the end, those later steps will never run (or worse, will be killed shortly after they start).
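
A job script of roughly the following shape illustrates the pattern. This is only a sketch, not the script from the original report: the parallel environment name `mpi`, the slot count, the use of `hostname` as the MPI payload, and the `MPIRUN` override variable are all assumptions made for illustration.

```sh
#!/bin/sh
#$ -cwd
#$ -pe mpi 4            # "mpi" is a hypothetical parallel environment name

# Launcher command; defaults to mpirun, but can be overridden (e.g. MPIRUN=:)
# for a dry run on a machine without MPI.
MPIRUN=${MPIRUN:-mpirun}

run_job() {
    echo "first mpirun starting"
    $MPIRUN -np "${NSLOTS:-1}" hostname || return 1

    # With the behaviour described in this ticket, the job script may
    # already have been killed before control reaches this point.
    echo "second mpirun starting"
    $MPIRUN -np "${NSLOTS:-1}" hostname || return 1

    echo "post-processing reached"
}

# JOB_ID is set by the scheduler inside a running job; only run there.
if [ -n "${JOB_ID:-}" ]; then
    run_job
fi
```

If the final "post-processing reached" line is missing from the job's output file, that would be consistent with the job script having been killed when an earlier SLAVE task exited.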

Mark

Change History (0)
