Opened 6 years ago
#1512 new defect
cpusets and multiple tasks of parallel job on same node
| Reported by: | markdixon | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | sge | Version: | 8.1.6 |
| Severity: | minor | Keywords: | |
| Cc: | | | |
Description
Hi,
I've noticed a problem with the new cgroup/cpuset feature and parallel jobs that use qrsh to launch SLAVE tasks on the same node as the MASTER task (Intel MPI, for example, does this to launch ranks running on the same node as the mpirun command).
When the SLAVE task launched by qrsh exits, the execd cleans up the entire job on that node - killing the job script running under the MASTER task.
If a job script runs several mpirun commands in a row, or does some post-processing at the end, the later steps never run (or worse, are killed shortly after they start).
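A minimal job script along the following lines should show where the job gets cut off. It is only a sketch: the parallel environment name `mpi`, the slot count, the binaries `./solver_a` and `./solver_b`, and the `step` helper are all hypothetical placeholders, not taken from a real site configuration. Each step writes a marker to a log file before and after it runs, so the last marker in the log shows how far the script got.

```shell
#!/bin/sh
#$ -cwd
#$ -pe mpi 16
# NOTE: "mpi" and 16 above are assumed values; use your site's PE name and slot count.

# Launcher command; overridable so the script can be dry-run outside the scheduler.
LAUNCHER=${LAUNCHER:-mpirun}
# Log file that records how far the script got.
LOG=${LOG:-steps.log}
: > "$LOG"

# Run one command, writing a marker before and after it.
step() {
    echo "start: $*" >> "$LOG"
    "$@" && rc=0 || rc=$?
    echo "end: $* (exit $rc)" >> "$LOG"
}

# Two parallel runs in a row, then some post-processing.
# ./solver_a and ./solver_b are hypothetical MPI binaries.
step $LAUNCHER ./solver_a
step $LAUNCHER ./solver_b
step echo "post-processing"
```

If the behaviour is as described above, I would expect the log on an affected node to stop at or shortly after the first mpirun step, with no markers for the later steps.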
Mark