Opened 6 years ago

Last modified 13 months ago

#1479 reopened defect

Core binding problems for multiple tasks on same node

Reported by: markdixon
Owned by: wish
Priority: normal
Milestone:
Component: sge
Version: 8.1.5
Severity: minor
Keywords:
Cc: orion@…

Description

Hi,

I've been looking at core binding of MPI libraries and discovered a problem with how cores are allocated by the original core binding code, dating from version 6.2u<something>.

It would seem that the execd needs a slightly more sophisticated way of keeping track of cores in use.

The state held centrally by the execd is a string like "SCCCCScccc", showing what is in use but not keeping track of what cores are allocated to what jobs.
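To make the limitation concrete, here is a minimal sketch of reading such a string. It assumes that "S" starts a socket, "C" is a free core and lowercase "c" is a core in use; the function name parse_topology is hypothetical and is not taken from the SGE sources.

```python
def parse_topology(topology):
    """Return a list of (socket_index, core_index, in_use) tuples.

    Assumption for this sketch: 'S'/'s' starts a new socket, 'C' is a free
    core and 'c' is a core that is currently in use.
    """
    cores = []
    socket = -1
    core = 0
    for ch in topology:
        if ch in "Ss":
            socket += 1
            core = 0
        elif ch in "Cc":
            cores.append((socket, core, ch.islower()))
            core += 1
        # Any other letter (for example a hardware thread) is ignored here.
    return cores


cores = parse_topology("SCCCCScccc")
free = [(s, c) for s, c, used in cores if not used]
busy = [(s, c) for s, c, used in cores if used]
```

The parsed result can say that the four cores of the second socket are busy, but nothing in it says which job or task holds them.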

This becomes a problem when a job launches on a node and then qrsh's back into the same node (or multiple qrsh's are simultaneously used to access the same node by the same job). It would appear that:

  • When the first task (MASTER or SLAVE) starts on a node, the execd attempts to allocate the number of cores requested by the user, using the requested strategy.
  • When another (SLAVE) task starts on the same node, the execd again attempts to allocate the number of cores requested by the user, using the requested strategy.

In this situation, either more cores are bound to a job than there should be, or tasks are going unbound because other tasks have swiped all the cores first.
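The behaviour described above can be reproduced with a toy model. This is only a sketch under stated assumptions: the class NaiveExecd and its allocate method are hypothetical, the strategy is a simple "first free cores" one, and none of it is the actual execd code.

```python
class NaiveExecd:
    """Toy model: tracks only which cores are busy, not who owns them."""

    def __init__(self, topology):
        # Assumption: 'C' is a free core, 'c' is a core in use.
        self.topology = list(topology)

    def allocate(self, n):
        """Claim the first n free cores, or bind nothing if too few are free.

        Returns the positions of the claimed cores in the topology string.
        """
        free = [i for i, ch in enumerate(self.topology) if ch == "C"]
        if len(free) < n:
            return []  # not enough free cores: the task runs unbound
        chosen = free[:n]
        for i in chosen:
            self.topology[i] = "c"
        return chosen


execd = NaiveExecd("SCCCC")
requested = 2  # cores requested by the job

# Three tasks of the SAME job start on the same node.
master = execd.allocate(requested)
slave1 = execd.allocate(requested)
slave2 = execd.allocate(requested)
state = "".join(execd.topology)
```

In this model the job asked for two cores but ends up holding four (two for the master, two more for the first slave), and the second slave finds nothing left and runs unbound.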

[I discovered this because IntelMPI currently uses qrsh to launch local ranks and I've not figured out a way to stop it yet]

I thought it would be useful to record it here :)

Mark
--


Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


Attachments (1)

signature.asc (819 bytes) - added by w.hay@… 3 years ago.
Added by email2trac


Change History (7)

comment:1 Changed 6 years ago by dlove

Hmm -- well spotted. (That may even explain some of Cliff's
observations.)

I guess the solution is to check the task's usage list to see if it
already has a binding, and use that if so.
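A minimal sketch of that idea, assuming the binding is remembered per job in a dictionary. The class JobAwareExecd, the bindings dictionary and the job_id key are hypothetical stand-ins for whatever the usage list actually provides.

```python
class JobAwareExecd:
    """Toy model: remembers which cores were handed to which job."""

    def __init__(self, topology):
        # Assumption: 'C' is a free core, 'c' is a core in use.
        self.topology = list(topology)
        self.bindings = {}  # job_id -> positions of the cores bound to it

    def allocate(self, job_id, n):
        """Reuse the job's existing binding if it has one, else claim n cores."""
        if job_id in self.bindings:
            return self.bindings[job_id]
        free = [i for i, ch in enumerate(self.topology) if ch == "C"]
        if len(free) < n:
            return []  # not enough free cores: the task runs unbound
        chosen = free[:n]
        for i in chosen:
            self.topology[i] = "c"
        self.bindings[job_id] = chosen
        return chosen


execd = JobAwareExecd("SCCCC")
first = execd.allocate("job1", 2)   # first task of job1 on this node
second = execd.allocate("job1", 2)  # later task of job1 reuses the binding
other = execd.allocate("job2", 2)   # a different job still gets its own cores
```

This sketch only covers allocation; it does not decide when the cores are handed back.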

comment:2 Changed 4 years ago by wish

  • Owner set to wish
  • Status changed from new to accepted

Also only deallocate the cores when the last pe task of a job using them terminates. Bonus fun
when doing a live upgrade.
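One way to express "only deallocate when the last task terminates" is a per-job count of live tasks. The following is a sketch under that assumption; the names RefCountedBindings, start_task and end_task are hypothetical and do not come from the SGE sources.

```python
class RefCountedBindings:
    """Toy model: cores are freed only when a job's last task ends."""

    def __init__(self, topology):
        # Assumption: 'C' is a free core, 'c' is a core in use.
        self.topology = list(topology)
        self.bindings = {}  # job_id -> [core positions, live task count]

    def start_task(self, job_id, n):
        """Bind a new task, sharing the job's existing cores if it has any."""
        if job_id in self.bindings:
            entry = self.bindings[job_id]
            entry[1] += 1
            return entry[0]
        free = [i for i, ch in enumerate(self.topology) if ch == "C"]
        if len(free) < n:
            return []  # not enough free cores: the task runs unbound
        chosen = free[:n]
        for i in chosen:
            self.topology[i] = "c"
        self.bindings[job_id] = [chosen, 1]
        return chosen

    def end_task(self, job_id):
        """Drop one task; release the cores when no bound task remains."""
        entry = self.bindings.get(job_id)
        if entry is None:
            return  # the job had no binding on this node
        entry[1] -= 1
        if entry[1] == 0:
            for i in entry[0]:
                self.topology[i] = "C"
            del self.bindings[job_id]


node = RefCountedBindings("SCCCC")
node.start_task("job1", 2)
node.start_task("job1", 2)
node.end_task("job1")
after_first_exit = "".join(node.topology)
node.end_task("job1")
after_last_exit = "".join(node.topology)
```

The cores stay marked as in use after the first task exits and are only returned once the second one has gone.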


comment:3 Changed 3 years ago by Dave Love <d.love@…>

  • Resolution set to fixed
  • Status changed from accepted to closed

In 4922/sge:

Fix #1479 (partially): Ensure all pe_tasks of a ja_task get the same core binding
Deallocation probably still needs attention.

comment:4 Changed 3 years ago by dlove

  • Resolution fixed deleted
  • Status changed from closed to reopened

Consider deallocation

comment:5 Changed 3 years ago by w.hay@…

On Fri, Sep 09, 2016 at 09:51:40AM +0000, SGE wrote:

#1479: Core binding problems for multiple tasks on same node



Changes (by dlove):

  • status: closed => reopened
  • resolution: fixed =>

Comment:

Consider deallocation

So I stuck my name on this, didn't provide a patch, and now dlove has a
partial fix. I've been trying to tackle this from the other end, working
on deallocation first. The reasons for this are twofold:

i) Fixing allocation without deallocation could cause SoGE to bind a
job to cores that another job has already been bound to. This would be
a pessimal binding rather than the merely non-optimal bindings caused
by #1479. The circumstances that could cause SoGE's notion of which
cores are bound to be out of sync with reality are fairly rare, but once
it happens it is likely to persist, I think.

ii) Getting deallocation right seems to be the harder of the two related
problems, and therefore the one that should be tackled first, so that the
solution to the simpler problem doesn't impose additional constraints
on the solution to the harder problem.

That said, I've had my name on this for a long time, and although I have the
start of a combined allocation/deallocation fix, it still needs some work.
I'm working on it very slowly, and this bug probably doesn't need both me
and dlove working on it.

So Dave, if the deallocation problem seems simple to you with your greater
knowledge of the code base, feel free to change the owner on the ticket to
yourself; otherwise I'll keep working on my combined fix.

William


comment:6 Changed 13 months ago by opoplawski

  • Cc orion@… added