[GE users] core binding with Sun Grid Engine

dagru d.gruber at sun.com
Fri Aug 28 09:53:07 BST 2009

On 08/27/09 11:47, dagru wrote:
On 08/27/09 11:11, reuti wrote:


On 27.08.2009, at 09:30, dagru wrote:

On 08/25/09 23:57, reuti wrote:

Hi, just some thoughts I had about it: to me the -binding option
looks like it puts a lot of burden on the end user, who has to know
something about each node's configuration. In a first step I would
prefer having the option to define in a queue entry just "bind the
processes to a granted core yes/no". This would mean starting with
item 5 :-/ . This proposal looks suited only for the exclusive usage
of nodes, as I would also fear item 4, and for this reason I
wouldn't put too many features into it.

Integrating a queue entry in order to activate/deactivate binding
could be done, so you don't have to fear item 4. But I'm not sure if
you want to 'force' core binding when it is activated. The notion of
'collisions' was introduced because of the performance penalty, so
that only one job can be bound to the same core.
But the problem with binding itself (on Linux) is always there
anyway, because no special privileges are needed. Hence a user can
enhance his job in order to do whatever he/she wants at any time;
from the OS point of view he/she only makes his program nicer.
And you are right, it is currently just intended for exclusive
host usage - at the moment.

The syntax "-binding striding:0-3:2" could be more generic like in
an RQS: "-binding cores,sockets" "-binding 2,3" - two cores on any
three sockets (i.e. slots=6) "-binding 2,{*}" - two cores on each
socket in the machine "-binding 2,*" - two cores on any socket
(this will give only two slots here) "-binding {2},*" - two cores
on any socket (this will give only two slots here) and use the
socket exclusive even when there are 4 cores "-binding *,1" - all
free cores on any socket "-binding *,2" - all free cores on any
two sockets "-binding {*},2" - all cores on any two sockets (means
exclusive use of two complete sockets) "-binding {2},{*}" - two
cores on each socket and use node exclusive in the end

The syntax seems to cover a lot of requests, but you cannot specify,
for example, a starting point and a step-size (which is what we had
initially in mind). Maybe this is also not needed, but there are some
architectures where it could make sense (when striding is relaxed to
thread-striding on a T2, for example). Anyway, we keep your
suggestion in mind.
Maybe this could be an additional, very flexible, request.

the striding is missing there by intention. I think the end user
won't know the details about the various AMD dual- or quad-core and
Intel pseudo dual-dual-core and simple dual-core chips. I would prefer
specifying the distribution strategy in the exechost/PE/queue
definition and hiding it from the end user.

Putting the distribution strategy in a queue would increase the number of
queues, which should be avoided. A PE is a good candidate. And we have had it
there for a long time for the Solaris OS, using the processor sets feature
in SGE. (Sorry, this is not true - I just saw that "processor" is indeed a
queue attribute, which makes it less useful.)
Unfortunately, due to performance optimizations, it is not working in 6.2 and
above. But I hope it is 're-activated' in 6.2u4 again.
You are right, the user should not have to know about architecture details,
but binding could be used in conjunction with the JSV feature, where the JSV
script fills in all the details.

BTW: What was the original idea of the "-binding [env|pe|set]"? I
mean the "env" and "set" entry, i.e. "-binding env 5"?

'set' is the default when nothing is chosen and means that the shepherd (SGE)
performs the binding itself. 'pe' would mean that either the pe_hostfile is
extended or a new file is generated (this is not clear yet) with <socket>,<core>
pairs for each MPI rank on the hosts, according to the strategy. The binding
itself is then done by the MPI runtime environment, which can use a file (the
so-called 'rankfile') generated by a starter script out of the pe_hostfile/other
file. No binding is done via SGE. And 'env' would mean that an SGE_???
environment variable is generated with the core numbers the job has to be bound
to. No binding from the shepherd is done; it is up to the application to use
this variable in order to bind itself. The environment variable approach is
intended for OpenMP applications (the environment variable is for OMP of course,
not (O)MPI :), which can bind threads to specific cores according to such an
environment variable. These three options are just an addition (prefix) to the
strategy; when none is chosen, 'set' is done.


-- Reuti

To ease things for the end user, the -binding could be put in a
default request. But then we would need different default requests
for different queues/hosts for now.

-- Reuti

On 24.08.2009, at 09:52, dagru wrote:

Hi,

since the advent of complex multi-core CPUs and operating system support for
CPU/core binding, several Sun Grid Engine users want to exploit the
possibilities some operating systems offer. You'll find a draft of the
job-to-core-binding specification for the Solaris and Linux operating systems
appended to this email. Thank you for your feedback!

With kind regards,
Daniel


Job to core binding
-------------------

Version  Comments                                                Date        Author
------------------------------------------------------------------------------------
1.0      Initial Version                                         08/13/2009  DG
1.1      Extending with definitions and architecture specifics   08/14/2009  DG
1.2      Added Solaris kstat support                             08/18/2009  DG
1.3      Added -binding linear:<amount>                          08/18/2009  DG
1.4      Added findings from meeting 08/19                       08/19/2009  DG
1.5      Added findings from meeting 08/20                       08/21/2009  DG


1 Introduction
--------------
With the advent of complex multi-core CPUs and NUMA architectures on cluster
nodes, the operating system scheduling is not always perfect for all kinds of
applications. For parallel applications there are scenarios where it might be
best that the processes/threads are distributed to the different sockets
available on the host; for others it might be better to place them on a single
socket, running on different cores. In the current Sun Grid Engine architecture
there is just the concept of 'slots', with no notion of whether the slots
reflect sockets, cores, or hardware-supported threads. Performing core binding
is also currently not reflected in Sun Grid Engine; until now it is up to the
administrator and/or user to enhance his/her applications to perform such a
binding. This specification is an enhancement for the Sun Grid Engine Solaris
and Linux operating system versions. Reporting topology information and binding
processes to specific cores on hosts is the foundation for additional
fine-grained NUMA settings, like the configuration of specific memory access
patterns on the application side.

1.1 Definitions
---------------

This section contains several definitions of terms frequently used in this
specification and with a specific meaning.

1.1.1 System topology
---------------------

Within this specification the term topology refers to the underlying hardware
structure of a Sun Grid Engine execution host. The topology describes the
amount of sockets the machine has (and which of them are available) and the
amount of cores (or threads on SMT/CMT) each socket has. In case of a virtual
machine, the topology of the virtual machine is reported.

1.1.2 Core affinity or core binding
-----------------------------------

The term core affinity (and within this spec also core binding) refers to the
likelihood of a process to run on the same processor again after it was
preempted by the OS scheduler. The core affinity (also called processor
affinity) can be influenced in the Linux OS via a system call which takes a
bitmask as parameter. In this bitmask each bit reflects one core. If a core is
turned off (via a logical bit with the value 0) then the OS scheduler avoids
migrating the process to that core. By default (i.e. without binding) all
cores are turned on, so that the process can be scheduled to an arbitrary
core. In the Solaris operating system processor sets can be used, which define
a set of processors on which only processes explicitly bound to this set are
able to run.

1.1.3 Collisions of core bindings
---------------------------------

Within this specification the term collision (of two or more core bindings)
refers to the circumstance that there exists at least one pair of processes
where both have set a (non-default) core affinity and both processes share at
least one core. Another source of a collision is when the administrator allows
just one process per socket (in order to avoid oversubscribing socket-related
resources) and, in addition to the process already running on this socket, a
second process wants to use free cores on this socket. The problem with
collisions is that core or socket resources can easily be oversubscribed,
resulting in degraded performance while other sockets or cores are unused.
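
As an illustration only (not part of the specification): in terms of the Linux
affinity bitmask described in section 1.2.2 below, a collision between two
bindings is simply a non-empty intersection of their masks. A minimal sketch,
assuming the glibc cpu_set_t macros:

    /* Illustrative sketch, not SGE code: two bindings "collide" in the
     * sense of section 1.1.3 when the intersection of their affinity
     * masks is non-empty. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdbool.h>

    static bool bindings_collide(cpu_set_t *a, cpu_set_t *b)
    {
        cpu_set_t common;
        CPU_AND(&common, a, b);          /* cores claimed by both bindings */
        return CPU_COUNT(&common) > 0;   /* a shared core means a collision */
    }

    int main(void)
    {
        cpu_set_t a, b;
        CPU_ZERO(&a); CPU_ZERO(&b);
        CPU_SET(1, &a);                     /* job 1 bound to core 1 */
        CPU_SET(1, &b); CPU_SET(2, &b);     /* job 2 bound to cores 1 and 2 */
        return bindings_collide(&a, &b) ? 0 : 1;   /* collide: shared core 1 */
    }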
1.2 Operating system dependent issues
-------------------------------------

1.2.1 Solaris specific behavior
-------------------------------

Sun Grid Engine currently supports the processor set feature of Solaris, which
needs additional administrator configuration. Once a processor set is
established it can be configured on PE level, meaning all processes from the
PE are running within this set. On the other side, each processor which is
assigned to a processor set will run only processes that are explicitly bound
to that processor set. The only exception is when a process requires resources
that are only available in the processor set; then it is allowed to use this
resource. Not all available processors/cores can be included in processor
sets; at least one processor must remain available for the operating system.
The binding to a processor set is inherited across fork and exec.

Solaris 9 and higher supports the concept of locality groups, which builds
latency groups on NUMA systems. With this, topology-related information in
terms of memory latency can be retrieved, but it is not possible to get the
actual amount of physical sockets and cores. For that the kernel kstat
structure has to be used.

Processor binding (binding an LWP to a specific processor) can be performed
via the processor_bind system call. These bindings are inherited across fork
and exec system calls, but with them it is currently only possible to bind a
process/thread to one single core, which differs from the Linux behavior and
cannot be used (because of the danger of oversubscribing one core). Therefore
Solaris processor sets have to be used. Processor sets differ from the Linux
behavior in two important points:

1. Not all available cores on a single machine can be used for core binding
   (at least one core must remain available for the OS).
2. The submitted job runs exclusively on the cores to which it is bound. That
   means that no unbound job is allowed to use these cores.
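
For illustration only (not part of the specification, and requiring sufficient
privileges): a minimal sketch of how a processor set could be created and a
process bound to it on Solaris. The processor IDs 1 and 2 are assumptions of
the sketch; real code would first discover which processors are online.

    /* Hedged sketch, not SGE code: create a processor set, move two
     * (assumed existing and online) processors into it, and bind the
     * calling process to it; children inherit the binding. */
    #include <sys/pset.h>
    #include <sys/procset.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        psetid_t pset;

        if (pset_create(&pset) != 0) {                 /* new, empty set */
            perror("pset_create");
            return 1;
        }
        if (pset_assign(pset, 1, NULL) != 0 ||         /* processor id 1 */
            pset_assign(pset, 2, NULL) != 0) {         /* processor id 2 */
            perror("pset_assign");
            return 1;
        }
        if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
            perror("pset_bind");
            return 1;
        }
        return 0;
    }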
1.2.2 Linux specific behavior
-----------------------------

The Linux scheduler inherently supports soft affinity (also called natural
affinity). This means the scheduler avoids process migrations from one
processing unit to another. In the 2.5 kernel hard processor affinity was
introduced, meaning that the scheduler can be told on which cores a specific
process may or may not run. Patches for the 2.4 kernel are available (for the
system call sched_setaffinity(PID, affinity bitmask)). Newer kernel versions
are also NUMA aware, and memory access policies can be set with the libnuma
library.

The Linux kernel includes a load_balancer component, which is called at
specific intervals (like every 200 ms) or when one run-queue is empty (pull
migration). Each processor has its own scheduler and its own run-queue. The
load_balancer tries to equalize the amount of runnable tasks between these
run-queues. This is done via process migration.

Setting a specific core affinity/binding is done via an affinity bitmask,
which the sched_setaffinity system call accepts as a parameter. Example: 1011
means the process will be bound to the first, second, and fourth core (the
scheduler only dispatches the process to the first, second, or fourth core,
even if the run-queue of core three is empty). The default mask (without
affinity) is 1111 (on a four core machine), which means the scheduler can
dispatch the process to any appropriate core. Core affinity is inherited by
child processes, but each process can redefine the affinity in any way.

The /proc/cpuinfo file contains information about the processor topology. In
order to simplify the access to the topology, which differs between kernel
versions and Linux distributions, an external API is used as an intermediate
layer for this task. Two APIs were investigated: libtopology from INRIA
Bordeaux and PLPA (portable linux processor affinity library) from the OpenMPI
project. libtopology offers support for different operating systems and also
reports memory settings, whereas PLPA is more lightweight and Linux only.
Because of the licence and approved stability, PLPA (which is used by several
projects including OpenMPI itself) is going to be used. With PLPA a simple
mapping from a logical <socket>,<core> pair to the internal processor ID
(which has to be used in order to set the bitmask) can be done when the
topology is supported. In order to support reporting the availability of SMT,
the proc filesystem is additionally parsed.
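
As an illustration only (not SGE code): a minimal sketch of setting such an
affinity bitmask with sched_setaffinity. The mask below corresponds to 0101,
i.e. logical processors 0 and 2; the mapping from a <socket>,<core> pair to
these processor IDs would be done via PLPA.

    /* Minimal sketch, not SGE code: bind the calling process to logical
     * processors 0 and 2; everything forked/exec'ed afterwards inherits
     * the mask. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);                    /* allow logical processor 0 */
        CPU_SET(2, &mask);                    /* allow logical processor 2 */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* 0 = self */
            perror("sched_setaffinity");
            return 1;
        }
        execlp("sleep", "sleep", "60", (char *)NULL);  /* inherits binding */
        return 1;
    }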
2 Project Overview
------------------

2.1 Project Aim
---------------

The goal is to provide more topology related information about the execution
hosts and to give the user the ability to bind his jobs to specific cores on
the execution system depending on the needs of the application.

2.2 Project Benefit
-------------------

Better performance of parallel applications. Depending on the core binding
strategy and system topology also limited energy saving could be achieved (by
just using a single socket for example, because some power-management is on
socket level).

3 System Architecture
---------------------
3.1 Configuration
-----------------

Sun Grid Engine gets different load values (static and non-static) out of the
box from the execution hosts and reports them to the scheduler and user (they
can be displayed via qhost or qstat, for example). Based on these load values
the user can request resources and the scheduler makes its decisions.
Currently there are no fine-grained load values regarding the specific host
topology. In order to give the user the ability to request specific hosts
regarding their topology and/or to request a special core affinity (core
binding) for his jobs, the following new load values have to be introduced:

Load value 'topology': Reports the topology on Solaris hosts and on supported
(depending on kernel version) Linux hosts, otherwise 'NONE'. The topology is a
string value with the following syntax:

   <topology> ::= 'NONE' | <socket>
   <socket>   ::= 'S' <core> [ <socket> ]
   <core>     ::= 'C' [ <threads> ] [ <core> ]
   <threads>  ::= 'T' [ <threads> ]

Hence each 'S' means a socket and the following 'C's are the amount of cores.
Please be aware that this is enhanced on some architectures with additional
'T's (threads on SMT/CMT architectures) per core. The topology string
currently does not reflect the latency of memory to each CPU (i.e. no
NUMA/non-NUMA differentiation). Examples: "SCCSCC" means a 2 socket host with
2 cores on each socket. "SCCCC" means a one socket machine with 4 cores.
"SCTTCTT" means a one socket machine with 2 cores and hyperthreading (Intel's
name for CMT). A Sun T2 processor, with 8 execution units all of them
supporting 8 threads via chip logic, would be reported as an 'S' followed by
eight "CTTTTTTTT" groups.

Note: Depending on your host setup (BIOS or kernel parameters) a 'C' could
also mean a thread on an SMT/CMT system. [Possible solution for core/thread
differentiation: introduce an SMT/CMT static load value which is per default 1
and, when SMT is on, 2 (or more, depending on the SMT/CMT processor
architecture); it has to be configured by the admin according to the
BIOS/kernel settings. This could be used as a divisor.]
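
As an illustration only (not SGE code): a minimal sketch of deriving
socket/core/thread counts from such a topology string by counting its letters.
On hosts whose string contains no 'T's the thread count would presumably fall
back to the core count.

    /* Minimal sketch, not SGE code: count 'S', 'C' and 'T' characters of a
     * topology string such as "SCCSCC" or "SCTTCTT". */
    #include <stdio.h>

    static void count_topology(const char *topo,
                               int *sockets, int *cores, int *threads)
    {
        *sockets = *cores = *threads = 0;
        for (; *topo != '\0'; ++topo) {
            switch (*topo) {
            case 'S': ++*sockets; break;
            case 'C': ++*cores;   break;
            case 'T': ++*threads; break;
            default:  break;          /* e.g. the string 'NONE' */
            }
        }
    }

    int main(void)
    {
        int s, c, t;
        count_topology("SCTTCTT", &s, &c, &t);
        printf("sockets=%d cores=%d threads=%d\n", s, c, t);   /* 1 2 4 */
        return 0;
    }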
Load value 'm_socket': The total amount of sockets available on a machine. If
the machine has no supported topology it is equal to the existing "cpu" load
value.

Load value 'm_core': The total amount of cores available on a machine. If the
machine has no supported topology it is equal to the existing "cpu" value.

Load value 'm_thread': The total amount of threads the machine supports. In
case of CMT/SMT this could be a multiple of the cores available. If the
machine has no supported topology it is equal to the existing "cpu" value.

With the new load value 'm_core' the installation routine for execution hosts
is changed so that the 'slots' value is, in the default case, the amount of
cores found.

3.2 Functionality
-----------------

3.2.1 New "qsub -binding" parameter
-----------------------------------

In order to give the user the ability to request a special core affinity
needed for his/her application, a new submission parameter ('-binding') has to
be introduced. With this parameter a special setup for the submitted job can
be requested. A specific core is described via a <socket_number>,<core_number>
tuple. Counting begins at 0, so that '0,1' describes the 2nd core on the first
socket. To relieve the user of the burden of generating long lists of
socket-core pairs, the user can request different strategies. A strategy
describes the method by which those pairs are created transparently by the
system.

With the new submission (CLI qsub) parameter '-binding' different core binding
strategies can be selected. This should be used only in conjunction with
exclusive host access for this job. The binding is done on Linux hosts only;
other hosts ignore it, so it is up to the user/admin to request a supported
host. All processes started on one host from the job script have the same
binding mask (which specifies which cores to use or not to use). Doing a
per-process binding is error-prone but could be a future enhancement. When a
PE job is started using the host exclusively, the binding is applied to it,
but the binding is also applied when a normal job script or a binary job is
started. The amount of requested slots does not restrict the amount of cores
the process(es) are bound to, because without binding a process is bound to
every available core.

The following new parameters are allowed (remark: the optional parameter
[env|pe|set] is described below):
'-binding [env|pe|set] linear:<amount>' := Sets the core binding so that the
system tries to use <amount> successive cores on a single socket. If there is
no empty (in terms of already bound jobs) socket available, an already partly
occupied socket which offers the amount of cores is used. If this fails,
consecutive cores on consecutive sockets are used. If this also fails, no
binding is done.

'-binding [env|pe|set] linear:<amount>:[<socket>,]<core>' := Sets the core
binding for the process/script started by qsub to <amount> consecutive cores
starting at core <core> on socket <socket>. If '<socket>,' is not set, the
socket number is calculated from the <core> number. Note that the first core
on the first socket is described as 0,0. If there is a misconfiguration
(<amount> is too high, the [<socket>,]<core> is unavailable, or a collision
occurs), no core binding is done.

'-binding [env|pe|set] striding:<firstcore>-<lastcore>:<n>' := Sets the core
binding in a core-striding manner, meaning that beginning with <firstcore> all
cores up to <lastcore> (inclusive when possible) are used with a step-size of
<n>. If <n> is 2 then every second core is taken, for example (see the sketch
after this list). If there are range problems, no mask is set at all. The core
numbering is global; that means on a 2 socket (with 4 cores each) system the
core denoted as 4 is the first core on the second socket (<1,0>).

'-binding [env|pe|set] striding:<socket>,<core>-<socket>,<core>:<n>' := Sets
the core binding like before, but with the difference that the first and last
core are specified as core numbers on a specific socket.

'-binding [env|pe|set] striding:<amount>:<n>' := Sets the core binding like
before but does not use a specified core and socket number as the first one.
The system searches for <amount> cores with the offset <n> between them which
are not already bound. When no placement is found (because the amount is too
high or in case of collisions), no binding is done.
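
For illustration only (not the actual implementation): a sketch of which cores
a 'striding:<firstcore>-<lastcore>:<n>' request would select and the affinity
mask that would result, assuming for simplicity that the global core number
equals the Linux processor ID (in reality this mapping is resolved via PLPA).

    /* Hedged sketch, not SGE code: build an affinity mask for
     * striding:<firstcore>-<lastcore>:<n>; returns -1 on range problems,
     * in which case no mask should be set at all. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int striding_mask(int firstcore, int lastcore, int step,
                             int total_cores, cpu_set_t *mask)
    {
        if (firstcore < 0 || lastcore >= total_cores ||
            firstcore > lastcore || step < 1)
            return -1;
        CPU_ZERO(mask);
        for (int core = firstcore; core <= lastcore; core += step)
            CPU_SET(core, mask);      /* e.g. 0-3:2 selects cores 0 and 2 */
        return 0;
    }

    int main(void)
    {
        cpu_set_t mask;
        if (striding_mask(0, 3, 2, 4, &mask) == 0)
            printf("core0=%d core1=%d core2=%d core3=%d\n",
                   CPU_ISSET(0, &mask), CPU_ISSET(1, &mask),
                   CPU_ISSET(2, &mask), CPU_ISSET(3, &mask));  /* 1 0 1 0 */
        return 0;
    }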
Note that the core affinity mask is set for the script/process which is
started by Sun Grid Engine on the execution host. The mask is inherited by all
child threads/processes, which means that all subprocesses and threads are
using the same set of cores. (In the Linux OS child processes are allowed to
redefine the core affinity or even use all cores.)

The optional [env|pe|set] defines what is performed on the execution side.
'env' means that an environment variable containing the logical core numbers
determined by the system is set. This is usually used by OpenMP applications
in order to determine on which cores the application is allowed to run. As an
example, for Sun OpenMP applications the $SUNW_MP_PROCBIND environment
variable can be set directly with this content (qsub ... -v
SUNW_MP_PROCBIND=$SGE_BINDING). For OpenMPI jobs scattered over different
hosts and using them exclusively, the input for a 'rankfile' could be produced
by Sun Grid Engine. The 'rankfile' reflects the binding strategy chosen at
submission time by the user. For this, the pe_hostfile is extended in order to
list the host:socket:core triple for each MPI rank.
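
For illustration only: the exact layout of the extended pe_hostfile is not
decided in this draft. Assuming one hypothetical "host socket core" triple per
rank as input, a starter program could emit OpenMPI-style rankfile lines such
as "rank 0=node1 slot=0:1" (the exact rankfile syntax is the concern of the
MPI runtime, not of this specification).

    /* Hedged sketch, not SGE code: turn assumed "host socket core" triples
     * (one per MPI rank, read from stdin) into rankfile lines; the input
     * layout is an assumption of this sketch, not the decided pe_hostfile
     * extension. */
    #include <stdio.h>

    int main(void)
    {
        char host[256];
        int socket_id, core_id, rank = 0;

        while (scanf("%255s %d %d", host, &socket_id, &core_id) == 3) {
            printf("rank %d=%s slot=%d:%d\n", rank, host, socket_id, core_id);
            ++rank;
        }
        return 0;
    }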
3.2.2 New "qhost" switch
------------------------

A new qhost switch is introduced which shows the amount of sockets, cores, and
CPUs as well as the topology string.

3.2.3 Extension of qconf -sep
-----------------------------

The qconf -sep command shows, in addition to the hostname, the amount of
processors, and the architecture, the amount of sockets, cores, and
hardware-supported threads.

3.2.4 Extension of qstat -r
---------------------------

The requested binding strategy is shown by qstat -r.
3.3 Implementation
------------------

The implementation for reporting sockets, cores, and topology is done via the
PLPA (portable linux processor affinity) library on the Linux operating
system. Each socket and core reflects a physically present and available
socket or core. Additionally the /proc filesystem is parsed in order to
determine the availability of SMT, if possible. On the Solaris operating
system the amount of sockets and cores (and on some processors like the T2
also threads) is retrieved via the kernel kstat structure.

The implementation of the -binding [linear|striding|explicit] parameter means
enhancing the command line submission client qsub, the execution daemon, and
the shepherd process which is forked for each job by the execution daemon. The
internal CULL structure JB_Type has to be enhanced with an additional list
(JB_binding) which contains the parameters. The execution daemon then writes
the strategy into the "config" file (which is also enhanced by a new line) for
the shepherd. The shepherd, when started, performs the binding on itself
(because it is sleeping most of the time) and the binding is then inherited by
the started processes (and threads).

An internal data structure which reflects the current load with respect to
used threads, cores, and sockets is held. The structure is similar to the
topology string, but with the difference that execution entities which are
currently busy are shown as a dot. As an example, a two socket machine with
two cores each, running one parallel job with 2 processes on the first socket,
would be displayed as "...SCC" (the topology string is "SCCSCC"). Question:
Should this string containing the 'free' part of the topology also be shown to
the user in order to request it?

In order to extend the pe_hostfile with the chosen socket and core when
'-binding pe' was selected, all requested hosts must have the same topology.
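
For illustration only (not the internal data structure itself): a sketch of
rendering such a 'free' view from the topology string and per-core busy flags,
dotting a socket once all of its cores are busy; threads ('T') are ignored
here for simplicity.

    /* Hedged sketch, not SGE code: render the accounting view of a topology
     * string; busy cores become '.', and a socket becomes '.' once all of
     * its cores seen so far are busy. */
    #include <stdbool.h>
    #include <stdio.h>

    static void render_accounting(const char *topo, const bool *core_busy,
                                  char *out)
    {
        int core = 0;                 /* global core index */
        int socket_pos = 0;           /* position of the current socket's 'S' */
        bool socket_all_busy = true;
        int i;

        for (i = 0; topo[i] != '\0'; ++i) {
            out[i] = topo[i];
            if (topo[i] == 'S') {
                socket_pos = i;
                socket_all_busy = true;
            } else if (topo[i] == 'C') {
                if (core_busy[core++])
                    out[i] = '.';
                else
                    socket_all_busy = false;
                if (socket_all_busy)
                    out[socket_pos] = '.';
            }
        }
        out[i] = '\0';
    }

    int main(void)
    {
        bool busy[4] = { true, true, false, false };  /* first socket fully used */
        char view[16];
        render_accounting("SCCSCC", busy, view);
        printf("%s\n", view);                         /* prints "...SCC" */
        return 0;
    }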
3.4 Other components/interfaces
-------------------------------

The DRMAA library has to accept the binding parameter within the native
specification. Qmon is updated in order to reflect the new binding parameter.

3.5 Examples
------------

Example 1: Show topology.
-------------------------

% qstat -F topology
queuename       qtype resv/used/tot. load_avg arch        states
------------------------------------------------------------------
all.q@chamb     BIPC  0/0/40         0.00     lx26-amd64
------------------------------------------------------------------
all.q@gally2    BIPC  0/0/40         0.05     lx26-amd64
------------------------------------------------------------------
all.q@regen     BIPC  0/0/40         0.25     lx26-amd64
        hl:topology=SCCSCC

Example 2: Show the amount of sockets on each execution host.
--------------------------------------------------------------

% qstat -F socket
queuename       qtype resv/used/tot. load_avg arch
------------------------------------------------------------------
all.q@chamb     BIPC  0/0/40         0.00     lx26-amd64
------------------------------------------------------------------
all.q@gally2    BIPC  0/0/40         0.03     lx26-amd64
------------------------------------------------------------------
all.q@regen     BIPC  0/0/40         0.20     lx26-amd64
        hl:m_socket=2

Example 3: Show the amount of cores on each execution host.
------------------------------------------------------------

% qstat -F core
queuename       qtype resv/used/tot. load_avg arch        states
------------------------------------------------------------------
all.q@chamb     BIPC  0/0/40         0.00     lx26-amd64
------------------------------------------------------------------
all.q@gally2    BIPC  0/0/40         0.04     lx26-amd64
------------------------------------------------------------------
all.q@regen     BIPC  0/0/40         0.16     lx26-amd64
        hl:m_core=4
Example 4: Bind two jobs to different sockets.
----------------------------------------------

(In order to get user-exclusive access to a host, an advance reservation with
a parallel environment could be requested and submitted into.) On a 2 socket
(with 2 cores each) machine an OpenMP job is submitted to the first socket (on
its 2 cores), and an environment variable indicating the amount of threads
OpenMP should use for the job is set.

% qsub -pe testpe 4 -b y -binding linear:2:0,0 -v OMP_NUM_THREADS=4 \
       -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":0,0": beginning on the first socket and
the first core there.)

Bind the next job to the two cores on the other socket:

% qsub -pe testpe 4 -b y -binding linear:2:1,0 -v OMP_NUM_THREADS=4 \
       -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":1,0": beginning on the 2nd socket and the
first core there.)

The same could also be done by submitting the job with the same parameters
twice:

% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
       -l topology=SCCSCC /path/to/openmp_bin
% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
       -l topology=SCCSCC /path/to/openmp_bin

Example 5: Allow the job to use only one core (the first one) on each of the
two sockets.
-----------------------------------------------------------------------------

% qsub -pe testpe 2 -b y -binding striding:0-3:2 -v OMP_NUM_THREADS=2 \
       -l topology=SCCSCC /path/to/openmp_bin

("striding:0-3:2": beginning from core 0 up to core 3, take every second core
[resulting in cores 0 and 2].)
4. Risks
--------

When the host is not requested exclusively and several jobs are using core
binding, collisions could occur which could lead to degraded performance
(because of oversubscribing one core while others have nothing to do). This is
prevented automatically by not performing the binding for the subsequent jobs.
When the user requests a specific binding, this could lead to better or worse
performance depending on the type of application. Hence binding should only be
used by users who know exactly the behaviour of their applications and the
details of the execution hosts. In most cases the operating system scheduler
is doing a good job, especially when hosts are oversubscribed.
5. Future enhancements
----------------------

Topology aware scheduler. For this we need a clear concept of threads, cores,
and sockets, which is currently not available in the system and is integrated
for the first time with this specification. Let the scheduler select free
hosts in terms of the requested <socket>,<core> pairs as a soft or a hard
request. Support for mixed OpenMPI/OpenMP jobs. Including other operating
systems.




More information about the gridengine-users mailing list