[GE users] core binding with Sun Grid Engine

dagru d.gruber at sun.com
Mon Aug 24 08:52:02 BST 2009


Hi,

Since the advent of complex multi-core CPUs and operating system support
for CPU/core binding, several Sun Grid Engine users want to exploit the
possibilities some operating systems offer.  You'll find a draft of the
job-to-core binding specification for the Solaris and Linux operating
systems appended to this email.

Thank you for your feedback!

With kind regards,
Daniel



Job to core binding
-------------------

Version Comments                                                 Date    Author
--------------------------------------------------------------------------------

1.0     Initial Version                                          08/13/2009  DG
1.1     Extending with definitions and architecture specifics    08/14/2009  DG
1.2     Added Solaris kstat support                              08/18/2009  DG
1.3     Added -binding linear:<amount>                           08/18/2009  DG
1.4     Added findings from meeting 08/19                        08/19/2009  DG
1.5     Added findings from meeting 08/20                        08/21/2009  DG

1 Introduction
--------------

With the advent of complex multi-core CPUs and NUMA architectures on cluster nodes,
the operating system scheduling is not always perfect for all kinds of applications.
For parallel applications there are scenarios where it might be best to distribute
the processes/threads across the sockets available on the host; for others it might
be better to place them on a single socket, running on different cores.

In the current Sun Grid Engine architecture there is just the concept of 'slots',
with no indication of whether the slots reflect sockets, cores, or hardware-supported
threads. Performing core binding is also currently not reflected in Sun Grid Engine;
until now it has been up to the administrator and/or user to enhance his/her
applications to perform such a binding.

This specification describes an enhancement for the Solaris and Linux versions of
Sun Grid Engine. Reporting topology information and binding processes to specific
cores on hosts is the foundation for additional fine-grained NUMA settings, like
the configuration of specific memory access patterns on the application side.

1.1 Definitions
----------------

This section defines several terms that are used frequently in this specification
with a specific meaning.

1.1.1 System topology
---------------------

Within this specification the term topology refers to the underlying hardware
structure of a Sun Grid Engine execution host. The topology describes the number
of sockets the machine has (and that are available) and the number of cores (or
threads on SMT/CMT systems) each socket has. In the case of a virtual machine,
the topology of the virtual machine is reported.


1.1.2 Core affinity or core binding
-----------------------------------

The term core affinity (and within this spec also core binding) refers to the
likelihood of a process running on the same processor again after it was
preempted by the OS scheduler. The core affinity (also called processor affinity)
can be influenced on Linux via a system call which takes a bitmask as a parameter.
In this bitmask each bit reflects one core. If a core is turned off (via a
logical bit with the value 0) then the OS scheduler avoids migrating the
process to that core. By default (i.e. without binding) all cores are turned
on, so that the process can be scheduled to an arbitrary core.
On the Solaris operating system processor sets can be used, which define
a set of processors on which only processes explicitly bound to this set
are able to run.

1.1.3 Collisions of core bindings
---------------------------------

Within this specification the term collision (of two or more core bindings)
refers to the circumstance that there exists at least one pair of processes
where both have set a (non-default) core affinity and both processes share
at least one core. Another source of a collision is when the administrator
allows just one process per socket (in order to avoid oversubscribing
socket-related resources) and, in addition to the process already running on
this socket, a second process wants to use free cores on this socket. The
problem with collisions is that core or socket resources can easily be
oversubscribed, resulting in degraded performance while other sockets or cores
remain unused.

1.2 Operating system dependent issues
-------------------------------------

1.2.1 Solaris specific behavior
-------------------------------

Sun Grid Engine currently supports the Solaris processor set feature, which
needs additional administrator configuration. Once a processor set is
established it can be configured at PE level, meaning all processes of the PE
run within this set. Conversely, each processor which is assigned to a
processor set will run only processes that are explicitly bound to that
processor set; the only exception is a process that requires a resource
available only within the processor set, which is then allowed to use this
resource. Not all available processors/cores can be included in processor
sets: at least one processor must remain available for the operating system.
The binding to a processor set is inherited across fork and exec.

Solaris 9 and higher supports the concept of locality groups, which group
processors and memory by latency on NUMA systems. With this, topology-related
information in terms of memory latency can be retrieved, but it is not possible
to get the actual number of physical sockets and cores. For that the kernel
kstat facility has to be used.
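
For illustration only (not part of this specification), the following C sketch
shows how per-CPU socket and core information could be read from kstat. It
assumes that the per-CPU "cpu_info" kstats expose the named values "chip_id" and
"core_id" (as on Solaris 10 and later) and that they are stored as long values;
link with -lkstat.

    #include <stdio.h>
    #include <string.h>
    #include <kstat.h>

    int main(void)
    {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;

        if (kc == NULL)
            return 1;

        /* walk the kstat chain and print the chip (socket) and core ID of each CPU */
        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
            kstat_named_t *chip, *core;

            if (strcmp(ksp->ks_module, "cpu_info") != 0)
                continue;
            if (kstat_read(kc, ksp, NULL) == -1)
                continue;
            chip = kstat_data_lookup(ksp, "chip_id");
            core = kstat_data_lookup(ksp, "core_id");
            if (chip != NULL && core != NULL)
                printf("cpu %d: chip_id=%ld core_id=%ld\n",
                       ksp->ks_instance, (long)chip->value.l, (long)core->value.l);
        }
        kstat_close(kc);
        return 0;
    }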

Processor binding (binding an LWP to a specific processor) can be performed
via the processor_bind system call. These bindings are inherited across fork and
exec system calls, but processor_bind can currently bind a process/thread to
only one core. This differs from the Linux behavior and cannot be used here
(because of the danger of oversubscribing a single core). Therefore Solaris
processor sets have to be used. Processor sets differ from the Linux behavior
in two important points: 1. Not all available cores on a machine can be used
for core binding (at least one core must remain available for the OS). 2. The
submitted job runs exclusively on the cores to which it is bound, i.e. no
unbound job is allowed to use these cores.
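
A minimal C sketch of the processor set approach (for illustration only; the
processor IDs 0 and 1 are placeholders and would in practice come from the
selected <socket>,<core> pairs; creating processor sets requires the appropriate
privileges):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/pset.h>

    int main(void)
    {
        psetid_t pset;

        /* create an empty processor set */
        if (pset_create(&pset) != 0) {
            perror("pset_create");
            return 1;
        }
        /* move two processors into the set; unbound processes no longer use them */
        if (pset_assign(pset, 0, NULL) != 0 || pset_assign(pset, 1, NULL) != 0) {
            perror("pset_assign");
            return 1;
        }
        /* bind the calling process to the set; the binding survives fork and exec */
        if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
            perror("pset_bind");
            return 1;
        }
        return 0;
    }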


1.2.2 Linux specific behavior
------------------------------

The Linux scheduler inherently supports soft affinity (also called natural
affinity): the scheduler avoids migrating processes from one processing unit to
another. Hard processor affinity (meaning the scheduler can be told on which
cores a specific process may or may not run) was introduced in the 2.5 kernel;
patches for the 2.4 kernel are available (for the system call
sched_setaffinity(PID, affinity bitmask)). Newer kernel versions are also
NUMA-aware, and memory access policies can be set with the libnuma library.
The Linux kernel includes a load balancer component which is invoked at specific
intervals (e.g. every 200 ms) or when a run-queue is empty (pull migration).
Each processor has its own scheduler and its own run-queue; the load balancer
tries to equalize the number of runnable tasks between these run-queues. This is
done via process migration.

Setting a specific core affinity/binding is done via an affinity bitmask, which
the sched_setaffinity system call accepts as a parameter. Example: the mask 1011
(least significant bit = first core) means the process will be bound to the first,
second, and fourth core (the scheduler only dispatches the process to these cores,
even if the run-queue of the third core is empty). The default mask (without
affinity) is 1111 on a four-core machine, which means the scheduler can dispatch
the process to any appropriate core. Core affinity is inherited by child processes,
but each process can redefine its affinity in any way.
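
For illustration only (not part of this specification), the bitmask example
above would look as follows when using the glibc cpu_set_t wrapper around
sched_setaffinity:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        /* mask 1011: allow the first, second and fourth core (CPU IDs 0, 1 and 3) */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);
        CPU_SET(3, &mask);

        /* PID 0 means the calling process; child processes inherit the mask */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }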

The /proc/cpuinfo file contains information about the processor topology.
To simplify access to the topology, which differs between kernel versions and
Linux distributions, an external API is used as an intermediate layer for this
task. Two APIs were investigated: libtopology from INRIA Bordeaux and PLPA
(the Portable Linux Processor Affinity library) from the OpenMPI project.
libtopology supports different operating systems and also reports memory
settings, whereas PLPA is more lightweight and Linux-only. Because of its
licence and proven stability, PLPA (which is used by several projects,
including OpenMPI itself) is going to be used. With PLPA a simple mapping from
a logical <socket>,<core> pair to the internal processor ID (which has to be
used in order to set the bitmask) can be done when the topology is supported.
In order to report the availability of SMT, the /proc filesystem is
additionally parsed.
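
A sketch of the intended use of PLPA (for illustration only; it assumes that the
PLPA 1.x calls plpa_map_to_processor_id() and plpa_sched_setaffinity() and the
PLPA_CPU_* macros are available as used below):

    #include <stdio.h>
    #include <plpa.h>

    /* bind the calling process to core <core_id> on socket <socket_id> */
    int bind_to(int socket_id, int core_id)
    {
        int processor_id;
        plpa_cpu_set_t mask;

        /* translate the logical <socket>,<core> pair into the OS processor ID */
        if (plpa_map_to_processor_id(socket_id, core_id, &processor_id) != 0) {
            fprintf(stderr, "topology information not supported on this kernel\n");
            return -1;
        }
        PLPA_CPU_ZERO(&mask);
        PLPA_CPU_SET(processor_id, &mask);
        /* PID 0 = calling process; the mask is inherited by child processes */
        return plpa_sched_setaffinity(0, sizeof(mask), &mask);
    }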

2 Project Overview
------------------

2.1 Project Aim
---------------

The goal is to provide more topology related information about the execution
hosts and to give the user the ability to bind his jobs to specific cores on
the execution system depending on the needs of the application.

2.2 Project Benefit
-------------------

Better performance of parallel applications. Depending on the core binding
strategy and system topology, limited energy savings could also be achieved
(for example by using just a single socket, because some power management
operates at socket level).

3 System Architecture
---------------------

3.1 Configuration
-----------------

Sun Grid Engine gets different load values (static and non-static) out of the box
from the execution hosts and reports them to the scheduler and the user (they can
be displayed via qhost or qstat, for example). Based on these load values the user
can request resources and the scheduler makes its decisions. Currently there are
no fine-grained load values regarding the specific host topology. In order to give
the user the ability to request specific hosts regarding their topology and/or to
request a special core affinity (core binding) for the jobs, the following new
load values have to be introduced:

Load value 'topology': Reports the topology on Solaris hosts and on supported
(depending on the kernel version) Linux hosts, otherwise 'NONE'. The topology is
a string value with the following syntax:

        <topology> ::= 'NONE' | <socket>
        <socket>   ::= 'S' <core> [ <socket> ]
        <core>     ::= 'C' [ <threads> ] [ <core> ]
        <threads>  ::= 'T' [ <threads> ]

Hence each 'S' denotes a socket and the following 'C's denote the cores on that
socket. Please be aware that on some architectures this is extended with
additional 'T's (threads on SMT/CMT architectures) per core.

The topology string currently does not reflect the latency of memory to each
CPU (i.e. no NUMA vs. non-NUMA differentiation).

Examples:
"SCCSCC" means a 2 socket host with 2 cores on each socket.
"SCCCC" means a one socket machine with 4 cores.
"SCTTCTT" means a one socket machine with 2 cores and hyperthreading (Intels name for CMT).
"SCTTTTTTTTCTTTTTTTTCTTTTTTTTTCTTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTTCTTTTTTTTT" would be a
Sun T2 processor with 8 execution units all of them supporting 8 threads via chip logic.

Note: Depending on your host setup (BIOS or kernel parameters) a 'C' could also
represent a thread on an SMT/CMT system.

[Possible solution for core/thread differentiation: introduce an SMT/CMT static
load value which is 1 by default and, when SMT is on, 2 (or more, depending on
the SMT/CMT processor architecture); it has to be configured by the admin
according to the BIOS/kernel settings. This value could be used as a divisor.]
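
For illustration only (not part of this specification), the following C sketch
derives the socket, core, and thread counts reported by the load values below
from such a topology string; counting a core without explicit 'T's as one
hardware thread is an assumption of this sketch.

    #include <stdio.h>

    static void count_topology(const char *topology,
                               int *sockets, int *cores, int *threads)
    {
        int core_threads = -1;   /* 'T's of the current core, -1 = no open core */

        *sockets = *cores = *threads = 0;
        for (;; topology++) {
            if (*topology == 'T' && core_threads >= 0) {
                core_threads++;
                continue;
            }
            /* close the previous core: without explicit 'T's it counts as one thread */
            if (core_threads >= 0)
                *threads += (core_threads > 0) ? core_threads : 1;
            core_threads = -1;
            if (*topology == '\0')
                break;
            if (*topology == 'S')
                (*sockets)++;
            else if (*topology == 'C') {
                (*cores)++;
                core_threads = 0;
            }
        }
    }

    int main(void)
    {
        int s, c, t;

        count_topology("SCTTCTT", &s, &c, &t);
        printf("m_socket=%d m_core=%d m_thread=%d\n", s, c, t);   /* 1, 2, 4 */
        return 0;
    }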

Load value 'm_socket': The total number of sockets available on a machine.
If the machine has no supported topology it is equal to the existing "cpu"
load value.

Load value 'm_core': The total number of cores available on a machine. If
the machine has no supported topology it is equal to the existing "cpu" value.

Load value 'm_thread': The total number of threads the machine supports. In
the case of CMT/SMT this can be a multiple of the number of cores available.
If the machine has no supported topology it is equal to the existing 'cpu' value.

With the new load value 'm_core' the installation routine for execution hosts
is changed so that the 'slots' value defaults to the number of cores found.

3.2 Functionality
-----------------

3.2.1 New "qsub -binding" parameter
-----------------------------------

In order to give the user the ability to request a special core affinity needed
for his/her application, a new submission parameter ('-binding') has to be
introduced. With this parameter a special setup for the submitted job can be
requested. A specific core is described via a <socket_number>,<core_number>
tuple. Counting begins at 0, so '0,1' describes the 2nd core on the first
socket.

To relieve the user of the burden of generating long lists of socket/core pairs,
different strategies can be requested. A strategy describes how those pairs are
created transparently by the system.

With the new submission (CLI qsub) parameter '-binding' different core binding
strategies can be selected. This should be used only in conjunction with exclusive
host access for the job. The binding is done on Linux hosts only; other hosts
ignore it, so it is up to the user/admin to request a supported host. All
processes started on one host from the job script have the same binding mask
(which specifies which cores to use or not to use). Doing a per-process binding
is error-prone but could be a future enhancement. When a PE job is started with
exclusive host usage the binding is applied to it, but binding is also applied
when a normal job script or binary job is started. The number of requested
slots does not restrict the number of cores the process(es) are bound to,
because without binding a process is bound to every available core.

The following new parameters are allowed (remark: the optional parameter
[env|pe|set] is described below):

'-binding [env|pe|set] linear:<amount>' := Sets the core binding so that the
system tries to use <amount> successive cores on a single socket. If there is
no empty socket (in terms of already bound jobs) available, an already partly
occupied socket which offers that number of cores is used. If this fails,
consecutive cores on consecutive sockets are used. If this also fails, no
binding is done.

'-binding [env|pe|set] linear:<amount>:[<socket>,]<core>' := Sets the core binding
for the process/script started by qsub to <amount> consecutive cores starting
at core <core> on socket <socket>. If '<socket>,' is not given the socket number
is calculated from the <core> number. Note that the first core on the first
socket is described as 0,0. If there is a misconfiguration (<amount> is too high,
the [<socket>,]<core> is unavailable, or a collision occurs), no core binding is
done.

'-binding [env|pe|set] striding:<firstcore>-<lastcore>:<n>' := Sets the core
binding in a striding manner, meaning that beginning with <firstcore> all cores
up to <lastcore> (inclusive, when possible) are used with a step size of <n>.
If <n> is 2, for example, every second core is taken. If there are range
problems no mask is set at all. The core numbering is global, which means that
on a 2-socket system (with 4 cores each) the core denoted as 4 is the first
core on the second socket (<1,0>).

'-binding [env|pe|set] striding:<socket>,<core>-<socket>,<core>:<n>' := Sets the
core binding as before, but with the difference that the first and last core
are specified as core numbers on a specific socket.

'-binding [env|pe|set] striding:<amount>:<n>' := Sets the core binding as before
but does not use a specified socket and core number as the starting point. The
system searches for <amount> cores, with an offset of <n> between them, which
are not already bound. When no placement is found (because <amount> is too high
or because of collisions), no binding is done.
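
The following C sketch (for illustration only; it assumes a homogeneous topology
with the same number of cores on every socket) shows how the globally numbered
cores of a 'striding:<firstcore>-<lastcore>:<n>' request translate into
<socket>,<core> pairs:

    #include <stdio.h>

    /* print the <socket>,<core> pairs selected by striding:<firstcore>-<lastcore>:<step>;
       returns the number of selected cores or -1 on a range problem
       (in which case no binding mask would be set at all) */
    static int expand_striding(int firstcore, int lastcore, int step,
                               int sockets, int cores_per_socket)
    {
        int total = sockets * cores_per_socket;
        int selected = 0;
        int core;

        if (firstcore < 0 || lastcore >= total || firstcore > lastcore || step < 1)
            return -1;

        for (core = firstcore; core <= lastcore; core += step) {
            printf("<%d,%d>\n", core / cores_per_socket, core % cores_per_socket);
            selected++;
        }
        return selected;
    }

    int main(void)
    {
        /* "striding:0-3:2" on a host with 2 sockets and 2 cores each -> <0,0> and <1,0> */
        return expand_striding(0, 3, 2, 2, 2) > 0 ? 0 : 1;
    }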

Note that the core affinity mask is set for the script/process which is started
by Sun Grid Engine on the execution host. The mask is inherited by all child
threads/processes, which means that all subprocesses and threads use the same
set of cores. (On Linux, child processes are allowed to redefine the core
affinity or even to use all cores.)

The optional [env|pe|set] defines what is performed on the execution side. 'env'
means that an environment variable containing the logical core numbers determined
by the system is set. This is usually used by OpenMP applications in order to
determine on which cores the application is allowed to run. As an example, for
Sun OpenMP applications the $SUNW_MP_PROCBIND environment variable can be set
directly with this content (qsub ... -v SUNW_MP_PROCBIND=$SGE_BINDING).

For OpenMPI jobs scattered across different hosts and using them exclusively,
the input for a 'rankfile' could be produced by Sun Grid Engine. The 'rankfile'
reflects the binding strategy chosen by the user at submission time. For this,
the pe_hostfile is extended to list the host:socket:core triple for each
MPI rank.

3.2.1 New "qhost" switch
------------------------

A new qhost switch is introduced which shows the number of sockets, cores,
and CPUs as well as the topology string.

3.2.3 Extension of qconf -sep
-----------------------------

The qconf -sep command shows, in addition to the hostname, the number of
processors, and the architecture, the number of sockets, cores, and
hardware-supported threads.


3.2.4 Extension of qstat -r
---------------------------

The requested binding strategy is shown by qstat -r.

3.3 Implementation
------------------

The implementation for reporting sockets, cores, and topology is done via the
PLPA (Portable Linux Processor Affinity) library on the Linux operating system.
Each socket and core reflects a physically present and available socket or core.
Additionally, the /proc filesystem is parsed in order to determine the
availability of SMT where possible. On the Solaris operating system the number
of sockets and cores (and on some processors, like the T2, also threads) is
retrieved via the kernel kstat facility.

The implementation of the -binding [linear|striding|explicit] parameter means
enhancing the command line submission client qsub, the execution daemon, and the
shepherd process, which is forked for each job by the execution daemon. The
internal CULL structure JB_Type has to be enhanced with an additional list
(JB_binding) which contains the parameters. The execution daemon then writes
the strategy into the "config" file (which is also extended by a new line)
for the shepherd. The shepherd performs the binding on itself when it is started
(because it is sleeping most of the time), and the binding is then inherited
by the started processes (threads).

An internal data structure which reflects the current load with respect to used
threads, cores, and sockets is maintained. The structure is similar to the
topology string, with the difference that execution entities which are currently
busy are shown as a dot. For example, a two-socket machine with two cores each,
running one parallel job with 2 processes on the first socket, would be
displayed as "...SCC" (the topology string is "SCCSCC").

Question: Should this string, containing the 'free' part of the topology,
also be shown to the user so that it can be requested?
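
For illustration only (not part of this specification), a C sketch of how the
dotted accounting string from the example above could be produced; the helper
mark_socket_busy and its semantics (marking a whole socket as busy) are
hypothetical.

    #include <stdio.h>

    /* replace the 'S' of socket <socket> and all of its cores/threads with dots */
    static void mark_socket_busy(char *accounting, int socket)
    {
        int current = -1;
        int i;

        for (i = 0; accounting[i] != '\0'; i++) {
            if (accounting[i] == 'S')
                current++;
            if (current == socket)
                accounting[i] = '.';
            else if (current > socket)
                break;
        }
    }

    int main(void)
    {
        char accounting[] = "SCCSCC";

        mark_socket_busy(accounting, 0);
        printf("%s\n", accounting);   /* prints "...SCC" */
        return 0;
    }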

In order to extend the pe_hostfile with the chosen socket and core when
'-binding pe' is selected, all requested hosts must have the same topology.

3.4 Other components/interfaces
-------------------------------

The DRMAA library has to accept the binding parameter within the native
specification.
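
A sketch of how a DRMAA (C binding) application could pass the new parameter
through the native specification (for illustration only; it uses the DRMAA 1.0
C binding names, and the strategy "linear:2" and the job path are just examples):

    #include <stdio.h>
    #include "drmaa.h"

    int main(void)
    {
        char error[DRMAA_ERROR_STRING_BUFFER];
        char jobid[DRMAA_JOBNAME_BUFFER];
        drmaa_job_template_t *jt = NULL;

        if (drmaa_init(NULL, error, sizeof(error) - 1) != DRMAA_ERRNO_SUCCESS) {
            fprintf(stderr, "drmaa_init: %s\n", error);
            return 1;
        }
        drmaa_allocate_job_template(&jt, error, sizeof(error) - 1);
        drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/path/to/openmp_bin",
                            error, sizeof(error) - 1);
        /* the binding request travels in the native specification */
        drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-binding linear:2",
                            error, sizeof(error) - 1);
        if (drmaa_run_job(jobid, sizeof(jobid) - 1, jt, error, sizeof(error) - 1)
                == DRMAA_ERRNO_SUCCESS)
            printf("submitted job %s\n", jobid);
        drmaa_delete_job_template(jt, error, sizeof(error) - 1);
        drmaa_exit(error, sizeof(error) - 1);
        return 0;
    }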

Qmon is updated in order to reflect the new binding parameter.


3.5 Examples
------------

Example 1: Show topology.
-------------------------

% qstat -F topology
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:topology=SCCCC
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.05     lx26-amd64
        hl:topology=NONE
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.25     lx26-amd64
        hl:topology=SCCSCC


Example 2: Show the number of sockets on each execution host.
-------------------------------------------------------------

% qstat -F socket

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:m_socket=1
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.03     lx26-amd64
        hl:m_socket=0
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.20     lx26-amd64
        hl:m_socket=2



Example 3: Show the number of cores on each execution host.
-----------------------------------------------------------

% qstat -F core

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:m_core=4
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.04     lx26-amd64
        hl:m_core=0
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.16     lx26-amd64
        hl:m_core=4


Example 4: Bind two jobs to different sockets.
----------------------------------------------

(In order to get exclusive user access to a host, an advance reservation with a
parallel environment could be requested and the jobs submitted into it.)

On a 2-socket machine (with 2 cores each) an OpenMP job is submitted to the first
socket (on its 2 cores), and an environment variable indicating the number of
threads OpenMP should use for the job is set.

% qsub -pe testpe 4 -b y -binding linear:2:0,0 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":0,0" beginning on first socket and first core there).

Bind the next job to the two cores on the other socket:

% qsub -pe testpe 4 -b y -binding linear:2:1,0 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":1,0" beginning on 2nd socket and first core there).

The same could also be achieved by submitting the job twice with the same parameters:

% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin
% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

Example 5: Allow the job to use only one core (the first one) on each of the two sockets.
-----------------------------------------------------------------------------------------

% qsub -pe testpe 2 -b y -binding striding:0-3:2 -v OMP_NUM_THREADS=2 \
        -l topology=SCCSCC /path/to/openmp_bin

("striding:0-3:2": beginning from core 0 to core 3 take every second core
[resulting in 0 and 2]).


4. Risks
---------

When the host is not requested exclusively and several jobs use core binding,
collisions could occur, which could lead to degraded performance (because one
core gets oversubscribed while others have nothing to do). This is prevented
automatically by not performing the binding for the subsequent jobs.
When the user requests a specific binding this could lead to better or
worse performance, depending on the type of application. Hence binding
should only be used by users who know exactly the behaviour of their
applications and the details of the execution hosts. In most cases the
operating system scheduler is doing a good job, especially when hosts
are oversubscribed.

5. Future enhancements
----------------------

A topology-aware scheduler. This requires a clear concept of threads, cores, and
sockets, which is currently not available in the system and is integrated for the
first time with this specification. The scheduler could then select free hosts in
terms of the requested <socket>,<core> pairs as a soft or a hard request.

Support for mixed OpenMPI/OpenMP jobs.

Including other operating systems.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213897
