[GE users] core binding with Sun Grid Engine

dagru d.gruber at sun.com
Thu Aug 27 08:30:36 BST 2009


Hi Reuti,

On 08/25/09 23:57, reuti wrote:

Hi,

just some thoughts I have about it: to me the -binding option looks
like putting a lot of burden on the end user to know something about each
node's configuration. I would prefer, in a first step, to have the
option to define in a queue entry just "bind the processes to a
granted core yes/no". This would mean starting with item 5 :-/ . This
proposal looks to be suited only for exclusive usage of nodes, as I
would also fear item 4, and for that reason I wouldn't put too many
features into it.


Integrating a queue entry in order to activate/deactivate binding could be done,
so you don't have to fear item 4. But I'm not sure whether you want to 'force'
core binding when it is activated. The notion of 'collisions' was introduced
because of the performance penalty, which is why only one job may be bound to
the same core. But the problem with binding itself (on Linux) is always there,
because you don't need special privileges. Hence a user can enhance his/her job
script/program to do whatever he/she wants at any time; from the OS point of
view he/she is only making the program behave more nicely. And you are right,
currently this is intended only for exclusive host usage - at the moment.


The syntax "-binding striding:0-3:2" could be more generic like in an
RQS: "-binding cores,sockets"

"-binding 2,3" - two cores on any three sockets (i.e. slots=6)
"-binding 2,{*}" - two cores on each socket in the machine
"-binding 2,*" - two cores on any socket (this will give only two
slots here)
"-binding {2},*" - two cores on any socket (this will give only two
slots here) and use the socket exclusive even when there are 4 cores
"-binding *,1" - all free cores on any socket
"-binding *,2" - all free cores on any two sockets
"-binding {*},2" - all cores on any two sockets (means exclusive use
of two complete sockets)
"-binding {2},{*}" - two cores on each socket and use node exclusive
in the end


The syntax seems to cover a lot of requests, but you cannot specify, for example,
a starting point and a step-size (which is what we had in mind initially). Maybe
this is not needed either, but there are some architectures where it could make
sense (when striding is relaxed to thread-striding on a T2, for example). Anyway,
we will keep your suggestion in mind. Maybe this could be an additional, very
flexible, request.


To ease things for the end user, the -binding could be put in a
default request. But then we would need different default requests
for different queues/hosts for now.

-- Reuti


On 24.08.2009, at 09:52, dagru wrote:



Hi,

since the advent of complex multi-core CPUs and operating system support
for CPU/core binding, several Sun Grid Engine users want to exploit the
possibilities some operating systems offer. You'll find a draft of the
job2core-binding specification for the Solaris and Linux operating systems
appended to this email.

Thank you for your feedback!

With kind regards,
Daniel



Job to core binding
-------------------

Version  Comments                                               Date        Author
-----------------------------------------------------------------------------------
1.0      Initial Version                                        08/13/2009  DG
1.1      Extending with definitions and architecture specifics  08/14/2009  DG
1.2      Added Solaris kstat support                            08/18/2009  DG
1.3      Added -binding linear:<amount>                         08/18/2009  DG
1.4      Added findings from meeting 08/19                      08/19/2009  DG
1.5      Added findings from meeting 08/20                      08/21/2009  DG

1 Introduction
--------------

With the advent of complex multi-core CPUs and NUMA architectures on cluster
nodes, the operating system scheduling is not always perfect for all kinds of
applications. For parallel applications there are scenarios where it might be
best to distribute the processes/threads across the different sockets available
on the host; for others it might be better to place them on a single socket,
running on different cores.

In the current Sun Grid Engine architecture there is just the concept of
'slots', with no notion of whether the slots reflect sockets, cores, or
hardware-supported threads. Performing core binding is also currently
unreflected in Sun Grid Engine. Until now it has been up to the administrator
and/or user to enhance his/her applications to perform such a binding.

This specification describes an enhancement for the Solaris and Linux versions
of Sun Grid Engine. Reporting topology information and binding processes to
specific cores on hosts is the foundation for additional fine-grained NUMA
settings, such as the configuration of specific memory access patterns on the
application side.

1.1 Definitions
----------------

This section defines several terms that are used frequently in this
specification with a specific meaning.

1.1.1 System topology
---------------------

Within this specification the term topology refers to the underlying hardware
structure of a Sun Grid Engine execution host. The topology describes the
number of sockets the machine has (and which of them are available) and the
number of cores (or threads on SMT/CMT architectures) each socket has. In the
case of a virtual machine, the topology of the virtual machine is reported.


1.1.2 Core affinity or core binding
-----------------------------------

The term core affinity (and, within this spec, also core binding) refers to
the likelihood of a process running on the same processor again after it was
preempted by the OS scheduler. The core affinity (also called processor
affinity) can be influenced on Linux via a system call which takes a bitmask
as parameter. In this bitmask each bit reflects one core. If a core is turned
off (via a logical bit with the value 0) then the OS scheduler avoids migrating
the process to that core. By default (i.e. without binding) all cores are
turned on, so that the process can be scheduled to an arbitrary core.
In the Solaris operating system processor sets can be used, which define a set
of processors on which only processes explicitly bound to this set are able
to run.

1.1.3 Collisions of core bindings
---------------------------------

Within this specification the term collision (of two or more core bindings)
refers to the circumstance that there is at least one pair of processes which
both have a (non-default) core affinity set and which share at least one core.
Another source of a collision is when the administrator allows just one process
per socket (in order to avoid oversubscribing socket-related resources) and, in
addition to the process already running on this socket, a second process wants
to use free cores on this socket. The problem with collisions is that core or
socket resources can easily be oversubscribed, resulting in degraded
performance while other sockets or cores remain unused.
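
To make the definition concrete, the following minimal sketch (illustrative
only, not Sun Grid Engine code; the helper name is invented for this example)
checks whether two Linux affinity masks of processes that both set a
non-default affinity share at least one core:

#define _GNU_SOURCE
#include <sched.h>

/* Returns 1 when the two (non-default) affinity masks share at least one
 * core, i.e. when the bindings collide in the sense of section 1.1.3. */
static int masks_collide(const cpu_set_t *a, const cpu_set_t *b, int ncores)
{
    int core;

    for (core = 0; core < ncores; core++)
        if (CPU_ISSET(core, a) && CPU_ISSET(core, b))
            return 1;
    return 0;
}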

1.2 Operating system dependent issues
-------------------------------------

1.2.1 Solaris specific behavior
-------------------------------

Sun Grid Engine currently supports the Solaris processor set feature, which
needs additional administrator configuration. Once a processor set is
established it can be configured on PE level, meaning that all processes of the
PE run within this set. On the other hand, each processor which is assigned to
a processor set will run only processes that are explicitly bound to that
processor set. The only exception is when a process requires a resource that
is available only within the processor set; then it is allowed to use this
resource. Not all available processors/cores can be included in processor
sets; at least one processor must remain available for the operating system.
The binding to a processor set is inherited across fork and exec.

Solaris 9 and higher supports the concept of locality groups, which build
latency groups on NUMA systems. With this, topology-related information in
terms of memory latency can be retrieved, but it is not possible to get the
actual number of physical sockets and cores. For that, the kernel kstat
structure has to be used.

Processor binding (binding an LWP to a specific processor) can be performed
via the processor_bind system call. Bindings are inherited across fork and
exec system calls, but with this call it is currently possible to bind a
process/thread to only one core, which differs from the Linux behavior and
cannot be used here (because of the danger of oversubscribing one core).
Therefore Solaris processor sets have to be used. Processor sets differ from
the Linux behavior in two important points: 1. Not all available cores on a
machine can be used for core binding (at least one core must remain available
for the OS). 2. The submitted job runs exclusively on the cores to which it is
bound, meaning that no unbound job is allowed to use these cores.
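
As an illustration of how such a processor set could be used, here is a
minimal sketch (not Sun Grid Engine code; it needs the appropriate privileges,
and the CPU IDs 1 and 2 are placeholders that would normally come from the
topology bookkeeping described later in this specification):

#include <sys/pset.h>
#include <sys/procset.h>
#include <stdio.h>

int main(void)
{
    psetid_t pset;

    if (pset_create(&pset) != 0) {            /* create a new, empty set   */
        perror("pset_create");
        return 1;
    }
    if (pset_assign(pset, 1, NULL) != 0 ||    /* move CPU 1 into the set   */
        pset_assign(pset, 2, NULL) != 0) {    /* move CPU 2 into the set   */
        perror("pset_assign");
        return 1;
    }
    /* bind the calling process to the set; children inherit the binding */
    if (pset_bind(pset, P_PID, P_MYID, NULL) != 0) {
        perror("pset_bind");
        return 1;
    }
    return 0;
}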


1.2.2 Linux specific behavior
------------------------------

The Linux scheduler inherently supports soft affinity (also called natural
affinity), meaning the scheduler avoids migrating processes from one processing
unit to another. In the 2.5 kernel, hard processor affinity was introduced
(meaning the scheduler can be told on which cores a specific process may or
may not run). Patches for the 2.4 kernel are available (for the
sched_setaffinity(PID, affinity bitmask) system call). Newer kernel versions
are also NUMA aware, and memory access policies can be set with the libnuma
library. The Linux kernel includes a load_balancer component, which is called
at specific intervals (e.g. every 200 ms) or when a run-queue is empty (pull
migration). Each processor has its own scheduler and its own run-queue. The
load_balancer tries to equalize the number of runnable tasks between these
run-queues. This is done via process migration.

Setting a specific core affinity/binding is done via an affinity bitmask,
which the sched_setaffinity system call accepts as a parameter. Example: 1011
means the process will be bound to the first, second, and fourth core (the
scheduler only dispatches the process to the first, second, or fourth core,
even if the run-queue of core three is empty). The default mask (without
affinity) is 1111 (on a four-core machine), which means the scheduler can
dispatch the process to any appropriate core. Core affinity is inherited by
child processes, but each process can redefine its affinity in any way.
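
As a concrete illustration of the bitmask described above, the following
minimal sketch (illustrative only, not Sun Grid Engine code) sets the "1011"
mask, i.e. allows the first, second, and fourth core on a four-core host for
the calling process and everything it forks afterwards:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);       /* start with all cores turned off               */
    CPU_SET(0, &mask);     /* first core                                    */
    CPU_SET(1, &mask);     /* second core                                   */
    CPU_SET(3, &mask);     /* fourth core; the third core stays turned off  */

    /* pid 0 means "the calling process"; child processes inherit the mask */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}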

The /proc/cpuinfo file contains information about the processor topology.
In order to simplify access to the topology, which differs between kernel
versions and Linux distributions, an external API is used as an intermediate
layer for this task. Two APIs were investigated: libtopology from INRIA
Bordeaux and PLPA (the portable Linux processor affinity library) from the
OpenMPI project. libtopology offers support for different operating systems
and also reports memory settings, whereas PLPA is more lightweight and Linux
only. Because of its licence and proven stability, PLPA (which is used by
several projects, including OpenMPI itself) is going to be used. With PLPA a
simple mapping from a logical <socket>,<core> pair to the internal processor
ID (which has to be used in order to set the bitmask) can be done when the
topology is supported. In order to also report the availability of SMT, the
proc filesystem is additionally parsed.
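
A minimal sketch of the intended mapping, assuming the PLPA 1.x interface
(plpa.h); the function and macro names are taken from that interface and
should be verified against the PLPA version that is actually bundled:

#include <plpa.h>
#include <stdio.h>

int main(void)
{
    int supported = 0, processor_id = -1;
    plpa_cpu_set_t mask;

    /* the mapping only works when PLPA can read this kernel's topology */
    if (plpa_have_topology_information(&supported) != 0 || !supported) {
        fprintf(stderr, "no topology information available\n");
        return 1;
    }
    /* logical <socket>,<core> pair (here 1,0) -> internal processor ID */
    if (plpa_map_to_processor_id(1, 0, &processor_id) != 0) {
        fprintf(stderr, "mapping failed\n");
        return 1;
    }
    PLPA_CPU_ZERO(&mask);
    PLPA_CPU_SET(processor_id, &mask);
    /* same semantics as sched_setaffinity(); pid 0 = calling process */
    return plpa_sched_setaffinity(0, sizeof(mask), &mask);
}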

2 Project Overview
------------------

2.1 Project Aim
---------------

The goal is to provide more topology related information about the
execution
hosts and to give the user the ability to bind his jobs to specific
cores on
the execution system depending on the needs of the application.

2.2 Project Benefit
-------------------

Better performance for parallel applications. Depending on the core binding
strategy and system topology, limited energy savings could also be achieved
(by using only a single socket, for example, because some power management
operates at socket level).

3 System Architecture
---------------------

3.1 Configuration
-----------------

Sun Grid Engine gets different load values (static and non-static)
out-of-the-box from the execution hosts and reports them to the scheduler and
the user (they can be displayed via qhost or qstat, for example). Based on
these load values the user can request resources and the scheduler makes its
decisions. Currently there are no fine-grained load values regarding the
specific host topology. In order to give the user the ability to request
specific hosts based on their topology and/or to request a special core
affinity (core binding) for jobs, the following new load values have to be
introduced:

Load value 'topology': Reports the topology on Solaris hosts and on supported
(depending on kernel version) Linux hosts, otherwise 'NONE'. The topology is a
string value with the following syntax:

        <topology> ::= 'NONE' | <socket>
        <socket>   ::= 'S' <core> [ <socket> ]
        <core>     ::= 'C' [ <threads> ] [ <core> ]
        <threads>  ::= 'T' [ <threads> ]

Hence each 'S' means a socket, and the following 'C's are the cores on that
socket. Please be aware that on some architectures this is enhanced with
additional 'T's (threads on SMT/CMT architectures) per core.

The topology string currently does not reflect the latency of memory to each
CPU (i.e. NUMA vs. non-NUMA differentiation).

Examples:
"SCCSCC" means a 2-socket host with 2 cores on each socket.
"SCCCC" means a one-socket machine with 4 cores.
"SCTTCTT" means a one-socket machine with 2 cores and hyperthreading
(Intel's name for SMT).
"SCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTT"
would be a Sun T2 processor with 8 cores, each of them supporting 8 threads
via chip logic.

Note: Depending on your host setup (BIOS or kernel parameters) a 'C' could
also mean a thread on an SMT/CMT system.

[Possible solution for core/thread differentiation: introduce a static SMT/CMT
load value which is 1 by default and, when SMT is switched on, 2 (or more,
depending on the SMT/CMT processor architecture); it has to be configured by
the admin according to the BIOS/kernel settings. This could be used as a
divisor.]
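
The number of sockets, cores, and hardware threads that appear in such a
topology string can be counted directly; a minimal, illustrative sketch (not
Sun Grid Engine code):

#include <stdio.h>
#include <string.h>

/* Count the 'S', 'C' and 'T' entries of a topology string such as
 * "SCCSCC" or "SCTTCTT"; 'NONE' means no supported topology. */
static void count_topology(const char *topo, int *sockets, int *cores,
                           int *threads)
{
    *sockets = *cores = *threads = 0;
    if (topo == NULL || strcmp(topo, "NONE") == 0)
        return;
    for (; *topo != '\0'; topo++) {
        if (*topo == 'S') (*sockets)++;
        else if (*topo == 'C') (*cores)++;
        else if (*topo == 'T') (*threads)++;
    }
}

int main(void)
{
    int s, c, t;
    count_topology("SCTTCTT", &s, &c, &t);
    printf("sockets=%d cores=%d threads=%d\n", s, c, t);  /* 1, 2 and 4 */
    return 0;
}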

Load value 'm_socket': The total number of sockets available on a machine.
If the machine has no supported topology it is equal to the existing 'cpu'
load value.

Load value 'm_core': The total number of cores available on a machine. If
the machine has no supported topology it is equal to the existing 'cpu' value.

Load value 'm_thread': The total number of threads the machine supports. In
the case of CMT/SMT this can be a multiple of the number of cores available.
If the machine has no supported topology it is equal to the existing 'cpu'
value.

With the new load value 'm_core', the installation routine for execution
hosts is changed so that, by default, the 'slots' value equals the number of
cores found.

3.2 Functionality
-----------------

3.2.1 New "qsub -binding" parameter
-----------------------------------

In order to give the user the ability to request a special core affinity
needed for his/her application, a new submission parameter ('-binding') is
introduced. With this parameter a special setup for the submitted job can be
requested. A specific core is described via a <socket_number>,<core_number>
tuple. Counting begins at 0, so '0,1' describes the 2nd core on the first
socket.

To spare the user the burden of generating long lists of socket-core pairs,
different strategies can be requested. A strategy describes the method by
which those pairs are created transparently by the system.

With the new submission (CLI qsub) parameter '-binding', different core
binding strategies can be selected. This should be used only in conjunction
with exclusive host access for the job. The binding is done on Linux hosts
only; other hosts ignore it, so it is up to the user/admin to request a
supported host. All processes started on one host from the job script have
the same binding mask (which specifies which cores to use or not to use).
Doing a per-process binding is error-prone but could be a future enhancement.
When a PE job is started using a host exclusively, the binding is applied to
it, but the binding is also applied when a normal job script or binary job is
started. The number of requested slots does not restrict the number of cores
the process(es) are bound to, because without binding a process is bound to
every available core.

The following new parameters are allowed (remark: the optional
parameter
[env|pe|set] is described below):

'-binding [env|pe|set] linear:<amount>' := Sets the core binding so
that
<amount> of successive cores on a single socket are tried to be
used. If there
is no empty (in terms of already bound jobs) socket available an
already partly
occupied socket which offers the amount of cores is used. If this
does fail
consecutive cores on consecutive sockets are going to be used. If
this fails no
binding is done.
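
A minimal, illustrative sketch of this search order (not the actual execution
daemon code; the free_cores matrix stands in for the internal accounting
described in section 3.3):

#include <stdio.h>

#define SOCKETS 2
#define CORES   4

/* Find <amount> consecutive free cores on one socket, preferring a
 * completely empty socket over a partly occupied one. Returns the socket,
 * or -1 when no single socket fits; *core_out gets the first core. */
static int find_linear(int free_cores[SOCKETS][CORES], int amount,
                       int *core_out)
{
    int pass, s, c, start, i;

    for (pass = 0; pass < 2; pass++) {        /* pass 0: empty sockets only */
        for (s = 0; s < SOCKETS; s++) {
            int used = 0;
            for (c = 0; c < CORES; c++)
                if (!free_cores[s][c]) used = 1;
            if (pass != used)                 /* wrong kind of socket for   */
                continue;                     /* this pass                  */
            for (start = 0; start + amount <= CORES; start++) {
                int fits = 1;
                for (i = 0; i < amount; i++)
                    if (!free_cores[s][start + i]) { fits = 0; break; }
                if (fits) { *core_out = start; return s; }
            }
        }
    }
    return -1;  /* fall back to consecutive cores across sockets, else none */
}

int main(void)
{
    /* socket 0 has its first two cores bound already, socket 1 is empty */
    int free_cores[SOCKETS][CORES] = { { 0, 0, 1, 1 }, { 1, 1, 1, 1 } };
    int core, socket = find_linear(free_cores, 2, &core);
    printf("socket=%d first core=%d\n", socket, core);  /* socket=1, core=0 */
    return 0;
}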

'-binding [env|pe|set] linear:<amount>:[<socket>,]<core>' := Sets the core
binding for the process/script started by qsub to <amount> consecutive cores
starting at core <core> on socket <socket>. If '<socket>,' is not given, the
socket number is calculated from the <core> number. Note that the first core
on the first socket is described as 0,0. If there is a misconfiguration
(<amount> is too high, the [<socket>,]<core> is unavailable, or a collision
occurs), no core binding is done.

'-binding [env|pe|set] striding:<firstcore>-<lastcore>:<n>' := Sets the core
binding in a striding manner, meaning that beginning with <firstcore> all
cores up to <lastcore> (inclusive when possible) are used with a step-size of
<n>. If <n> is 2, for example, every second core is taken. If there are range
problems, no mask is set at all. The core numbering is global; that means on
a 2-socket system (with 4 cores each) the core denoted as 4 is the first core
on the second socket (<1,0>).

'-binding [env|pe|set] striding:<socket>,<core>-<socket>,<core>:<n>' := Sets
the core binding like before, but with the difference that the first and last
cores are specified as core numbers on a specific socket.

'-binding [env|pe|set] striding:<amount>:<n>' := Sets the core binding like
before, but without a specified core and socket number as the starting point.
The system searches for <amount> cores with an offset of <n> between them
which are not already bound. When no placement is found (because <amount> is
too high or in case of collisions), no binding is done.
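
A minimal, illustrative sketch of how the affinity mask for the range/step
variant could be built on Linux (not the actual shepherd code):

#define _GNU_SOURCE
#include <sched.h>

/* Build the mask for "striding:<firstcore>-<lastcore>:<n>" using global
 * core numbering; e.g. striding:0-3:2 enables cores 0 and 2. Returns -1
 * on range problems, in which case no mask is set at all. */
static int striding_mask(int firstcore, int lastcore, int n,
                         int total_cores, cpu_set_t *mask)
{
    int core;

    if (firstcore < 0 || lastcore >= total_cores || firstcore > lastcore ||
        n < 1)
        return -1;
    CPU_ZERO(mask);
    for (core = firstcore; core <= lastcore; core += n)
        CPU_SET(core, mask);
    return 0;
}

The resulting mask would then be passed to sched_setaffinity() as shown in
the earlier example in section 1.2.2.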

Note that the core affinity mask is set for the script/process which is
started by Sun Grid Engine on the execution host. The mask is inherited by all
child threads/processes, which means that all subprocesses and threads use the
same set of cores. (On Linux, child processes are allowed to redefine the core
affinity or even use all cores again.)

The optional [env|pe|set] defines what is performed on the execution side.
'env' means that an environment variable containing the logical core numbers
determined by the system is set. This is usually used by OpenMP applications
in order to determine on which cores the application is allowed to run. As an
example, for Sun OpenMP applications the SUNW_MP_PROCBIND environment variable
can be set directly with this content (qsub ... -v SUNW_MP_PROCBIND=$SGE_BINDING).

For OpenMPI jobs scattered over different hosts and using them exclusively,
the input for a 'rankfile' could be produced by Sun Grid Engine. The
'rankfile' reflects the binding strategy chosen by the user at submission
time. For this, the pe_hostfile is extended to list the host:socket:core
triple for each MPI rank.

3.2.1 New "qhost" switch
------------------------

A new qhost switch is introduced which shows the number of sockets, cores,
and CPUs as well as the topology string.

3.2.3 Extension of qconf -sep
-----------------------------

The qconf -sep command shows, in addition to the hostname, number of
processors, and architecture, the number of sockets, cores, and
hardware-supported threads.


3.2.4 Extension of qstat -r
---------------------------

The requested binding strategy is shown by qstat -r.

3.3 Implementation
------------------

The implementation for reporting sockets, cores, and topology is done via the
PLPA (portable Linux processor affinity) library on the Linux operating
system. Each socket and core reflects a physically present and available
socket or core. Additionally, the /proc filesystem is parsed in order to
determine the availability of SMT where possible. On the Solaris operating
system the number of sockets and cores (and, on some processors like the T2,
also threads) is retrieved via the kernel kstat structure.

The implementation of the -binding [linear|striding|explicit] parameter means
enhancing the command line submission client qsub, the execution daemon, and
the shepherd process which is forked for each job by the execution daemon. The
internal CULL structure JB_Type has to be enhanced with an additional list
(JB_binding) which contains the parameters. The execution daemon then writes
the strategy into the "config" file (which is extended by a new line) for the
shepherd. The shepherd performs the binding on itself when started (because it
is sleeping most of the time), and the binding is then inherited by the
started processes (threads).

An internal data structure is held which reflects the current load with
respect to used threads, cores, and sockets. The structure is similar to the
topology string, with the difference that execution entities which are
currently busy are shown as a dot. As an example, a two-socket machine with
two cores each, running one parallel job with 2 processes on the first socket,
would be displayed as "...SCC" (the topology string being "SCCSCC").
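
A minimal, illustrative sketch of such accounting (not the actual execution
daemon code; for brevity it only marks the core characters, while the real
accounting would also replace the 'S' once all cores of a socket are busy,
and would handle 'T' entries):

#include <stdio.h>

/* Replace the character of <socket>,<core> (both counted from 0) in the
 * accounting string with a dot to mark it as busy. */
static void mark_core_used(char *accounting, int socket, int core)
{
    int cur_socket = -1, cur_core = -1;
    char *p;

    for (p = accounting; *p != '\0'; p++) {
        if (*p == 'S') { cur_socket++; cur_core = -1; }
        else if (*p == 'C') cur_core++;
        if (*p == 'C' && cur_socket == socket && cur_core == core) {
            *p = '.';
            return;
        }
    }
}

int main(void)
{
    char accounting[] = "SCCSCC";
    mark_core_used(accounting, 0, 0);
    mark_core_used(accounting, 0, 1);
    printf("%s\n", accounting);   /* prints "S..SCC" */
    return 0;
}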

Question: Should this string containing the 'free' part of the topology also
be shown to the user so that it can be requested?

In order to extend the pe_hostfile with the chosen socket and core when
'-binding pe' is selected, all requested hosts must have the same topology.

3.4 Other components/interfaces
-------------------------------

The DRMAA library has to accept the binding parameter within the
native
specification.

Qmon is updated in order to reflect the new binding parameter.


3.5. Examples
-------------

Example 1: Show topology.
-------------------------

% qstat -F topology
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:topology=SCCCC
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.05     lx26-amd64
        hl:topology=NONE
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.25     lx26-amd64
        hl:topology=SCCSCC


Example 2: Show the amount of sockets on each execution host.
-------------------------------------------------------------

% qstat -F m_socket

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:m_socket=1
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.03     lx26-amd64
        hl:m_socket=0
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.20     lx26-amd64
        hl:m_socket=2



Example 3: Show the amount of cores on each execution host.
-----------------------------------------------------------

% qstat -F m_core

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
        hl:m_core=4
---------------------------------------------------------------------------------
all.q@gally2                   BIPC  0/0/40         0.04     lx26-amd64
        hl:m_core=0
---------------------------------------------------------------------------------
all.q@regen                    BIPC  0/0/40         0.16     lx26-amd64
        hl:m_core=4


Example 4: Bind two jobs to different sockets.
----------------------------------------------

(In order to get exclusive user access to a host, an advance reservation with
a parallel environment could be requested and the jobs submitted into it.)

On a 2-socket machine (with 2 cores each) an OpenMP job is submitted to the
first socket (on its 2 cores), and an environment variable indicating the
number of threads OpenMP should use for the job is set.

% qsub -pe testpe 4 -b y -binding linear:2:0,0 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":0,0" beginning on first socket
and first core there).

Bind the next job to the two cores on the other socket:

% qsub -pe testpe 4 -b y -binding linear:2:1,0 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

("linear:2": get 2 cores in a row; ":1,0" beginning on 2nd socket
and first core there).

The same could also be achieved by submitting the job twice with the same
parameters:

% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin
% qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
        -l topology=SCCSCC /path/to/openmp_bin

Example 5: Allow the job to use only one core (the first one) on
each of the two sockets.
-------------------------------------------------------------------------------------

% qsub -pe testpe 2 -b y -binding striding:0-3:2 -v OMP_NUM_THREADS=2 \
        -l topology=SCCSCC /path/to/openmp_bin

("striding:0-3:2": beginning from core 0 to core 3 take every
second core
[resulting in 0 and 2]).


4. Risks
---------

When the host is not requested exclusively and several jobs use core binding,
collisions could occur, which could lead to degraded performance (because one
core is oversubscribed while others have nothing to do). This is prevented
automatically by not performing the binding for the subsequent jobs.
When the user requests a specific binding, this could lead to better or worse
performance depending on the type of application. Hence binding should only be
used by users who know exactly the behaviour of their applications and the
details of the execution hosts. In most cases the operating system scheduler
does a good job, especially when hosts are oversubscribed.

5. Future enhancements
----------------------

Topology-aware scheduler. For this we need a clear concept of threads, cores,
and sockets, which is currently not available in the system and is integrated
for the first time with this specification. The scheduler could then select
free hosts in terms of the requested <socket>,<core> pairs, as a soft or a
hard request.

Support for mixed OpenMPI/OpenMP jobs.

Including other operating systems.
