[GE users] core binding with Sun Grid Engine

reuti reuti at staff.uni-marburg.de
Tue Aug 25 22:57:40 BST 2009


Hi,

just some thoughts on it: to me the -binding option looks like it
puts a lot of burden on the end user, who has to know something about
each node's configuration. In a first step I would prefer to have the
option to define in a queue entry just "bind the processes to a
granted core yes/no". This would mean starting with item 5 :-/ . This
proposal looks to be suited only for the exclusive usage of nodes, as
I would also fear item 4, and for that reason I wouldn't put too many
features into it.

The syntax "-binding striding:0-3:2" could be made more generic, like
in an RQS: "-binding cores,sockets"

"-binding 2,3" - two cores on any three sockets (i.e. slots=6)
"-binding 2,{*}" - two cores on each socket in the machine
"-binding 2,*" - two cores on any socket (this will give only two
slots here)
"-binding {2},*" - two cores on any socket (this will give only two
slots here) and use the socket exclusive even when there are 4 cores
"-binding *,1" - all free cores on any socket
"-binding *,2" - all free cores on any two sockets
"-binding {*},2" - all cores on any two sockets (means exclusive use
of two complete sockets)
"-binding {2},{*}" - two cores on each socket and use node exclusive
in the end

To ease things for the end user, the -binding could be put into a
default request. But then we would need different default requests
for different queues/hosts for now.

-- Reuti


On 24.08.2009 at 09:52, dagru wrote:

> Hi,
>
> since the advent of complex multi-core CPUs and operating system
> support for CPU/core binding, several Sun Grid Engine users want to
> exploit the possibilities some operating systems offer. You'll find a
> draft of the job2core-binding specification for the Solaris and Linux
> operating systems appended to this email.
>
> Thank you for your feedback!
>
> With kind regards,
> Daniel
>
>
>
> Job to core binding
> -------------------
>
> Version  Comments                                                Date        Author
> ------------------------------------------------------------------------------------
>
> 1.0      Initial Version                                         08/13/2009  DG
> 1.1      Extending with definitions and architecture specifics   08/14/2009  DG
> 1.2      Added Solaris kstat support                             08/18/2009  DG
> 1.3      Added -binding linear:<amount>                          08/18/2009  DG
> 1.4      Added findings from meeting 08/19                       08/19/2009  DG
> 1.5      Added findings from meeting 08/20                       08/21/2009  DG
>
> 1 Introduction
> --------------
>
> With the advent of complex multi-core CPUs and NUMA architectures on
> cluster nodes, the operating system scheduling is not always perfect
> for all kinds of applications. For parallel applications there are
> scenarios where it might be best to distribute the processes/threads
> over the different sockets available on the host; for others it might
> be better to place them on different cores of a single socket.
>
> In the current Sun Grid Engine architecture there is just the concept
> of 'slots', with no notion of whether the slots reflect sockets,
> cores, or hardware-supported threads. Performing core binding is also
> currently unreflected in Sun Grid Engine: until now it has been up to
> the administrator and/or user to enhance their applications to
> perform such a binding.
>
> This specification describes an enhancement for the Solaris and Linux
> versions of Sun Grid Engine. Reporting topology information and
> binding processes to specific cores on hosts is the foundation for
> additional fine-grained NUMA settings, like the configuration of
> specific memory access patterns on the application side.
>
> 1.1 Definitions
> ----------------
>
> This section defines terms that are used frequently and with a
> specific meaning in this specification.
>
> 1.1.1 System topology
> ---------------------
>
> Within this specification the term topology refers to the underlying
> hardware structure of a Sun Grid Engine execution host. The topology
> describes the number of sockets the machine has (and that are
> available) and the number of cores (or threads on SMT/CMT) each
> socket has. In case of a virtual machine, the topology of the virtual
> machine is reported.
>
>
> 1.1.2 Core affinity or core binding
> -----------------------------------
>
> The term core affinity (within this spec also core binding) refers to
> the likelihood of a process running on the same processor again after
> it was preempted by the OS scheduler. The core affinity (also called
> processor affinity) can be influenced on Linux via a system call
> which takes a bitmask as a parameter. In this bitmask each bit
> reflects one core. If a core is turned off (via a logical bit with
> the value 0) then the OS scheduler avoids migrating the process to
> that core. By default (i.e. without binding) all cores are turned on,
> so that the process can be scheduled to an arbitrary core.
> In the Solaris operating system processor sets can be used, which
> define a set of processors on which only processes explicitly bound
> to this set are able to run.
>
> 1.1.3 Collisions of core bindings
> ---------------------------------
>
> Within this specification the term collision (of two or more core
> bindings) refers to the circumstance that there is at least one pair
> of processes which both have a (non-default) core affinity set and
> which share at least one core. Another source of a collision is when
> the administrator allows just one process per socket (in order to
> avoid oversubscribing socket-related resources) and, in addition to
> the process already running on this socket, a second process wants to
> use free cores on this socket. The problem with collisions is that
> core or socket resources can easily be oversubscribed, resulting in
> degraded performance while other sockets or cores are unused.
>
> 1.2 Operating system dependent issues
> -------------------------------------
>
> 1.2.1 Solaris specific behavior
> -------------------------------
>
> Sun Grid Engine currently supports the processor set feature of
> Solaris, which needs additional administrator configuration. Once a
> processor set is established it can be configured on PE level,
> meaning all processes of the PE run within this set. On the other
> hand, each processor which is assigned to a processor set will run
> only processes that are explicitly bound to that processor set. The
> only exception is a process that requires resources which are only
> available in the processor set; such a process is allowed to use this
> resource. Not all available processors/cores can be included in
> processor sets; at least one processor must remain available for the
> operating system. The binding to a processor set is inherited across
> fork and exec.
>
> Solaris 9 and higher supports the concept of locality groups, which
> build latency groups on NUMA systems. With this, topology-related
> information in terms of memory latency can be retrieved, but it is
> not possible to get the actual number of physical sockets and cores.
> For that, the kernel kstat structure has to be used.
>
> Processor binding (binding an LWP to a specific processor) can be
> performed via the processor_bind system call. Bindings are inherited
> across fork and exec system calls, but with this call it is currently
> only possible to bind a process/thread to a single core, which
> differs from the Linux behavior and cannot be used (because of the
> danger of oversubscribing one core). Therefore Solaris processor sets
> have to be used. Processor sets differ from the Linux behavior in two
> important points: 1. Not all available cores on a single machine can
> be used for core binding (at least one core must remain available for
> the OS). 2. The submitted job runs exclusively on the cores to which
> it is bound, which means that no unbound job is allowed to use these
> cores.
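>
> For illustration only, a minimal sketch of the processor set handling
> described above (the processor IDs 1 and 2 are placeholders for cores
> chosen by the system; creating and binding processor sets requires
> appropriate privileges):
>
>   /* Sketch: create a Solaris processor set, move two CPUs into it and
>    * bind the calling process (and everything it forks/execs) to it. */
>   #include <sys/types.h>
>   #include <sys/procset.h>
>   #include <sys/pset.h>
>   #include <unistd.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       psetid_t pset;
>
>       if (pset_create(&pset) != 0) {
>           perror("pset_create");
>           return 1;
>       }
>       /* CPUs 1 and 2 join the set; CPU 0 stays available for the OS */
>       if (pset_assign(pset, 1, NULL) != 0 ||
>           pset_assign(pset, 2, NULL) != 0) {
>           perror("pset_assign");
>           return 1;
>       }
>       if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
>           perror("pset_bind");
>           return 1;
>       }
>       printf("pid %ld bound to pset %d\n", (long)getpid(), (int)pset);
>       return 0;
>   }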
>
>
> 1.2.2 Linux specific behavior
> ------------------------------
>
> The Linux scheduler inherently supports soft affinity (also called
> natural affinity), which means the scheduler avoids migrating
> processes from one processing unit to another. In the 2.5 kernel hard
> processor affinity (meaning that the scheduler can be told on which
> cores a specific process may or may not run) was introduced. Patches
> for the 2.4 kernel are available (for the system call
> sched_setaffinity(PID, affinity bitmask)). Newer kernel versions are
> also NUMA-aware, and memory access policies can be set with the
> libnuma library. The Linux kernel includes a load_balancer component,
> which is called at specific intervals (e.g. every 200 ms) or when one
> run-queue is empty (pull migration). Each processor has its own
> scheduler and its own run-queue. The load_balancer tries to equalize
> the number of runnable tasks between these run-queues. This is done
> via process migration.
>
> Setting a specific core affinity/binding is done via an affinity
> bitmask, which the sched_setaffinity system call accepts as a
> parameter. Example: 1011 means the process will be bound to the
> first, second, and fourth core (the scheduler only dispatches the
> process to the first, second, or fourth core even if the run-queue of
> core three is empty). The default mask (without binding) is 1111 (on
> a four-core machine), which means the scheduler can dispatch the
> process to any appropriate core. Core affinity is inherited by child
> processes, but each process can redefine its affinity in any way.
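>
> As a minimal illustration of this bitmask (not part of the spec), a
> Linux sketch using the glibc cpu_set_t wrapper around the system call
> to bind the calling process to the first, second, and fourth core,
> i.e. the mask 1011:
>
>   /* Sketch: bind the calling process to cores 0, 1 and 3 (mask 1011).
>    * Child processes inherit the mask unless they redefine it. */
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       cpu_set_t mask;
>
>       CPU_ZERO(&mask);
>       CPU_SET(0, &mask);   /* first core  */
>       CPU_SET(1, &mask);   /* second core */
>       CPU_SET(3, &mask);   /* fourth core */
>
>       /* pid 0 means "the calling process" */
>       if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
>           perror("sched_setaffinity");
>           return 1;
>       }
>       return 0;
>   }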
>
> The /proc/cpuinfo file contains information about the processor
> topology. In order to simplify access to the topology, which differs
> between kernel versions and Linux distributions, an external API is
> used as an intermediate layer for this task. Two APIs were
> investigated: libtopology from INRIA Bordeaux and PLPA (Portable
> Linux Processor Affinity library) from the OpenMPI project.
> libtopology supports different operating systems and also reports
> memory settings, whereas PLPA is more lightweight and Linux-only.
> Because of its license and proven stability, PLPA (which is used by
> several projects, including OpenMPI itself) is going to be used. With
> PLPA a simple mapping from a logical <socket>,<core> pair to the
> internal processor ID (which has to be used in order to set the
> bitmask) can be done when the topology is supported. In order to also
> report the availability of SMT, the proc filesystem is parsed
> additionally.
>
> 2 Project Overview
> ------------------
>
> 2.1 Project Aim
> ---------------
>
> The goal is to provide more topology-related information about the
> execution hosts and to give users the ability to bind their jobs to
> specific cores on the execution system, depending on the needs of the
> application.
>
> 2.2 Project Benefit
> -------------------
>
> Better performance of parallel applications. Depending on the core
> binding strategy and system topology, limited energy savings could
> also be achieved (for example by using just a single socket, because
> some power management works at socket level).
>
> 3 System Architecture
> ---------------------
>
> 3.1 Configuration
> -----------------
>
> Sun Grid Engine gets different load values (static and non-static)
> out of the box from the execution hosts and reports them to the
> scheduler and the user (they can be displayed via qhost or qstat, for
> example). Based on these load values the user can request resources
> and the scheduler makes its decisions. Currently there are no
> fine-grained load values regarding the specific host topology. In
> order to give the user the ability to request specific hosts based on
> their topology and/or to request a special core affinity (core
> binding) for jobs, the following new load values have to be
> introduced:
>
> Load value 'topology': Reports the topology on Solaris hosts and on
> supported (depending on the kernel version) Linux hosts, otherwise
> 'NONE'. The topology is a string value with the following syntax:
>
>       <topology> ::= 'NONE' | <socket>
>       <socket>   ::= 'S' <core> [ <socket> ]
>       <core>     ::= 'C' [ <threads> ] [ <core> ]
>       <threads>  ::= 'T' [ <threads> ]
>
> Hence each 'S' denotes a socket and the following 'C's are the cores
> on that socket. Please be aware that on some architectures this is
> extended with additional 'T's (threads on SMT/CMT architectures) per
> core.
>
> The topology string currently does not reflect the latency of
> memory to each
> CPU (i.e. NUMA non-NUMA differentiation).
>
> Examples:
> "SCCSCC" means a 2-socket host with 2 cores on each socket.
> "SCCCC" means a one-socket machine with 4 cores.
> "SCTTCTT" means a one-socket machine with 2 cores and hyperthreading
> (Intel's name for SMT).
> "SCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTTCTTTTTTTT"
> would be a Sun T2 processor with 8 execution units, each of them
> supporting 8 threads via chip logic.
>
> Note: Depending on your host setup (BIOS or kernel parameters) a C
> could also mean a thread on an SMT/CMT system.
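>
> A small sketch (for illustration, not part of the specification) of
> how the grammar above can be evaluated; it counts sockets, cores and
> threads in a topology string and would, for example, yield the values
> behind the m_socket/m_core/m_thread load values described below:
>
>   /* Sketch: count sockets, cores and threads in a topology string. */
>   #include <stdio.h>
>   #include <string.h>
>
>   static void count_topology(const char *topo,
>                              int *sockets, int *cores, int *threads)
>   {
>       *sockets = *cores = *threads = 0;
>       if (topo == NULL || strcmp(topo, "NONE") == 0)
>           return;                      /* no supported topology */
>       for (; *topo != '\0'; topo++) {
>           switch (*topo) {
>           case 'S': (*sockets)++; break;
>           case 'C': (*cores)++;   break;
>           case 'T': (*threads)++; break;
>           default:  break;             /* ignore anything unexpected */
>           }
>       }
>   }
>
>   int main(void)
>   {
>       int s, c, t;
>
>       count_topology("SCTTCTT", &s, &c, &t);
>       printf("sockets=%d cores=%d threads=%d\n", s, c, t);  /* 1 2 4 */
>       return 0;
>   }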
>
> [Possible solution for core/thread differentiation: Introduce an
> SMT/CMT static load value which is 1 by default and 2 (or more,
> depending on the SMT/CMT processor architecture) when SMT is on; it
> has to be configured by the admin according to the BIOS/kernel
> settings. This could be used as a divisor.]
>
> Load value 'm_socket': The total number of sockets available on a
> machine. If the machine has no supported topology it is equal to the
> existing "cpu" load value.
>
> Load value 'm_core': The total number of cores available on a
> machine. If the machine has no supported topology it is equal to the
> existing "cpu" value.
>
> Load value 'm_thread': The total number of threads the machine
> supports. In case of CMT/SMT this can be a multiple of the cores
> available. If the machine has no supported topology it is equal to
> the existing 'cpu' value.
>
> With the new load value 'm_core' the installation routine for
> execution hosts is changed so that the 'slots' value defaults to the
> number of cores found.
>
> 3.2 Functionality
> -----------------
>
> 3.2.1 New "qsub -binding" parameter
> -----------------------------------
>
> In order to give users the ability to request a special core affinity
> needed for their application, a new submission parameter ('-binding')
> has to be introduced. With this parameter a special setup for the
> submitted job can be requested. A specific core is described via a
> <socket_number>,<core_number> tuple. Counting begins at 0, so that
> '0,1' describes the 2nd core on the first socket.
>
> To relieve the user of the burden of generating long lists of
> socket-core pairs, different strategies can be requested. A strategy
> describes the method by which those pairs are created transparently
> by the system.
>
> With the new submission (CLI qsub) parameter '-binding', different
> core binding strategies can be selected. This should be used only in
> conjunction with exclusive host access for the job. The binding is
> done on Linux hosts only; other hosts ignore it, so it is up to the
> user/admin to request a supported host. All processes started on one
> host from the job script get the same binding mask (which specifies
> which cores to use or not to use). Doing a per-process binding is
> error-prone but could be a future enhancement. When a PE job is
> started using the host exclusively the binding is applied to it, but
> binding is also applied when a normal job script or binary job is
> started. The number of requested slots does not restrict the number
> of cores the process(es) are bound to, because without binding a
> process is bound to every available core.
>
> The following new parameters are allowed (remark: the optional
> parameter
> [env|pe|set] is described below):
>
> '-binding [env|pe|set] linear:<amount>' := Sets the core binding so
> that the system tries to use <amount> successive cores on a single
> socket. If no empty socket (in terms of already bound jobs) is
> available, an already partly occupied socket which offers the
> required number of cores is used. If this fails, consecutive cores on
> consecutive sockets are used. If this also fails, no binding is done.
>
> '-binding [env|pe|set] linear:<amount>:[<socket>,]<core>' := Sets the
> core binding for the process/script started by qsub to <amount>
> consecutive cores starting at core <core> on socket <socket>. If
> '<socket>,' is not given, the socket number is calculated from the
> <core> number. Note that the first core on the first socket is
> described as 0,0. If there is a misconfiguration (<amount> is too
> high, the [<socket>,]<core> is unavailable, or a collision occurs),
> no core binding is done.
>
> '-binding [env|pe|set] striding:<firstcore>-<lastcore>:<n>' := Sets
> the core binding in a striding manner, meaning that, beginning with
> <firstcore>, cores up to <lastcore> (inclusive where possible) are
> used with a step size of <n>. If <n> is 2, for example, every second
> core is taken. If there are range problems, no mask is set at all.
> The core numbering is global, which means that on a 2-socket system
> (with 4 cores each) the core denoted as 4 is the first core on the
> second socket (<1,0>).
>
> '-binding [env|pe|set] striding:<socket>,<core>-
> <socket>,<core>:<n>' := Sets
> core binding like before but with the difference that the first and
> last core
> are specified as core numbers on a specific socket.
>
> '-binding [env|pe|set] striding:<amount>:<n>' := Sets the core
> binding like before, but does not use a specified socket and core
> number as the starting point. The system searches for <amount> cores
> with an offset of <n> between them which are not already bound. When
> no placement is found (because <amount> is too high or in case of
> collisions), no binding is done.
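>
> For illustration, a sketch (not part of the implementation) of how a
> striding request with global core numbering could be expanded into
> the list of cores for the affinity mask; the range check mirrors the
> "no mask is set at all" rule above:
>
>   /* Sketch: expand 'striding:<firstcore>-<lastcore>:<n>' into core ids.
>    * Example: first=0, last=3, step=2 on "SCCSCC" yields cores 0 and 2. */
>   #include <stdio.h>
>
>   /* returns the number of selected cores, or -1 on range problems */
>   static int expand_striding(int first, int last, int step,
>                              int total_cores, int *cores, int max)
>   {
>       int n = 0, c;
>
>       if (first < 0 || last < first || step < 1 || last >= total_cores)
>           return -1;                    /* range problem: no binding */
>       for (c = first; c <= last && n < max; c += step)
>           cores[n++] = c;
>       return n;
>   }
>
>   int main(void)
>   {
>       int cores[64], i;
>       int n = expand_striding(0, 3, 2, 4, cores, 64);
>
>       for (i = 0; i < n; i++)
>           printf("core %d\n", cores[i]);   /* prints core 0 and core 2 */
>       return 0;
>   }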
>
> Note that the core affinity mask is set for the script/process which
> is started by Sun Grid Engine on the execution host. The mask is
> inherited by all child threads/processes, which means that all
> subprocesses and threads use the same set of cores. (On Linux, child
> processes are allowed to redefine the core affinity or even use all
> cores.)
>
> The optional [env|pe|set] defines what is performed on the execution
> side. 'env' means an environment variable is set which contains the
> logical core numbers the system determined. This is usually used by
> OpenMP applications in order to determine on which cores the
> application is allowed to run. As an example, for Sun OpenMP
> applications the $SUNW_MP_PROCBIND environment variable can be set
> directly with this content (qsub ... -v SUNW_MP_PROCBIND=$SGE_BINDING).
>
> For OpenMPI jobs scattered over different hosts and using them
> exclusively, the input for a 'rankfile' could be produced by Sun Grid
> Engine. The 'rankfile' reflects the binding strategy chosen by the
> user at submission time. For this, the pe_hostfile is extended in
> order to list the host:socket:core triple for each MPI rank.
>
> 3.2.2 New "qhost" switch
> ------------------------
>
> A new qhost switch is introduced which shows the number of sockets,
> cores, and CPUs as well as the topology string.
>
> 3.2.3 Extension of qconf -sep
> -----------------------------
>
> The qconf -sep command shows, in addition to the hostname, number of
> processors, and architecture, the number of sockets, cores, and
> hardware-supported threads.
>
>
> 3.2.4 Extension of qstat -r
> ---------------------------
>
> The requested binding strategy is shown by qstat -r.
>
> 3.3 Implementation
> ------------------
>
> The implementation for reporting sockets, cores, and topology is done
> via the PLPA (Portable Linux Processor Affinity) library on the Linux
> operating system. Each socket and core reflects a physically present
> and available socket or core. Additionally, the /proc filesystem is
> parsed in order to determine the availability of SMT where possible.
> On the Solaris operating system the number of sockets and cores (and
> on some processors, like the T2, also threads) is retrieved via the
> kernel kstat structure.
>
> The implementation of the -binding [linear|striding|explicit]
> parameter means enhancing the command line submission client qsub,
> the execution daemon, and the shepherd process which is forked for
> each job by the execution daemon. The internal CULL structure JB_Type
> has to be enhanced with an additional list (JB_binding) which
> contains the parameters. The execution daemon then writes the
> strategy into the "config" file (which is also extended by a new
> line) for the shepherd. When started, the shepherd applies the
> binding to itself (because it is sleeping most of the time) and the
> binding is then inherited by the started processes (threads).
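>
> A reduced sketch (assumptions: Linux host, one core already chosen by
> the strategy, /bin/sh standing in for the real job script) of the
> shepherd-side idea: bind the current process first, then exec the
> job, so that the exec'd process and all its children inherit the
> mask:
>
>   /* Sketch: bind this process to core 0, then exec the job. Because
>    * the affinity mask survives exec, the job inherits the binding. */
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   int main(void)
>   {
>       cpu_set_t mask;
>
>       CPU_ZERO(&mask);
>       CPU_SET(0, &mask);                /* core chosen by the strategy */
>       if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
>           perror("sched_setaffinity");
>           return 1;
>       }
>       /* placeholder for the real job script started by the shepherd */
>       execl("/bin/sh", "sh", "-c", "taskset -p $$", (char *)NULL);
>       perror("execl");
>       return 1;
>   }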
>
> An internal data structure is held which reflects the current load
> with respect to used threads, cores, and sockets. The structure is
> similar to the topology string, with the difference that execution
> entities which are currently busy are shown as a dot. As an example,
> a two-socket machine with two cores each, running one parallel job
> with 2 processes on the first socket, would be displayed as "...SCC"
> (the topology string being "SCCSCC").
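>
> A sketch (illustration only; thread entries and any per-socket
> bookkeeping beyond this example are simplified) of how such a 'free
> topology' string could be derived from the topology string and the
> globally numbered cores that are currently bound:
>
>   /* Sketch: mask busy cores (and sockets whose cores are all busy)
>    * with dots, e.g. "SCCSCC" with cores 0 and 1 busy -> "...SCC". */
>   #include <stdio.h>
>   #include <string.h>
>
>   static int is_busy(const int *busy, int nbusy, int core)
>   {
>       int i;
>
>       for (i = 0; i < nbusy; i++)
>           if (busy[i] == core)
>               return 1;
>       return 0;
>   }
>
>   static void mask_topology(const char *topo, const int *busy,
>                             int nbusy, char *out)
>   {
>       int i, core = -1, socket_pos = -1, socket_all_busy = 0;
>
>       strcpy(out, topo);
>       for (i = 0; out[i] != '\0'; i++) {
>           if (out[i] == 'S') {
>               /* close the previous socket: dot it if all cores were busy */
>               if (socket_pos >= 0 && socket_all_busy)
>                   out[socket_pos] = '.';
>               socket_pos = i;
>               socket_all_busy = 1;
>           } else if (out[i] == 'C') {
>               core++;
>               if (is_busy(busy, nbusy, core))
>                   out[i] = '.';
>               else
>                   socket_all_busy = 0;
>           }
>       }
>       if (socket_pos >= 0 && socket_all_busy)
>           out[socket_pos] = '.';
>   }
>
>   int main(void)
>   {
>       int busy[] = { 0, 1 };
>       char out[64];
>
>       mask_topology("SCCSCC", busy, 2, out);
>       printf("%s\n", out);              /* prints "...SCC" */
>       return 0;
>   }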
>
> Question: Should this string containing the 'free' part of the
> topology also be shown to the user, so that it can be requested?
>
> In order to extend the pe_hostfile with the chosen socket and core
> when '-binding pe' was selected, all requested hosts must have the
> same topology.
>
> 3.4 Other components/interfaces
> -------------------------------
>
> The DRMAA library has to accept the binding parameter within the
> native
> specification.
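>
> For illustration, a sketch of how the binding request would travel
> through the DRMAA C binding's native specification (the concrete
> '-binding linear:2' value and the job path are just examples; most
> error handling is omitted):
>
>   /* Sketch: submit a job that requests core binding via DRMAA. */
>   #include <stdio.h>
>   #include "drmaa.h"
>
>   int main(void)
>   {
>       char diag[DRMAA_ERROR_STRING_BUFFER];
>       char jobid[DRMAA_JOBNAME_BUFFER];
>       drmaa_job_template_t *jt = NULL;
>
>       if (drmaa_init(NULL, diag, sizeof(diag)) != DRMAA_ERRNO_SUCCESS) {
>           fprintf(stderr, "drmaa_init: %s\n", diag);
>           return 1;
>       }
>       drmaa_allocate_job_template(&jt, diag, sizeof(diag));
>       drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/path/to/openmp_bin",
>                           diag, sizeof(diag));
>       /* the new -binding request is passed as part of the native spec */
>       drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
>                           "-binding linear:2", diag, sizeof(diag));
>       drmaa_run_job(jobid, sizeof(jobid), jt, diag, sizeof(diag));
>       printf("submitted job %s\n", jobid);
>       drmaa_delete_job_template(jt, diag, sizeof(diag));
>       drmaa_exit(diag, sizeof(diag));
>       return 0;
>   }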
>
> Qmon is updated in order to reflect the new binding parameter.
>
>
> 3.5. Examples
> -------------
>
> Example 1: Show topology.
> -------------------------
>
> % qstat -F topology
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
>       hl:topology=SCCCC
> ---------------------------------------------------------------------------------
> all.q@gally2                   BIPC  0/0/40         0.05     lx26-amd64
>       hl:topology=NONE
> ---------------------------------------------------------------------------------
> all.q@regen                    BIPC  0/0/40         0.25     lx26-amd64
>       hl:topology=SCCSCC
>
>
> Example 2: Show the number of sockets on each execution host.
> -------------------------------------------------------------
>
> % qstat -F m_socket
>
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
>       hl:m_socket=1
> ---------------------------------------------------------------------------------
> all.q@gally2                   BIPC  0/0/40         0.03     lx26-amd64
>       hl:m_socket=0
> ---------------------------------------------------------------------------------
> all.q@regen                    BIPC  0/0/40         0.20     lx26-amd64
>       hl:m_socket=2
>
>
>
> Example 3: Show the number of cores on each execution host.
> -----------------------------------------------------------
>
> % qstat -F m_core
>
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@chamb                    BIPC  0/0/40         0.00     lx26-amd64
>       hl:m_core=4
> ---------------------------------------------------------------------------------
> all.q@gally2                   BIPC  0/0/40         0.04     lx26-amd64
>       hl:m_core=0
> ---------------------------------------------------------------------------------
> all.q@regen                    BIPC  0/0/40         0.16     lx26-amd64
>       hl:m_core=4
>
>
> Example 4: Bind two jobs to different sockets.
> ----------------------------------------------
>
> (In order to get exclusive access to a host, an advance reservation
> with a parallel environment could be requested and submitted into.)
>
> On a 2-socket machine (with 2 cores each) an OpenMP job is submitted
> to the first socket (on its 2 cores), and an environment variable
> indicating the number of threads OpenMP should use for the job is
> set.
>
> % qsub -pe testpe 4 -b y -binding linear:2:0,0 -v OMP_NUM_THREADS=4 \
>       -l topology=SCCSCC /path/to/openmp_bin
>
> ("linear:2": get 2 cores in a row; ":0,0" beginning on first socket
> and first core there).
>
> Bind the next job to the two cores on the other socket:
>
> % qsub -pe testpe 4 -b y -binding linear:2:1,0 -v OMP_NUM_THREADS=4 \
>       -l topology=SCCSCC /path/to/openmp_bin
>
> ("linear:2": get 2 cores in a row; ":1,0" beginning on 2nd socket
> and first core there).
>
> The same could also be done by submitting two jobs with the same
> parameters:
>
> % qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
>       -l topology=SCCSCC /path/to/openmp_bin
> % qsub -pe testpe 4 -b y -binding linear:2 -v OMP_NUM_THREADS=4 \
>       -l topology=SCCSCC /path/to/openmp_bin
>
> Example 5: Allow the job to use only one core (the first one) on
> each of the two sockets.
> -----------------------------------------------------------------
>
> % qsub -pe testpe 2 -b y -binding striding:0-3:2 -v OMP_NUM_THREADS=2 \
>       -l topology=SCCSCC /path/to/openmp_bin
>
> ("striding:0-3:2": beginning from core 0 to core 3, take every second
> core [resulting in cores 0 and 2]).
>
>
> 4. Risks
> ---------
>
> When the host is not requested exclusively and several jobs use core
> binding, collisions could occur, which could lead to degraded
> performance (because of oversubscribing one core while others have
> nothing to do). This is prevented automatically by not performing the
> binding for the subsequent jobs. When the user requests a specific
> binding this could lead to better or worse performance depending on
> the type of application. Hence binding should only be used by users
> who know exactly the behaviour of their applications and the details
> of the execution hosts. In most cases the operating system scheduler
> does a good job, especially when hosts are oversubscribed.
>
> 5. Future enhancements
> ----------------------
>
> Topology-aware scheduler. For this we need a clear concept of
> threads, cores, and sockets, which is currently not available in the
> system and is introduced for the first time with this specification.
> Let the scheduler select free hosts in terms of the requested
> <socket>,<core> pairs as a soft or a hard request.
>
> Support for mixed OpenMPI/OpenMP jobs.
>
> Including other operating systems.
>



