[GE users] Slot-wise suspend on subordinate

pollinger harald.pollinger at sun.com
Fri Jul 24 23:24:02 BST 2009


pollinger wrote:
> Hi,
>
> with the rising number of multi-core CPUs out in the field, there is
> a need for a finer-grained suspend on subordinate than the existing queue
> instance-wise suspend on subordinate.
>
> Attached is a draft of a specification for a slot-wise suspend on
> subordinate. Your feedback would be highly appreciated!
>
> Regards,
> Harald

There seems to be a bug in the mailing list, so I am pasting the specification into this mail:

Slotwise suspend on subordinate
===============================

    Version  Comments         Date      Author
    -------------------------------------------
    1.0      Initial Version  24-07-09  HP

1 Introduction
==============

    On multi-core hosts, the existing queue instance-wise suspend on subordinate
    is often not very useful. If there are many jobs running in a subordinated
    queue instance, all these jobs get suspended when a job gets scheduled to the
    superordinated queue, even if this job needs only one core.
    It would make more sense to just make sure the new, high priority job gets as
    many cores as it needs to run. In the current Sun Grid Engine architecture,
    this leads to a slot-wise suspend on subordinate.

2 Project Overview
==================

2.1 Project Aim

    The goal is to avoid jobs being unnecessarily suspended in a multi-slot
    subordinated queue instance when a new job is started in a superordinated
    queue.

2.2 Project Benefit

    Better utilization of multi-core CPUs.


3 System Architecture
=====================

3.1 Configuration

    The configuration of the slot-wise suspend on subordinate will extend the
    queue instance-wise suspend on subordinate.

    Queue instance-wise suspend on subordinate is configured in the queue_conf:

    subordinate_list  <queue_name>[=<value>][,<queue_name>[=<value>],...]

    where
    <queue_name> is the name of the queue that is to be subordinated
    <value>      is the number of slots that must be filled in the superordinated
                    queue to trigger suspension in the subordinated queue. If
                    <value> is omitted, it is set to the number of slots.


    The slot-wise suspend on subordinate will be configured this way:

    subordinate_list   slots=<nr_of_slots>(<queue_name>[,<queue_name>,...])[,slots=....]

    where
    slots         denotes that this is a slot-wise suspend on subordinate
    <nr_of_slots> is either the pseudo variable "$slots" that is equal to the
                     number of slots in the superordinate queue instance
                     or
                     the sum of slots that can be occupied in all subordinated
                     queues and the superordinate queue before slots have to be
                     suspended in the subordinated queues.
    <queue_name>  the name of a subordinated queue

    This syntax allows several groups of queues to be defined that are
    subordinated to the current queue. Allowing more than one group might
    turn out to be not useful or too complex, but the syntax would permit it.


    I'm not sure if it makes sense to mix both kinds of subordination, but the
    syntax would allow it, too, e.g.:

    subordinate_list son.q=5,slots=8(daughter.q, uncle.q),slots=3(son.q, queue2.q)
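
    To make the proposed syntax concrete, here is a minimal parsing sketch in
    Python. It is not part of Sun Grid Engine; the names SubordinateList and
    parse_subordinate_list are hypothetical. It assumes that the queue names
    inside the parentheses are plain comma-separated names, as in the mixed
    example above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SubordinateList:
    # Queue instance-wise entries: queue name -> threshold (None if omitted).
    queue_wise: dict = field(default_factory=dict)
    # Slot-wise groups: (limit, [queue names]); limit is an int or "$slots".
    slot_wise: list = field(default_factory=list)


# One entry is either "slots=<limit>(<q1>, <q2>, ...)" or "<queue>[=<value>]",
# followed by a comma or the end of the string.
_ENTRY = re.compile(
    r"\s*(?:slots=(\$slots|\d+)\(([^)]*)\)|([^,=()\s]+)(?:=(\d+))?)\s*(?:,|$)"
)


def parse_subordinate_list(text):
    """Parse a subordinate_list value in the syntax proposed in 3.1."""
    result = SubordinateList()
    pos = 0
    while pos < len(text):
        match = _ENTRY.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse subordinate_list at offset {pos}")
        limit, group, name, threshold = match.groups()
        if limit is not None:
            # Slot-wise group: keep "$slots" symbolic, convert numbers to int.
            queues = [q.strip() for q in group.split(",") if q.strip()]
            value = limit if limit == "$slots" else int(limit)
            result.slot_wise.append((value, queues))
        else:
            # Queue instance-wise entry with an optional threshold.
            result.queue_wise[name] = int(threshold) if threshold else None
        pos = match.end()
    return result


parsed = parse_subordinate_list(
    "son.q=5,slots=8(daughter.q, uncle.q),slots=3(son.q, queue2.q)"
)
print(parsed.queue_wise)
print(parsed.slot_wise)
```

    For the mixed example, this yields one queue instance-wise entry
    (son.q with threshold 5) and two slot-wise groups with limits 8 and 3.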


3.2 Functionality

    As with queue instance-wise suspend on subordinate, both the subordinate
    and the superordinate queue instance must be on the same host.

    It is assumed that there is a 1:1 mapping between cores and slots. There
    might be useful cases where a different mapping is used, but this would
    add unnecessary complexity here.

    For simplicity, this specification assumes that there are only serial
    batch jobs. Other kinds of jobs will be taken into account in a later
    version of this specification.


    The goal is to always have a free CPU core for every high priority job,
    without unnecessarily suspending low priority jobs.

    The algorithm will work like this:
    It is assumed that the queues are configured before any jobs are submitted.
    For each group of queues and the superordinated queue, the sum of running
    jobs (i.e. the number of occupied slots) is calculated. If the sum grows
    to <nr_of_slots>+1, a job in one of the most subordinate queues is
    suspended. Among all jobs in the most subordinate queues, the one with the
    shortest run time (wallclock time) is suspended.
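
    The selection rule can be sketched in Python as follows. This is only an
    illustration under stated assumptions: Job, depth and pick_job_to_suspend
    are hypothetical names, "depth" is 0 for the superordinated queue and
    grows with each level of subordination, and the run time is taken as the
    time since job start, so the latest start means the shortest run time.
    The sketch also assumes that if the deepest queues have no running jobs,
    the next level up is considered.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    queue: str
    start: float            # start time, e.g. seconds since the epoch
    suspended: bool = False


def pick_job_to_suspend(jobs, depth, nr_of_slots):
    """Return the job to suspend, or None if no suspension is needed."""
    running = [j for j in jobs if not j.suspended]
    if len(running) <= nr_of_slots:
        return None  # the group still fits into <nr_of_slots>
    # Only jobs in subordinated queues (depth > 0) may be suspended.
    candidates = [j for j in running if depth[j.queue] > 0]
    if not candidates:
        return None
    # Restrict to the most subordinate level that has running jobs.
    deepest = max(depth[j.queue] for j in candidates)
    lowest = [j for j in candidates if depth[j.queue] == deepest]
    # Latest start time = shortest run time.
    return max(lowest, key=lambda j: j.start)


depth = {"father.q": 0, "son.q": 1}
jobs = [Job(7, "father.q", 7.0), Job(9, "father.q", 9.0)]
jobs += [Job(i, "son.q", float(i)) for i in (1, 2, 3, 4, 5, 6, 8)]
victim = pick_job_to_suspend(jobs, depth, nr_of_slots=8)
print(victim.job_id)
```

    With nine running jobs and a limit of 8, the youngest job in son.q
    (job 8 in this made-up data) is chosen.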

    If the queue configuration is changed with running jobs, the above
    calculation has to be done for all jobs that are bound to the superordinated
    or one of the subordinated queues.

    If a job finishes in either the superordinated or in the subordinated queue,
    the calculation is done again, and if possible, a job in the subordinated
    queue is unsuspended. Among all suspended jobs, the one with the longest
    run time gets unsuspended. If there are jobs waiting to be run in one
    of the subordinated queues or in the superordinated queue, these waiting
    jobs must be checked at the same time as the suspended jobs to determine
    which one gets to run.
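
    The unsuspend rule can be sketched the same way. Again, Job and
    pick_job_to_resume are hypothetical names, and the run time is assumed to
    be measured from job start, so the earliest start means the longest run
    time. The arbitration between suspended and waiting jobs is not specified
    in detail above and is deliberately left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    queue: str
    start: float            # start time, e.g. seconds since the epoch
    suspended: bool = False


def pick_job_to_resume(jobs, nr_of_slots):
    """Return the suspended job to resume, or None if no slot is free."""
    running = sum(1 for j in jobs if not j.suspended)
    suspended = [j for j in jobs if j.suspended]
    if running >= nr_of_slots or not suspended:
        return None
    # Earliest start time = longest run time.
    return min(suspended, key=lambda j: j.start)


jobs = [
    Job(1, "son.q", 1.0),
    Job(2, "son.q", 2.0, suspended=True),
    Job(3, "son.q", 3.0, suspended=True),
]
candidate = pick_job_to_resume(jobs, nr_of_slots=8)
print(candidate.job_id)
```

    In this made-up data, job 2 is the older of the two suspended jobs and is
    therefore the one chosen for resumption.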


    Examples:

    These tables show the behaviour over time.
    t  = total slots in this queue
    r  = slots with running jobs
    s  = slots with jobs suspended because of subordination
    Tn = point in time

    Actions between points in time:
    R  = Job started in this queue
    F  = Job finished in this queue
    W  = Job was submitted to this queue, but doesn't get scheduled to it,
         because the queue is full due to subordination

    Example 1:
    8 core host
    queue_conf of father.q:
       subordinate_list slots=8(son.q)

                T0       T1       T2       T3       T4
                t/r/s    t/r/s    t/r/s    t/r/s    t/r/s
    father.q    8/0/0 R  8/1/0    8/1/0 R  8/2/0    8/2/0
      son.q     8/6/0    8/6/0 R  8/7/0    8/6/1 F  8/6/0
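
    The table for Example 1 can be replayed with a small counter-based
    simulation. This is a sketch only: the function name replay and the
    event encoding are made up for illustration, and it assumes that a
    finishing job in son.q is a running job, not a suspended one.

```python
def replay(limit, father_running, son_running, events):
    """Replay R/F events for one father.q/son.q pair on a single host.

    Returns one (father_r, son_r, son_s) tuple per event, where r counts
    running jobs and s counts jobs suspended because of subordination.
    """
    f_r, s_r, s_s = father_running, son_running, 0
    states = []
    for queue, action in events:
        delta = 1 if action == "R" else -1   # "R" = started, "F" = finished
        if queue == "father.q":
            f_r += delta
        else:
            s_r += delta                     # assumes a running job finished
        # Suspend in son.q while the group exceeds the limit.
        while f_r + s_r > limit and s_r > 0:
            s_r -= 1
            s_s += 1
        # Unsuspend in son.q while there is room again.
        while f_r + s_r < limit and s_s > 0:
            s_s -= 1
            s_r += 1
        states.append((f_r, s_r, s_s))
    return states


example1 = [("father.q", "R"), ("son.q", "R"), ("father.q", "R"), ("son.q", "F")]
print(replay(8, 0, 6, example1))
# [(1, 6, 0), (1, 7, 0), (2, 6, 1), (2, 6, 0)]
```

    The four tuples correspond to the r and s columns at T1 to T4 in the
    table above.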


    Example 2:
    8 core host
    queue_conf of father.q:
       subordinate_list slots=5(son.q)

                T0       T1       T2       T3       T4
                t/r/s    t/r/s    t/r/s    t/r/s    t/r/s
    father.q    8/0/0 R  8/1/0 R  8/2/0 R  8/3/0 F  8/2/0
      son.q     8/4/0    8/4/0    8/3/1    8/2/2 F  8/3/0

    In this example, the cores are never all in use. This could be, e.g., because
    there is a third queue on this host that is unrelated to these two and
    provides 3 slots, or because some process unrelated to SGE on this host
    uses 3 cores.


    Example 3:
    8 core host
    queue_conf of grandfather.q:
       subordinate_list slots=8(father.q)
    queue_conf of father.q:
       subordinate_list slots=8(son.q)

                   T0       T1       T2       T3       T4       T5       T6       T7
                   t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s
    grandfather.q  8/0/0 R  8/1/0 R  8/2/0    8/2/0    8/2/0    8/2/0 R  8/3/0    8/3/0
      father.q     8/3/0    8/3/0    8/3/0 R  8/4/0 R  8/5/0 W  8/5/0    8/4/1 F  8/5/0
        son.q      8/5/0    8/4/1    8/2/2    8/1/3    8/0/4    8/0/4    8/0/4 F  8/0/4

    At T7, there would normally be 4 jobs running in father.q and 1 running in son.q,
    but because a job is still waiting to be scheduled to father.q, that job now runs
    in father.q and all jobs in son.q remain suspended.


    Example 4:
    8 core host
    queue_conf of grandfather.q:
       subordinate_list slots=8(father.q,uncle.q)
    queue_conf of father.q:
       subordinate_list slots=8(son.q,daughter.q)

    The jobs in queue son.q are older than the jobs in daughter.q, the jobs in queue
    uncle.q are older than the one in father.q.

                   T0       T1       T2       T3       T4       T5       T6       T7
                   t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s    t/r/s
    grandfather.q  8/1/0 R  8/2/0 R  8/3/0 R  8/4/0    8/4/0    8/4/0 R  8/5/0    8/5/0
      father.q     8/1/0    8/1/0    8/1/0    8/1/0 R  8/2/0 R  8/3/0    8/2/1    8/3/0
        son.q      8/3/0    8/3/0    8/3/0    8/2/1    8/1/2 W  8/0/3    8/0/3 F  8/0/2
        daughter.q 8/2/0    8/1/1    8/0/2    8/0/2    8/0/2 W  8/0/2    8/0/2    8/0/2
      uncle.q      8/1/0    8/1/0    8/1/0    8/1/0    8/1/0 W  8/1/0    8/1/0    8/1/0



3.3 Displaying queue status

    As with the queue instance-wise suspend on subordinate, "qstat -f" will
    display an "S" in the queue state field to denote suspension because of
    subordination. But with slot-wise suspend on subordinate, not all jobs
    will be in "S" state; most of the time some of them will be in "r" state.
    If all jobs in a queue instance are suspended, it will look the same
    for queue instance-wise and slot-wise suspend on subordinate. To enable
    the user to find out the suspension reason, the "-explain" switch of "qstat"
    will be extended. "qstat -explain S" will print the suspension reason for
    both queue instance-wise and slot-wise suspend on subordinate.

    Example:
    The "qstat -f" output for Example 1 from 3.2 at T3:

    # qstat -f
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    father.q@host1                 BIPC  0/2/8          0.14     sol-sparc64
          7 0.55500 Sleeper    jobuser      r     07/23/2009 12:07:00     1
          9 0.55500 Sleeper    jobuser      r     07/23/2009 12:09:00     1
    ---------------------------------------------------------------------------------
    son.q@host1                    BIPC  0/7/8          0.14     sol-sparc64   S
          1 0.55500 Sleeper    jobuser     r      07/23/2009 12:01:00     1
          2 0.55500 Sleeper    jobuser     r      07/23/2009 12:02:00     1
          3 0.55500 Sleeper    jobuser     r      07/23/2009 12:03:00     1
          4 0.55500 Sleeper    jobuser     r      07/23/2009 12:04:00     1
          5 0.55500 Sleeper    jobuser     r      07/23/2009 12:05:00     1
          6 0.55500 Sleeper    jobuser     r      07/23/2009 12:06:00     1
          8 0.55500 Sleeper    jobuser     S      07/23/2009 12:08:00     1


    # qstat -explain S
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    father.q@host1                 BIPC  0/2/8          0.14     sol-sparc64
          7 0.55500 Sleeper    jobuser      r     07/23/2009 12:07:00     1
          9 0.55500 Sleeper    jobuser      r     07/23/2009 12:09:00     1
    ---------------------------------------------------------------------------------
    son.q@host1                    BIPC  0/7/8          0.14     sol-sparc64   S
            Slot-wise suspension because of subordination to queue "father.q@host1"
          1 0.55500 Sleeper    jobuser     r      07/23/2009 12:01:00     1
          2 0.55500 Sleeper    jobuser     r      07/23/2009 12:02:00     1
          3 0.55500 Sleeper    jobuser     r      07/23/2009 12:03:00     1
          4 0.55500 Sleeper    jobuser     r      07/23/2009 12:04:00     1
          5 0.55500 Sleeper    jobuser     r      07/23/2009 12:05:00     1
          6 0.55500 Sleeper    jobuser     r      07/23/2009 12:06:00     1
          8 0.55500 Sleeper    jobuser     S      07/23/2009 12:08:00     1



4 Risks
=======

    - Reserved resources
    - Interactive jobs
    - Parallel jobs
    - Jobs that request more than 1 core
