sched_conf - Grid Engine default scheduler configuration file


       sched_conf defines the configuration file format for Grid Engine's
       scheduler.  In order to modify the configuration, use the graphical
       user's interface qmon(1) or the -msconf option of the qconf(1) command.
       A default configuration is provided with the Grid Engine distribution.

       Note, Grid Engine allows backslashes (\) to be used to escape newline
       characters. The backslash and the newline are replaced with a space
       character before any interpretation.
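
       The continuation rule above can be sketched in a few lines of Python.
       This is only an illustration of the described replacement, not the
       parser Grid Engine uses; the function name is hypothetical.

```python
def join_continuation_lines(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# A configuration value split across two physical lines.
raw = "job_load_adjustments \\\nnp_load_avg=0.50\n"
print(join_continuation_lines(raw))
```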


       The following parameters are recognized by the Grid Engine scheduler if
       present in sched_conf:

       Note: Deprecated, may be removed in future release.
       Allows for the selection of alternative scheduling algorithms.

       Currently default is the only allowed setting.

       A simple algebraic expression used to derive a single weighted load
       value from all or part of the load parameters reported by sge_execd(8)
       for each host and from all or part of the consumable resources (see
       complex(5)) being maintained for each host.  The load formula
       expression syntax is that of a sum of weighted load values, that is:


       Note, no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1, ...)  are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor parameter
       in sge_conf(5) for details) and consumable resources available for all
       hosts (see complex(5)) may be used as well as Grid Engine default load
       parameters.
       The weighting factors (w1, ...) are positive integers. After the
       expression is evaluated for each host the results are assigned to the
       hosts and are used to sort the hosts according to the weighted
       load. The sorted host list is subsequently used to sort queues.
       The default load formula is np_load_avg.
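
       As a rough illustration of how such a weighted sum could be evaluated
       and used to rank hosts, consider the following Python sketch. It
       assumes terms of the form name or name*weight joined by + or -, and it
       assumes that hosts with a lower result are preferred. The host names
       and the consumable "licenses" are hypothetical, and the function is not
       part of Grid Engine.

```python
import re


def evaluate_load_formula(formula, values):
    """Evaluate a sum of optionally weighted load values for one host."""
    if " " in formula:
        raise ValueError("no blanks are allowed in the load formula")
    total = 0.0
    # Split the formula into terms, each optionally preceded by '+' or '-'.
    for sign, term in re.findall(r"([+-]?)([^+-]+)", formula):
        name, _, weight = term.partition("*")
        factor = int(weight) if weight else 1  # weights are positive integers
        contribution = values[name] * factor
        total += -contribution if sign == "-" else contribution
    return total


# Hypothetical per-host load values and consumables.
hosts = {
    "node1": {"np_load_avg": 0.5, "licenses": 3},
    "node2": {"np_load_avg": 1.0, "licenses": 0},
}
formula = "np_load_avg*2+licenses"
# Assumption: hosts are ranked in ascending order of the weighted load.
ranked = sorted(hosts, key=lambda h: evaluate_load_formula(formula, hosts[h]))
print(ranked)
```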

       The load which is imposed by the Grid Engine jobs running on a system
       varies in time, and often, e.g. for the CPU load, requires some amount
       of time to be reported in the appropriate quantity by the operating
       system. Consequently, if a job was started very recently, the reported
       load may not provide a sufficient representation of the load which is
       already imposed on that host by the job. The reported load will adapt
       to the real load over time, but the period of time in which the
       reported load is too low may already lead to an oversubscription of
       that host. Grid Engine allows the administrator to specify
       job_load_adjustments which are used in the Grid Engine scheduler to
       compensate for this problem.
       The job_load_adjustments are specified as a comma-separated list of
       arbitrary load parameters or consumable resources and (separated by an
       equal sign) an associated load correction value. Whenever a job is
       dispatched to a host by the scheduler, the load parameter and
       consumable value set of that host is increased by the values provided
       in the job_load_adjustments list. These correction values are decayed
       linearly over time until, after load_adjustment_decay_time from the
       start, the corrections reach the value 0.  If the job_load_adjustments
       list is assigned the special value NONE, no load corrections are
       performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the hosts with the load_formula (see
       above) and to compare the load and consumable values against the load
       threshold lists defined in the queue configurations (see
       queue_conf(5)).  If the load_formula consists simply of the default CPU
       load average parameter np_load_avg, and if the jobs are very compute
       intensive, one might want to set the job_load_adjustments list to
       np_load_avg=1.00, which means that every new job dispatched to a host
       will require 100% CPU time, and thus the machine's load is instantly
       increased by 1.00.

       The load corrections in the "job_load_adjustments" list above are
       decayed linearly over time from the point of the job start, where the
       corresponding load or consumable parameter is raised by the full
       correction value, until after a time period of
       "load_adjustment_decay_time" the correction becomes 0. Proper values
       for "load_adjustment_decay_time" greatly depend upon the load or
       consumable parameters used and the specific operating system(s).
       Therefore, they can only be determined on-site and experimentally.  For
       the default np_load_avg load parameter a "load_adjustment_decay_time"
       of 7 minutes has proven to yield reasonable results.
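
       The linear decay described above can be written as a small Python
       function. This is a sketch of the arithmetic only; the function name
       and the use of seconds as the time unit are assumptions made for the
       example.

```python
def load_correction(correction, elapsed, decay_time):
    """Remaining load correction `elapsed` seconds after job start.

    The full correction applies at job start and falls linearly to 0
    once `decay_time` seconds have passed.
    """
    elapsed = max(elapsed, 0)
    if decay_time <= 0 or elapsed >= decay_time:
        return 0.0
    return correction * (1.0 - elapsed / decay_time)


# np_load_avg=1.00 with a decay time of 7 minutes (420 seconds).
reported_load = 0.25
for seconds in (0, 210, 420):
    adjusted = reported_load + load_correction(1.0, seconds, 420)
    print(seconds, adjusted)
```

       Halfway through the decay time the host is still charged half of the
       correction on top of its reported load.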

       The maximum number of jobs any user may have running in a Grid Engine
       cluster at the same time. If set to 0 (default) the users may run an
       arbitrary number of jobs.

       At the time the scheduler thread initially registers with the event
       master thread in the sge_qmaster(8) process, schedule_interval is used
       to set the time interval in which the event master thread sends
       scheduling event updates to the scheduler thread.  A scheduling event
       is a status change that has occurred within sge_qmaster(8) which may
       trigger or affect scheduler decisions (e.g. a job has finished and thus
       the allocated resources are available again).
       In the Grid Engine default scheduler the arrival of a scheduling event
       report triggers a scheduler run. Otherwise, the scheduler waits for
       event reports.
       schedule_interval is a time value (see sge_types(5) for a definition of
       the syntax of time values).  Setting it to 0 disables scheduling.

       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue instance list which determines
       the preferred order for scheduling tasks to them (typically determining
       the order in which hosts are used).  Currently, two settings are valid:
       seqno and load. However, in both cases, Grid Engine attempts to maximize
       the number of soft requests (see qsub(1) -soft option) being fulfilled
       by the queues for a particular job as the primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid Engine
       will use the seq_no parameter as configured in the current queue
       configurations (see queue_conf(5)) as the next criterion to sort the
       queue list. The load_formula (see above) is only used as the next
       criterion if two queues have equal sequence numbers.  If
       queue_sort_method is set to load, the load according to the
       load_formula is the criterion after maximizing a job's soft requests,
       and the sequence number is only used if two hosts have the same load.
       The sequence
       number sorting is most useful if you want to define a fixed order in
       which queues are to be filled (e.g. the cheapest resource first).

       The default for this parameter is load.
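
       The two orderings can be expressed as composite sort keys. The Python
       sketch below is illustrative only: the QueueInstance record and its
       field names are hypothetical, and it assumes that lower sequence
       numbers and lower loads are preferred.

```python
from dataclasses import dataclass


@dataclass
class QueueInstance:
    name: str
    soft_matches: int  # number of soft requests this queue fulfils
    seq_no: int        # seq_no from the queue configuration
    load: float        # result of the load_formula for the queue's host


def sort_queue_instances(queues, queue_sort_method="load"):
    """Order queue instances; soft requests are always the first criterion."""
    if queue_sort_method == "seqno":
        key = lambda q: (-q.soft_matches, q.seq_no, q.load)
    elif queue_sort_method == "load":
        key = lambda q: (-q.soft_matches, q.load, q.seq_no)
    else:
        raise ValueError("valid settings are seqno and load")
    return sorted(queues, key=key)


queues = [
    QueueInstance("a", soft_matches=1, seq_no=20, load=0.2),
    QueueInstance("b", soft_matches=1, seq_no=10, load=0.9),
    QueueInstance("c", soft_matches=2, seq_no=30, load=1.5),
]
print([q.name for q in sort_queue_instances(queues, "seqno")])
print([q.name for q in sort_queue_instances(queues, "load")])
```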

       When executing under a share based policy, the scheduler "ages" (i.e.
       decreases) usage to implement a sliding window for achieving the share
       entitlements as defined by the share tree. The halftime defines the
       time interval in which accumulated usage will have been decayed to half
       its value at the start of the interval.  (This is a radioactive-type
       exponential decay, where the parameter is usually called "half-life".)
       Valid values are specified in hours, default 168.
       If the value is set to 0, the usage is not decayed.
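
       The half-life decay can be sketched as follows. The function name is
       hypothetical; the formula is the standard exponential half-life decay
       that the text refers to.

```python
def decayed_usage(usage, elapsed_hours, halftime_hours=168):
    """Usage remaining after `elapsed_hours`, halving every `halftime_hours`."""
    if halftime_hours == 0:
        return usage  # a halftime of 0 means usage is not decayed
    return usage * 0.5 ** (elapsed_hours / halftime_hours)


# With the default halftime of 168 hours, usage halves every week.
for hours in (0, 168, 336):
    print(hours, decayed_usage(100.0, hours))
```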

       Grid Engine accounts for the consumption of the resources CPU-time,
       memory and IO to determine the usage which is imposed on a system by a
       job. A single usage value is computed from these three input parameters
       by multiplying the individual values by weights and adding them up. The
       weights are defined in the usage_weight_list. The format of the list is


       where wcpu, wmem and wio are the configurable weights. The weights are
       real numbers. The sum of all three weights should be 1.  The default is

       Determines how fast Grid Engine should compensate for past usage below
       or above the share entitlement defined in the share tree. Recommended
       values are between 2 and 10, where 10 means faster compensation.  The
       default is 5.

       The relative importance of the user shares in the functional policy.
       Values are of type real.

       The relative importance of the project shares in the functional policy.
       Values are of type real.

       The relative importance of the department shares in the functional
       policy. Values are of type real.

       The relative importance of the job shares in the functional policy.
       Values are of type real.

       The maximum number of functional tickets available for distribution by
       Grid Engine. Determines the relative importance of the functional
       policy.  See under sge_priority(5) for an overview on job priorities.

       The maximum number of share based tickets available for distribution by
       Grid Engine. Determines the relative importance of the share tree
       policy. See under sge_priority(5) for an overview on job priorities.

       The weight applied on the remaining time until a job's latest start
       time. Determines the relative importance of the deadline. See under
       sge_priority(5) for an overview on job priorities.

       The weight applied on the job's waiting time since submission.
       Determines the relative importance of the waiting time.  See under
       sge_priority(5) for an overview on job priorities.

       The weight applied on jobs' normalized urgency when determining the
       priority finally used.  Determines the relative importance of urgency.
       See under sge_priority(5) for an overview on job priorities.

       The weight applied on jobs' normalized POSIX priority when determining
       the priority finally used. Determines the relative importance of POSIX
       priority.  See under sge_priority(5) for an overview on job priorities.

       The weight applied on the normalized ticket amount when determining the
       priority finally used.  Determines the relative importance of the
       ticket policies. See under sge_priority(5) for an overview on job
       priorities.
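
       How the three weights on normalized urgency, POSIX priority and ticket
       amount might combine into one value can be sketched as a weighted sum.
       That the terms are simply added is an assumption made for this example
       (sge_priority(5) is the authoritative description), and the parameter
       names and numbers below are purely illustrative.

```python
def combined_priority(n_urgency, n_posix_priority, n_tickets,
                      weight_urgency, weight_priority, weight_ticket):
    """Weighted sum of the three normalized inputs (assumed form)."""
    return (weight_urgency * n_urgency
            + weight_priority * n_posix_priority
            + weight_ticket * n_tickets)


# Hypothetical normalized values in [0, 1] and hypothetical weights.
print(combined_priority(0.5, 0.5, 1.0,
                        weight_urgency=0.5,
                        weight_priority=1.0,
                        weight_ticket=0.25))
```

       Raising one weight relative to the others makes the corresponding
       input dominate the resulting order of pending jobs.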

       This parameter is provided for tuning the system's scheduling behavior.
       By default, a scheduler run is triggered in the scheduler interval.
       When this parameter is set to 1 or larger, the scheduler will be
       triggered that number of seconds after a job has finished. Setting this
       parameter to 0 disables the flush after a job has finished.

       This parameter is provided for tuning the system's scheduling behavior.
       By default, a scheduler run is triggered in the scheduler interval.
       When this parameter is set to 1 or larger, the scheduler will be
       triggered that number of seconds after a job was submitted to the
       system. Setting this parameter to 0 disables the flush after a job was
       submitted.

       The default scheduler can keep track of why jobs could not be scheduled
       during the last scheduler run. This parameter enables or disables the
       observation.  The value true enables the monitoring; false turns it off.

       It is also possible to activate the observation only for certain jobs.
       This will be done if the parameter is set to job_list followed by a
       comma-separated list of job ids.

       The user can obtain the collected information with the command qstat

       This is for passing additional parameters to the Grid Engine scheduler.
       The following values are recognized:

              If set, overrides the default value of 60 seconds.  This
              parameter is used by the Grid Engine scheduler when planning
              resource utilization as the delta between net job runtimes and
              total time until resources become available again. Net job
              runtime as specified with -l h_rt=...  or -l s_rt=...  or
              default_duration always differs from total job runtime due to
              delays before and after actual job start and finish. Among the
              delays before job start is the time until the end of a
              schedule_interval, the time it takes to deliver a job to
              sge_execd(8), and the delays caused by prolog in queue_conf(5),
              start_proc_args in sge_pe(5) and starter_method in
              queue_conf(5).  The delays after job finish include those due to
              a forced job termination (notify, terminate_method or
              checkpointing), procedures run after actual job finish, such as
              stop_proc_args in sge_pe(5) or epilog in queue_conf(5), and the
              delay until a new schedule_interval.
              If the offset is too low, resource reservations (see
              max_reservation) can be delayed repeatedly due to an overly
              optimistic job circulation time.
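
              The role of the offset in resource planning can be sketched
              as simple arithmetic. The function and argument names are
              hypothetical; only the default of 60 seconds comes from the
              text above.

```python
def planned_release_time(start_time, net_runtime, offset=60):
    """Time at which the scheduler may assume resources are free again.

    `net_runtime` is the job's net runtime in seconds (for example from
    h_rt); `offset` covers the delays before job start and after finish.
    """
    return start_time + net_runtime + offset


# A one-hour job planned to start at t=1000 seconds.
print(planned_release_time(1000, 3600))
```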

              Note: Deprecated, may be removed in future release.
              If set to true, the scheduler limits the number of jobs it looks
              at during a scheduling run. At the beginning of the scheduling
              run it assigns each job a specific category, which is based on
              the job's requests, priority settings, and the job owner. All
              scheduling policies will assign the same importance to each job
              in one category. Therefore the jobs within a category have a
              FIFO order, and their number can be limited to the number of free
              slots in the system.

              An exception is jobs which request a resource reservation. They
              are included regardless of the number of jobs in a category.

              This setting is turned off by default, because in very rare
              cases, the scheduler can make a wrong decision. It is also
              advised to turn report_pjob_tickets off.  Otherwise qstat -ext
              can report outdated ticket amounts. The information shown with a
              qstat -j for a job that was excluded in a scheduling run is very
              limited.

              If set equal to 1, the scheduler logs profiling information
              summarizing each scheduling run.

              If set equal to 1, the scheduler records information for each
              scheduling run, enabling reproduction of job resource
              utilization in the file <sge_root>/<cell>/common/schedule.

              This parameter sets the algorithm for the PE range computation.
              The default is automatic, which means that the scheduler will
              select the best one, and it should not be necessary to change it
              to a different setting in normal operation. If a custom setting
              is needed, the following values are available:
              auto: the scheduler selects the best algorithm
              least: starts the resource matching with the lowest slot amount
              first
              bin: starts the resource matching in the middle of the PE slot
              range
              highest: starts the resource matching with the highest slot
              amount first

       Changing params will take immediate effect.  The default for params is

       Interval (HH:MM:SS) to reprioritize jobs on the execution hosts based
       on the current ticket amount for the running jobs. If the interval is
       set to 00:00:00 the reprioritization is turned off. The default value
       is 00:00:00.  The reprioritization tickets are calculated by the
       scheduler and update events for running jobs are only sent after the
       scheduler calculated new values. How often the scheduler should
       calculate the tickets is defined by the reprioritize_interval.  Because
       the scheduler is only triggered in a specific interval
       (schedule_interval), the reprioritize_interval only has a
       meaning if set greater than the schedule_interval.  For example, if
       the schedule_interval is 2 minutes and reprioritize_interval is set to
       10 seconds, the jobs get re-prioritized every 2 minutes.
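
       The interaction between the two intervals can be sketched as follows.
       The helper names are hypothetical, and treating the larger of the two
       values as the effective period is a simplification: it should be read
       as a lower bound, since tickets are only recalculated during scheduler
       runs.

```python
def parse_hms(value):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def effective_reprioritize_seconds(reprioritize_interval, schedule_interval):
    """Approximate period between reprioritizations, or None if disabled."""
    reprioritize = parse_hms(reprioritize_interval)
    if reprioritize == 0:
        return None  # 00:00:00 turns reprioritization off
    return max(reprioritize, parse_hms(schedule_interval))


# The example from the text: a 10 second setting with a 2 minute interval.
print(effective_reprioritize_seconds("00:00:10", "00:02:00"))
```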

       This parameter allows tuning the system's scheduling run time. It is
       used to enable/disable the reporting of pending job tickets to the
       qmaster.  It does not influence the tickets calculation. The sort order
       of jobs in qstat and qmon is only based on the submit time when the
       reporting is turned off.
       The reporting should be turned off in a system with a very large number
       of jobs by setting this parameter to "false".

       The halflife_decay_list allows configuring different decay rates for
       the finished_jobs usage types, which are used in the pending job ticket
       calculation to account for jobs which have just ended. This allows the
       pending jobs algorithm to count finished jobs against a user
       or project for a configurable decayed time period. This feature is
       turned off by default, and the halftime is used instead.
       The halflife_decay_list also allows one to configure different decay
       rates for each usage type being tracked (cpu, io, and mem). The list is
       specified in the following format:


       usage_type can be one of cpu, io, or mem.  time can be -1, 0 or a
       timespan specified in minutes. If time is -1, only the usage of
       currently running jobs is used. 0 means that the usage is not decayed.

       This parameter sets up a dependency chain of ticket-based policies.
       Each ticket-based policy in the dependency chain is influenced by the
       previous policies and influences the following policies. A typical
       scenario is to assign precedence for the override policy over the
       share-based policy. The override policy determines in such a case how
       share-based tickets are assigned among jobs of the same user or
       project.  Note that all policies contribute to the ticket amount
       assigned to a particular job regardless of the policy hierarchy
       definition. Yet the tickets calculated in each of the policies can be
       different, depending on "POLICY_HIERARCHY".

       The "POLICY_HIERARCHY" parameter can be a combination of up to 3 of
       the first letters of the 3 ticket-based policies S(hare-based),
       F(unctional) and O(verride). So a value "OFS" means that the override
       policy takes precedence over the functional policy, which finally
       influences the share-based policy.  Less than 3 letters means that some
       of the policies do not influence other policies and also are not
       influenced by other policies. So a value of "FS" means that the
       functional policy influences the share-based policy and that there is
       no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.
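
       A value of this parameter can be checked and expanded into its
       dependency chain with a short Python sketch. The function name is
       hypothetical, and rejecting repeated letters is an assumption that the
       text above does not state explicitly.

```python
POLICY_LETTERS = {"S": "share-based", "F": "functional", "O": "override"}


def parse_policy_hierarchy(value):
    """Return the policies in order of precedence; [] for NONE."""
    if value == "NONE":
        return []
    if not 1 <= len(value) <= 3:
        raise ValueError("expected NONE or up to 3 letters")
    if len(set(value)) != len(value):
        raise ValueError("a policy letter may appear only once")  # assumption
    for letter in value:
        if letter not in POLICY_LETTERS:
            raise ValueError(f"unknown policy letter: {letter}")
    return [POLICY_LETTERS[letter] for letter in value]


print(parse_policy_hierarchy("OFS"))
print(parse_policy_hierarchy("FS"))
```

       Each policy in the returned list influences the ones after it, so for
       "FS" the override policy is simply absent from the chain.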

       If set to "true" or "1", override tickets of any override object
       instance are shared equally among all running jobs associated with the
       object. Pending jobs get as many override tickets as they
       would have if they were running. If set to "false" or "0", each job
       gets the full value of the override tickets associated with the object.
       The default value is "true".
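
       The effect of the two settings on a single job can be sketched as
       follows. The function and argument names are hypothetical and the
       ticket numbers are made up for the example.

```python
def override_tickets_per_job(object_tickets, job_count, shared=True):
    """Override tickets one job receives from an override object.

    With `shared` true the object's tickets are split equally among its
    jobs; otherwise every job receives the full amount.
    """
    if job_count <= 0:
        return 0.0
    if shared:
        return object_tickets / job_count
    return float(object_tickets)


# An object holding 1000 override tickets with 4 associated jobs.
print(override_tickets_per_job(1000, 4, shared=True))
print(override_tickets_per_job(1000, 4, shared=False))
```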

       If set to "true" or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the object. If
       set to "false" or "0", each job associated with a functional object
       gets the full functional shares of that object. The default value is

       The maximum number of pending jobs to schedule in the functional
       policy.  The default value is 200.

       The maximum number of subtasks per pending array job to schedule. This
       parameter exists in order to reduce scheduling overhead. The default
       value is 50.

       The maximum number of reservations scheduled within a schedule
       interval.

       When a runnable job cannot be started due to a shortage of resources, a
       reservation can be scheduled instead. A reservation can cover
       consumable resources with the global host, any execution host, and any
       queue. For parallel jobs reservations are done also for the slots
       resource as specified in sge_pe(5).  The top max_reservation jobs (in
       priority order) are considered, not individual resources.  The job
       runtime assumed is the maximum of the time specified with -l h_rt=...
       or -l s_rt=...  For jobs that have neither of them, the
       default_duration (see below) is assumed.

       Reservations prevent jobs of lower priority as specified in
       sge_priority(5) from utilizing the reserved resource quota during the
       time of reservation.  Jobs of lower priority are allowed to utilize
       those reserved resources only if their prospective job end is before
       the start of the reservation ("backfilling").  Reservation is done only
       for non-immediate jobs (-now no) that request reservation (-R y). If
       max_reservation is set to "0" no job reservation is done.

       max_reservation actually has a more general effect on scheduler look-
       ahead, and it is necessary to turn it on for correct backfilling into
       calendar windows (see calendar_conf(5)).

       Note that reservation scheduling can be costly in performance and hence
       is switched off by default. Since this cost is known to grow with the
       number of pending jobs, the -R y option is recommended only for those
       jobs actually queuing for bottleneck resources.  Together with the
       max_reservation parameter, this technique can be used to narrow down
       performance impacts.  A JSV can be used to add reservation requests for
       particular resources, such as large parallel jobs.

       When job reservation is enabled through the max_reservation
       sched_conf(5) parameter, the default_duration is assumed as runtime for
       jobs that have neither -l h_rt=...  nor -l s_rt=...  specified. In
       contrast to an h_rt/s_rt time limit, the default_duration is not
       enforced.  The default value is INFINITY, and reservation is not
       effective for jobs which get that value, i.e. the value must be finite,
       or jobs must specify a run time.
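
       The runtime assumption and the backfilling condition can be sketched
       together in Python. The function names are hypothetical, times are in
       seconds, and treating a job that ends exactly at the reservation start
       as acceptable is an assumption of this example.

```python
import math


def assumed_runtime(h_rt=None, s_rt=None, default_duration=math.inf):
    """Runtime the scheduler assumes for a job, in seconds."""
    limits = [limit for limit in (h_rt, s_rt) if limit is not None]
    # The maximum of the requested limits, else the default_duration.
    return max(limits) if limits else default_duration


def can_backfill(now, runtime, reservation_start):
    """True if the job is expected to end by the start of the reservation."""
    return now + runtime <= reservation_start


# A job with h_rt=600 fits before a reservation starting at t=900.
print(can_backfill(0, assumed_runtime(h_rt=600), 900))
# A job without limits gets the INFINITY default and can never backfill.
print(can_backfill(0, assumed_runtime(), 900))
```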


                  scheduler thread configuration


       sge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), complex(5),
       queue_conf(5), sge_execd(8), sge_qmaster(8)


       See sge_intro(1) for a full statement of rights and permissions.

SGE 8.1.3pre                      2011-05-17                     SCHED_CONF(5)
