Tuning guide

Grid Engine is a full-function, general-purpose Distributed Resource Management (DRM) tool. The scheduler component in Grid Engine supports a wide range of different compute farm scenarios. To get the maximum performance from your compute environment it can be worthwhile to review which features are enabled and which are actually needed to solve your load management problem. Enabling or disabling these features can noticeably affect the throughput of your cluster. Where relevant, the version in which a feature was introduced is given in parentheses; if not otherwise stated, it is available in later versions as well.

  • overall cluster tuning

    Experience has shown that using NFS or similar shared file systems to distribute the files Grid Engine requires can account for a significant share of both overall network load and file server load. Keeping such files locally is therefore always at least slightly beneficial for overall cluster throughput, but it sacrifices the easier monitoring/debugging that shared file systems provide, which may not be a good trade-off in low-throughput cases. The HOWTO "Reducing and Eliminating NFS usage by Grid Engine" shows different common choices for accomplishing this.

  • scheduler monitoring

    Scheduler monitoring can be helpful to find out why certain jobs are not dispatched (displayed via qstat). However, providing this information for all jobs at any time can be resource-consuming (memory and CPU time) and is usually not needed. To disable scheduler monitoring set schedd_job_info to false in the scheduler configuration sched_conf(5).
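    As a sketch, the relevant entry in the scheduler configuration (edited with "qconf -msconf") would look like:

    ```
    # sched_conf(5): stop collecting per-job scheduling info
    # (the data shown by "qstat -j" for pending jobs)
    schedd_job_info                   false
    ```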

  • finished jobs

    In the case of array jobs the finished job list in qmaster can become quite large. Switching it off saves memory and speeds up qstat commands, because qstat also fetches the finished jobs list. Set finished_jobs to 0 in the global configuration. See sge_conf(5).
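    For example, in the global cluster configuration (edited with "qconf -mconf"):

    ```
    # sge_conf(5): keep no finished jobs in qmaster's memory
    # (each array task counts individually toward this list)
    finished_jobs                     0
    ```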

  • job verification

    Forcing validation at job submission time can be a valuable tool to prevent non-dispatchable jobs from remaining in the pending state forever. However, it can be time-consuming to validate jobs, especially in heterogeneous environments with a variety of different execution nodes and consumable resources, and where every user has their own job profile. In homogeneous environments with only a couple of different job types, general job validation can usually be omitted. Job verification is disabled by default and should only be enabled (qsub(1): -w [v|e|w]) when needed. [It is enabled by default with DRMAA.]
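    For the cases where validation is wanted, it is requested per job at submission time; a sketch (the script name is illustrative):

    ```
    # -w w: warn about jobs that can never be dispatched
    # -w e: reject such jobs with an error
    # -w n: explicitly disable verification (the qsub default)
    qsub -w e my_job.sh
    ```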

  • load thresholds and suspend thresholds

    Load thresholds are needed if you deliberately oversubscribe your machines, and you need a mechanism to prevent excessive system load. Suspend thresholds are also used for this. The other case in which load thresholds are needed is when the execution node is open for interactive load which is not under control of Grid Engine, and you want to prevent the node from being overloaded. If a compute farm is more single-purpose, e.g., each CPU at a compute node is represented by only one queue slot, and no interactive load is expected at these nodes, then load_thresholds can be omitted. To disable both thresholds set load_thresholds to none and suspend_thresholds to none. See queue_conf(5).
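    A sketch of the corresponding queue configuration entries (edited with "qconf -mq <queue_name>"; the value NONE follows queue_conf(5) conventions):

    ```
    # queue_conf(5): disable both threshold mechanisms for this queue
    load_thresholds       NONE
    suspend_thresholds    NONE
    ```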

    load_thresholds are applicable to consumable resources as well (see queue_conf(5)). Using this feature will have a negative impact on scheduler performance.
  • load adjustments

    Load adjustments are used to virtually increase the measured load after a job has been dispatched. This mechanism is helpful for oversubscribed machines in order to align with load thresholds. Load adjustments should be switched off if they are not needed, because they impose additional work on the scheduler when sorting hosts and verifying load thresholds. To disable load adjustments set job_load_adjustments to none and load_adjustment_decay_time to 0 in the scheduler configuration. See sched_conf(5).
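    As a sketch, the two entries in the scheduler configuration ("qconf -msconf") would read:

    ```
    # sched_conf(5): switch off artificial load adjustments
    job_load_adjustments          NONE
    load_adjustment_decay_time    0
    ```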

  • scheduling-on-demand

    The default for Grid Engine is to start scheduling runs at a fixed scheduling interval (see schedule_interval in sched_conf(5)). The good thing about fixed intervals is that they limit the CPU time consumption of the qmaster/scheduler. The bad thing is that they throttle the scheduler artificially, resulting in limited throughput. In many compute farms there are machines specifically dedicated to qmaster/scheduler, and in such setups there is no reason to throttle the scheduler. How many seconds one should use for flush times is difficult to say: it depends on the time the scheduler needs for a single run and the number of jobs in the system. A couple of test runs with scheduler profiling (add profile=1 to the params entry in sched_conf(5)) should give one enough data to select a good value.

    Scheduling-on-demand can be configured using the FLUSH_SUBMIT_SEC and FLUSH_FINISH_SEC settings in sched_conf(5). If it is activated, the throughput of a compute farm is limited only by the power of the machine hosting qmaster/scheduler.
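    A sketch of such a configuration; note that depending on the Grid Engine version these appear either as params entries (uppercase, as below) or as the lowercase flush_submit_sec/flush_finish_sec attributes of sched_conf(5). The value 1 is only a starting point to be tuned with profiling output:

    ```
    # sched_conf(5): trigger a scheduling run shortly after each
    # job submission and each job completion (values in seconds)
    params   FLUSH_SUBMIT_SEC=1,FLUSH_FINISH_SEC=1,profile=1
    ```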

  • scheduler priority information

    After every scheduling interval, the scheduler sends the calculated priority information (tickets, job priority, urgency) to the qmaster. This information is used to order the job output in "qstat -ext", "-urg", and "-pri". The transfer of the information can be turned off by setting report_pjob_tickets to false in sched_conf(5).
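    As a sketch, the scheduler configuration entry ("qconf -msconf") would be:

    ```
    # sched_conf(5): do not send per-job ticket/priority data to qmaster;
    # "qstat -ext", "-urg" and "-pri" will then show stale or no values
    report_pjob_tickets               false
    ```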
  • policies
    The scheduler contains different policy modules (see sge_priority(5)) to compute the importance of a job:
    • ticket policy
    • urgency policy
    • POSIX priority policy
    • deadline policy
    • waiting time policy
    All policies are turned on by default. If one or two of them are not used, it is preferable to turn the policy off by setting its weighting factor to 0 in sched_conf(5).
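    For example, if neither the deadline nor the waiting time policy is used, their weighting factors in sched_conf(5) can be zeroed (a sketch; leave the factors of the policies you actually use, such as weight_ticket, weight_urgency and weight_priority, unchanged):

    ```
    # sched_conf(5): disable two unused policies by zeroing their weights
    weight_deadline                   0.000000
    weight_waiting_time               0.000000
    ```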
  • resource reservation

    Resource reservation prevents the starvation of jobs with high resource requests. The configuration of the scheduler allows one to enable/disable this feature as well as limit the number of jobs which will get a reservation. Turning off this feature, by setting max_reservation to 0 in sched_conf(5), will have a positive impact on the scheduler run time.
    If resource reservation is needed, the number of jobs which will get a reservation from the scheduler should be as small as possible. This is done by setting a small number for max_reservation in sched_conf(5).
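    Both variants are controlled by a single sched_conf(5) entry; a sketch:

    ```
    # sched_conf(5): disable resource reservation entirely ...
    max_reservation                   0
    # ... or, if reservation is needed, allow it for only a few jobs
    # (jobs request a reservation explicitly with "qsub -R y"):
    # max_reservation                 8
    ```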
  • optimization of qmaster memory consumption

    In clusters with large numbers of jobs a limiting factor is often the memory footprint required to store all job properties. Experience shows that large parts of the memory occupied by the qmaster are used to store each job's environment, as specified via "-v variable_list" or "-V". End users sometimes find it convenient to simply use "-V", even though inheriting a handful of specific environment variables from the submission environment would have been entirely sufficient. Conscious and sparing use of job environment variables has been shown to greatly increase the maximum number of jobs that can be processed by Grid Engine with a given amount of main memory.
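    For example (script and variable names are illustrative):

    ```
    # heavyweight: copies the entire submission environment into the job
    qsub -V my_job.sh

    # lightweight: inherit only the variables the job actually needs
    qsub -v PATH,MY_APP_HOME my_job.sh
    ```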
  • use "-b y" to unburden qmaster

    By default Grid Engine qsub job submission sends the job scripts together with the job itself. The -b y option can be used to prevent job scripts from being sent, instead simply sending the path to the executable along with the job. This technique requires that the script be made available elsewhere, but in many cases the script is already available or could easily be made available by means of shared file systems. Use of -b y has a beneficial impact on cluster scalability because job scripts do not need to be stored on disk by the qmaster at submission time or be packed with the job when it is delivered to the execd.
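    A sketch (the path is illustrative and must be valid on the execution host, e.g. via a shared file system):

    ```
    # submit the executable by path instead of spooling the script
    # through qmaster
    qsub -b y /share/apps/bin/my_job.sh
    ```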
  • job filter based on job classes
    The job filter can be enabled by adding JC_FILTER=1 to the params field in sched_conf(5). This feature is deprecated and, if enabled, can lead to minor problems in the system.

    If enabled, the scheduler limits the number of jobs it looks at during a scheduling run. At the beginning of the scheduling run it assigns each job a specific category based on the job's requests, priority settings, and the job owner. All scheduling policies assign the same importance to each job in a category; the jobs within a category are therefore handled in FIFO order, and the number considered can be limited to the number of free slots in the system.
    An exception is jobs which request a resource reservation. They are included regardless of the number of jobs in a category.
    This setting is turned off by default, because in very rare cases the scheduler can make a wrong decision. It is also advisable to turn report_pjob_tickets off when this feature is used; otherwise "qstat -ext" can report outdated ticket amounts. The information shown by "qstat -j" for a job that was excluded from a scheduling run is very limited.

Scheduler profiles, such as those used during Grid Engine installation, can be stored using "qconf -ssconf >file". The profiles are not stored internally. By dynamically changing the scheduler configuration, loading a new profile with "qconf -Msconf <file>" from a cron job, one can switch to a leaner configuration overnight and return to a user-friendly configuration during the day.
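A minimal sketch of such a setup (file names and switch times are illustrative):

```
# save the current, user-friendly scheduler profile once:
qconf -ssconf > /sge/profiles/day_profile

# crontab entries switching profiles at 20:00 and 06:00:
0 20 * * *  qconf -Msconf /sge/profiles/night_profile
0 6  * * *  qconf -Msconf /sge/profiles/day_profile
```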