Adapted from Andreas Haas' description in the original issue tracker.

It is possible to simulate large numbers of execution hosts, for instance to test scheduler performance.

  1. Install a new SGE cluster in the appropriate configuration, but install only the qmaster.

  2. After successful installation, use qconf -mconf to set


    In the qmaster_params section of sge_conf(5), which suppresses ‘unknown’ queue states.

  3. Ensure the configured queues use load_thresholds none. The simulator has no means of simulating load values, so they will always be missing, and any load threshold would put the queue into a load alarm state that prevents the scheduler from dispatching jobs to it.

  4. Use qconf -ae or -Ae to create an arbitrary number of simulated execution hosts. The hosts need not exist, as the qmaster won’t try to send anything to them, but their host names must be resolvable.

  5. (Optionally) if you care about scheduler runtimes set


    in the params section of sched_conf(5) using qconf -msconf.
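
Step 4 lends itself to scripting. The following is a minimal sketch rather than a tested recipe: the host name prefix simhost, the 10.99.x.y address range and the function names are invented for illustration, and the host definition fields follow the layout printed by qconf -se, which may differ between Grid Engine versions.

```shell
#!/bin/sh
# Sketch: generate definitions for simulated execution hosts.
# Assumptions (not from the original description): the "simhost" prefix,
# the 10.99.x.y address range and the helper names are arbitrary choices.

# Print /etc/hosts-style lines so that the simulated host names resolve.
# Usage: gen_hosts_lines COUNT
gen_hosts_lines() {
    i=1
    while [ "$i" -le "$1" ]; do
        # Map host i to 10.99.<i/250>.<i%250+1> so every octet stays valid.
        printf '10.99.%d.%d simhost%04d\n' $((i / 250)) $((i % 250 + 1)) "$i"
        i=$((i + 1))
    done
}

# Write one host definition file per host, intended for "qconf -Ae FILE".
# Usage: gen_exec_host_files COUNT DIR
gen_exec_host_files() {
    mkdir -p "$2"
    i=1
    while [ "$i" -le "$1" ]; do
        name=$(printf 'simhost%04d' "$i")
        cat > "$2/$name" <<EOF
hostname              $name
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
        i=$((i + 1))
    done
}

# Possible use on the qmaster host (review the output before applying it):
#   gen_hosts_lines 100 >> /etc/hosts
#   gen_exec_host_files 100 ./simhosts
#   for f in ./simhosts/*; do qconf -Ae "$f"; done
```

Generating files first and loading them with qconf -Ae keeps the step non-interactive, whereas qconf -ae opens an editor for each host.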

Now your simulated cluster is ready and you can submit arbitrary numbers of jobs. The scheduler will dispatch them and send the corresponding orders to the qmaster. The qmaster will behave as if it were starting the jobs, but instead it sets timers to ensure that job state transitions occur as usual.

qdel will work. Things that won’t work include interactive jobs (qrsh etc.), tightly integrated parallel jobs (i.e. those with control_slaves set to true in sge_pe(5)), and suspension (qmod). A job's runtime can be controlled via its first argument. That means that when

$ qsub -b y /bin/sleep 5

is submitted, the job will finish after five seconds.
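
To load the scheduler, many such jobs can be submitted in a loop. This is only a sketch: the function name and the QSUB override are hypothetical helpers introduced here so that the loop can be dry-run without a cluster, and are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch: submit COUNT simulated jobs that each "run" for RUNTIME seconds.
# QSUB is a hypothetical override (not a Grid Engine variable); setting it
# to "echo qsub" prints the commands instead of submitting anything.
# Usage: submit_sim_jobs COUNT RUNTIME
submit_sim_jobs() {
    count=$1
    runtime=$2
    n=1
    while [ "$n" -le "$count" ]; do
        # The first job argument (RUNTIME) sets the simulated runtime.
        # QSUB is deliberately unquoted so "echo qsub" splits into two words.
        ${QSUB:-qsub} -b y /bin/sleep "$runtime"
        n=$((n + 1))
    done
}

# Dry run:   QSUB='echo qsub'; submit_sim_jobs 3 5
# Real run:  unset QSUB; submit_sim_jobs 1000 5
```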

To summarize the mechanism:

  1. Queue instance state ‘unknown’ is generally suppressed so that the scheduler can assign jobs to any queue instance where load_thresholds is set to none.

  2. Job delivery to execds is bypassed. Instead, a timer causes jobs to go from the ‘transferring’ state into ‘running’ after three seconds.

  3. Once a job is in the ‘running’ state, another timer is set to simulate job finish after n seconds, where n is taken from the first job argument.
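
Taken together, the two timers give a rough estimate of how long a simulated job stays in the system after it is dispatched. The sketch below assumes that the timers run strictly one after the other and fire exactly on time, which the description above does not guarantee; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: approximate seconds from dispatch to simulated job finish.
# The 3-second transfer delay comes from the mechanism described above;
# treating both timers as exact and sequential is an assumption.
TRANSFER_DELAY=3

# Usage: sim_job_duration N   (N is the first job argument, e.g. 5)
sim_job_duration() {
    echo $((TRANSFER_DELAY + $1))
}

sim_job_duration 5   # prints 8
```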