[GE users] Job suspension methods in Linux
reuti at staff.uni-marburg.de
Tue Feb 10 22:06:01 GMT 2009
On 10.02.2009 at 16:31, leeping wrote:
> Hi there,
> I have a Beowulf cluster running Grid Engine 6.1u3. I'm trying to
> implement a system where users have a "soft" limit, above which their
> jobs may be automatically suspended if other users need the slots. I
> have two questions:
> 1) How could I implement job suspension? Do I need to enter something
> for "Suspend Method" and "Resume Method"? I would imagine the command
> "kill -STOP" should do the trick, since I am running a Linux system.
This is already the default in SGE: it sends SIGSTOP/SIGCONT to the
complete process group of the job. So you don't have to touch it.
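To illustrate what "STOP/CONT to the complete process group" means at the OS level, here is a minimal Python sketch. It is not SGE code; the helper names and the `sleep 30` stand-in job are assumptions made up for the example.

```python
import os
import signal
import subprocess


def suspend_group(pgid: int) -> None:
    """Stop every process in the group, like `kill -STOP -<pgid>`."""
    os.killpg(pgid, signal.SIGSTOP)


def resume_group(pgid: int) -> None:
    """Continue every process in the group, like `kill -CONT -<pgid>`."""
    os.killpg(pgid, signal.SIGCONT)


def start_in_own_group(argv):
    """Start a stand-in "job" as leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


if __name__ == "__main__":
    job = start_in_own_group(["sleep", "30"])  # hypothetical stand-in for a job
    pgid = os.getpgid(job.pid)
    suspend_group(pgid)
    os.waitpid(job.pid, os.WUNTRACED)  # returns once the child is stopped
    resume_group(pgid)
    job.kill()
    job.wait()
```

Signalling the group rather than a single PID matters because a job script usually forks children, and all of them have to stop together.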
> 2) Imagine the following scenario - User A runs a large batch of short
> jobs and surpasses the soft limit. User B starts one very long job,
> causing the suspension of one of User A's short jobs and indefinitely
> delaying it. How do I prevent this from happening?
You could define a checkpointing environment (which the short jobs
must request) that reschedules the (short) job when it gets
suspended. The job will then restart on a different node when one
becomes available. Note that SGE doesn't checkpoint anything on its
own; it only supplies the interface to various checkpointing
libraries. In the simplest case, your (short) job will restart from
the beginning.
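Since SGE only provides the interface, any saving of state has to come from the job itself. A minimal sketch of such application-level checkpointing follows, assuming a job made of independent numbered steps; the state-file layout and all function names are hypothetical, not part of SGE.

```python
import json
import os
import tempfile


def load_progress(state_path: str) -> int:
    """Return the number of steps already completed, or 0 on a fresh start."""
    try:
        with open(state_path) as fh:
            return int(json.load(fh)["done"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return 0


def save_progress(state_path: str, done: int) -> None:
    """Write progress atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"done": done}, fh)
    os.replace(tmp_path, state_path)


def run_job(state_path: str, total_steps: int, stop_after=None) -> list:
    """Run the remaining steps and return the step numbers executed in this run.

    stop_after simulates the job being rescheduled partway through.
    """
    executed = []
    for step in range(load_progress(state_path), total_steps):
        if stop_after is not None and len(executed) >= stop_after:
            break
        # ... the real work for `step` would go here ...
        executed.append(step)
        save_progress(state_path, step + 1)
    return executed
```

The state file would have to live on a filesystem shared between nodes, otherwise the restarted job starts from step 0 again, which is the "simplest case" above.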
But there is nothing like a soft limit in SGE for the number of
running jobs. Once a job has been granted a slot to run, it's in the
system. What you could set up instead:
- one default queue for the jobs, which will be filled first (set up
via a sequence number for the queues), and a resource quota set (RQS)
defining for each user how much he is allowed to run in this queue
(soft limit).
- when he exceeds this limit, he has to use some sort of secondary
queue, and another RQS could limit this usage as well (hard limit).
These jobs will be suspended if something is running in the default
queue on this system (by subordination). Note that a user could hurt
himself this way, when after some time another of his jobs is
scheduled in the default queue on the same system.
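The two-queue policy above can be modelled roughly like this. It is only a sketch of the decision logic, not SGE configuration; the class, the function names and the limit values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserUsage:
    default_running: int    # jobs the user has running in the default queue
    secondary_running: int  # jobs the user has running in the secondary queue


def choose_queue(usage: UserUsage, soft_limit: int, hard_limit: int) -> str:
    """Return the queue for the user's next job, or "wait" if both quotas are used up.

    soft_limit models the RQS on the default queue, hard_limit the RQS on the
    secondary queue.
    """
    if usage.default_running < soft_limit:
        return "default"
    if usage.secondary_running < hard_limit:
        return "secondary"
    return "wait"


def secondary_suspended(default_jobs_on_host: int) -> bool:
    """Subordination as described above: secondary-queue jobs on a host are
    suspended while anything runs in the default queue on that same host."""
    return default_jobs_on_host > 0
```

For example, with a soft limit of 4 and a hard limit of 8, `choose_queue(UserUsage(4, 0), 4, 8)` returns `"secondary"`, and that job is then exposed to suspension whenever a default-queue job lands on its host.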
Depending on your intended setup, it might be better to use a
co-scheduler instead (run a script or the like as a cron job; you need
just one queue, and the script could suspend [to reschedule, as
mentioned above] one of the short jobs with the lowest runtime).
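The selection step of such a co-scheduler might look like the sketch below. How the list of running jobs is obtained (e.g. by parsing qstat output) is left out, and the `RunningJob` fields and the soft limit are assumptions for the example; only `qmod -sj` is an actual SGE command.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningJob:
    job_id: int
    owner: str
    runtime_s: int  # seconds the job has been running so far
    short: bool     # True if the job was submitted as a "short" job


def pick_victim(jobs: List[RunningJob], owner: str,
                soft_limit: int) -> Optional[RunningJob]:
    """Pick the short job with the lowest runtime if `owner` is over the soft limit."""
    mine = [j for j in jobs if j.owner == owner]
    if len(mine) <= soft_limit:
        return None  # user is within the limit, nothing to suspend
    candidates = [j for j in mine if j.short]
    if not candidates:
        return None
    return min(candidates, key=lambda j: j.runtime_s)


def suspend_command(job: RunningJob) -> List[str]:
    """Build (but do not run) the qmod call that suspends the job."""
    return ["qmod", "-sj", str(job.job_id)]
```

Choosing the job with the lowest runtime throws away the least work if the suspended job is rescheduled and has to start over.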