[GE users] Job suspension methods in Linux

reuti reuti at staff.uni-marburg.de
Tue Feb 10 22:06:01 GMT 2009


On 10.02.2009 at 16:31, leeping wrote:

> Hi there,
> I have a Beowulf cluster running Grid Engine 6.1u3.  I'm trying to
> implement a system where users have a "soft" limit, above which their
> jobs may be automatically suspended if other users need the slots.  I
> have two questions:
> 1) How could I implement job suspension?  Do I need to enter something
> for "Suspend Method" and "Resume Method"?  I would imagine the command
> "kill -STOP" should do the trick, since I am running a Linux system.

This is already the default in SGE: it sends SIGSTOP/SIGCONT to the
complete process group of the job, so you don't have to touch it.
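
For illustration only, signalling a whole process group on Linux can be
sketched in Python as below. This is not SGE's own code; the function
names are made up for this example, and the process group id is assumed
to be known already.

```python
import os
import signal


def suspend_group(pgid: int) -> None:
    """Send SIGSTOP to every process in process group `pgid`."""
    os.killpg(pgid, signal.SIGSTOP)


def resume_group(pgid: int) -> None:
    """Send SIGCONT to every process in process group `pgid`."""
    os.killpg(pgid, signal.SIGCONT)
```

Signalling the group rather than a single PID matters because a job
script usually forks children, and all of them have to stop and continue
together.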

> 2) Imagine the following scenario - User A runs a large batch of short
> jobs and surpasses the soft limit.  User B starts one very long job,
> causing the suspension of one of User A's short jobs and indefinitely
> delaying it.  How do I prevent this from happening?

You could define a checkpointing environment (which the short jobs
must request) that reschedules the (short) job when it gets
suspended. It will then restart on a different node when one becomes
available. Note that SGE doesn't checkpoint anything on its own; it
only supplies the interface to various checkpointing libraries. In
the simplest case, your (short) job will just restart from the
beginning this way.

But there is nothing like a soft limit in SGE for the number of
running jobs. Once a job has been granted to run, it's in the system.
But here is what you could set up:

- one default queue for the jobs, which will be taken first (set up by
a sequence number for the queues), and an RQS defining for each user
what he is allowed to run in this queue (soft limit).

- when he exceeds this limit, he has to use some sort of secondary
queue, and another RQS could limit this usage as well (hard limit).
These jobs will be suspended if something is running in the default
queue on the same machine (by subordination). Note that a user could
hurt himself this way, when after some time another of his jobs is
scheduled in the default queue on the same node.
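
To make the intended behaviour of the two queues concrete, here is a
small Python model of the policy. It is only a sketch of the logic, not
SGE configuration; the queue names, function names and limit parameters
are hypothetical.

```python
from typing import Optional


def choose_queue(running_default: int, running_secondary: int,
                 soft_limit: int, secondary_limit: int) -> Optional[str]:
    """Decide where a user's next job may start, or None if it must wait."""
    if running_default < soft_limit:
        return "default"      # still within the per-user soft limit
    if running_secondary < secondary_limit:
        return "secondary"    # above the soft limit, below the hard limit
    return None               # both quotas exhausted, job stays pending


def secondary_is_suspended(default_jobs_on_host: int) -> bool:
    """Subordination: any default-queue job on the host suspends the
    secondary-queue jobs running there."""
    return default_jobs_on_host > 0
```

The second function ignores who owns the default-queue job, which is
exactly why a user can end up suspending his own secondary-queue job.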

Depending on your intended setup, it might be better to use a
co-scheduler instead (run a script or the like as a cron job; you need
just one queue, and it could suspend [to reschedule, as mentioned
above] one of the short jobs with the lowest runtime).
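
The selection step of such a co-scheduler could look roughly like this
in Python. It is a sketch under assumptions: the RunningJob record and
the soft_limit parameter are invented for the example, and gathering
the job list from SGE and actually suspending the job are left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RunningJob:
    job_id: int
    owner: str
    runtime_s: float  # seconds the job has been running so far


def pick_job_to_suspend(jobs: Sequence[RunningJob],
                        soft_limit: int) -> Optional[RunningJob]:
    """Return the job with the lowest runtime among users who run more
    than `soft_limit` jobs, or None if nobody is above the limit."""
    per_user = Counter(job.owner for job in jobs)
    candidates = [job for job in jobs if per_user[job.owner] > soft_limit]
    if not candidates:
        return None
    return min(candidates, key=lambda job: job.runtime_s)
```

Choosing the lowest runtime keeps the lost work small when the
suspended job is rescheduled and starts again from the beginning.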

-- Reuti

