[GE users] High and Low Priority Queues.
agrajag at dragaera.net
Fri Sep 3 14:32:06 BST 2004
On Fri, 2004-09-03 at 08:52, Rajeev Mishra wrote:
> We have 16 Linux dual processor machines in a cluster and our
> requirement is to make two queues on the same 16 machines.
> 1. High Priority queue (high_pri)
> 2. Low priority queue (low_pri)
> All 16 machines are dual CPU and I have defined 2 slots for each
> machine. The high_pri cluster queue includes all 16 machines so it
> has 32 slots. The low_pri cluster queue only includes 5 of the 16
> machines (12, 13, 14, 15 and 16) so it has 10 slots.
> In this case last 5 machines (12,13,14,15 and 16) got 4 slots each (2
> from high_pri and 2 from low_pri).
> I have put a load threshold on each machine so that a machine is not
> overloaded at any given time. But the kind of jobs we are running on
> this cluster use the CPU only when a job is doing computation. There are
> cases when a 4 processor job is running on
> two different machines (say machine #13 and #15, using 4 slots in the
> high_pri queue) and it is not doing computation (the load on both
> machines is very low), and now another user submits a new 4 processor
> job in the low_pri queue. SGE then assigns machines #13 and #15
> to the new job.
> Now both jobs suffer when job #1, submitted through the high_pri
> queue, starts iterating, as both dual processor machines have 4
> processes running on each machine (2 from high_pri and 2 from low_pri).
In order to get around this problem I set the priority on the queues so
that high priority jobs have a nice value of 0, and low priority jobs
have a nice value of 19. That way high priority will still be high
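For reference, the nice value of jobs in a queue is controlled by the
priority attribute of the queue configuration. A minimal sketch of the
relevant lines, assuming the queue names from the original post (all other
attributes are omitted here):

```
# excerpt of "qconf -sq high_pri"
qname      high_pri
priority   0

# excerpt of "qconf -sq low_pri"
qname      low_pri
priority   19
```

The attribute can be changed by editing the queue with "qconf -mq low_pri".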
> 2. I also made the low_pri queue a subordinate queue of the high_pri
> queue. It does suspend the queue when the max slots are filled in the
> high_pri queue, but it does not suspend running jobs in the low_pri queue.
> Is this by design, or am I missing something here?
This is one of the reasons I don't like subordinate queues. SGE uses
process groups to know which jobs in the queue to suspend. However, a job
can easily create its own process group, in which case it will not
receive the signals SGE sends it. This is common with MPICH jobs.
That said, you shouldn't try to suspend all or part of an MPICH job
anyway, as doing so will cause it to crash.
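The process group effect is easy to reproduce outside of SGE. The following
is only a sketch of the general mechanism, not of SGE's own code: the
function name and the two helper scripts are made up for illustration, and
SIGSTOP stands in for whatever suspend signal the queue is configured to
send. One process stays in the job's process group, the other moves itself
into a new group before the group is signalled.

```python
import os
import signal
import subprocess
import sys
import time

# Child that stays in whatever process group it was started in.
STAYER = "import time; time.sleep(30)"
# Child that creates its own process group, as some job launchers do.
ESCAPER = "import os, time; os.setpgrp(); time.sleep(30)"


def suspend_job_group():
    """Stop a job's process group and report which members actually stopped."""
    # The leader gets a fresh process group whose id equals its pid.
    leader = subprocess.Popen([sys.executable, "-c", STAYER],
                              preexec_fn=os.setpgrp)
    # The second process starts inside the job's group, then leaves it.
    escaper = subprocess.Popen([sys.executable, "-c", ESCAPER],
                               preexec_fn=lambda: os.setpgid(0, leader.pid))
    try:
        # Wait until the escaper has really moved to its own group.
        deadline = time.time() + 10
        while os.getpgid(escaper.pid) == leader.pid and time.time() < deadline:
            time.sleep(0.05)

        # Signal the job's group only, as a group-based suspend would.
        os.killpg(leader.pid, signal.SIGSTOP)

        # Blocks until the leader reports a state change (here: stopped).
        _, status = os.waitpid(leader.pid, os.WUNTRACED)
        leader_stopped = os.WIFSTOPPED(status)
        # Non-blocking check: pid 0 means the escaper has not changed state.
        pid, _ = os.waitpid(escaper.pid, os.WUNTRACED | os.WNOHANG)
        escaper_stopped = pid != 0
        return leader_stopped, escaper_stopped
    finally:
        # SIGKILL also terminates stopped processes, so cleanup is safe.
        for proc in (leader, escaper):
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    print(suspend_job_group())
```

On a Linux machine the leader should be reported as stopped while the
process that left the group keeps running, which matches the behaviour
described above.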