[GE users] Managing the Queue + Nodes
reuti at staff.uni-marburg.de
Mon Nov 2 22:48:24 GMT 2009
On 02.11.2009 at 17:12, scu98rkr wrote:
> We have a 64 node cluster consisting of 32 dual node machines. There are
> about 3 users. 2 of the users run single processor jobs that usually
> last between 7 hours and 1-2 days and they tend to queue up a batch of
> test cases, i.e. up to 70 jobs each.
> I'm running Gaussian and run many different types of jobs, from 1-2 hours
> to composite calculations of several days. Recently I've started running
> dual processor open mp2 jobs. I tend to just run a few jobs at a time,
> although I occasionally will run batches.
> We've never really come to a satisfactory conclusion of how to manage
> the resources most efficiently. Quite often 1-2 users will not be using
> the node, so I want all of the resources open to everyone. I've set
> up the share policy to 33% each so queued jobs will be ordered according
> to how much computing power each user is using on the cluster. Which is
> good, but it still means the user with the fewest jobs has to wait until
> the previous jobs have finished before their (possibly 1 hour) job
> will run.
> Also, as I mentioned earlier, I've started running dual processor jobs.
> I've just come back over the weekend to find none of my jobs have run,
> even after being at the front of the queue, because at no point have 2
> nodes on the same machine been free (rather unsurprisingly). (I can
> pretend the job only uses 1 processor, but I've noticed that if you
> specify 2 processors and someone else starts a job on the same machine,
> the computation time becomes much slower than if you'd specified 1.)
> What I really need SGE to do is monitor the usage of each user and
> check if any user is using more than 33% of the cluster. If there are
> currently any other jobs queued, it needs to suspend the jobs of the
> user over 33% and replace them with the queued jobs. SGE doesn't seem
> to have any problem suspending jobs, so can it run other jobs in that
> suspended space?
> I don't want to limit people's access to queues because I want the whole
> cluster available to 1 user if there is space.
a) As I read between the lines, you defined a PE with allocation_rule
$pe_slots and request two slots. What you need, to avoid serial jobs
slipping in, is to turn on resource reservation in the scheduler
configuration (`qconf -msconf`):
(or a more appropriate value)
and submit jobs with: qsub -R y ...
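
For illustration, a sketch of what the two steps in a) could look like. It assumes the scheduler setting meant here is `max_reservation`, which enables resource reservation when set above 0. The value 20, the PE name `smp` and the script name `job.sh` are placeholders chosen for this example, not taken from the post.

```
# In the editor opened by `qconf -msconf`, set (value is illustrative only):
max_reservation                   20

# Check the setting afterwards:
qconf -ssconf | grep max_reservation

# Submit a two-slot job with reservation requested
# ("smp" is a hypothetical PE with allocation_rule $pe_slots):
qsub -R y -pe smp 2 job.sh
```

With reservation requested, the scheduler should hold back slots for the two-slot job instead of filling every freed slot with a serial job; how many jobs may hold a reservation at once is what the value limits.
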
b) a share tree will honor the past usage, but AFAICS you request
only to honor the usage of the cluster right now: