[GE users] default "-l" attributes for each queue
txema.heredia at upf.edu
Wed Sep 16 12:07:04 BST 2009
Thanks for your answer.
> can you elaborate this in more detail?
I'll do it.
My users' jobs can be lumped into 3 groups: jobs that last less than 1 hour, jobs that last less than 1 day, and jobs that last "a lot".
I have 8 compute nodes (8 cores each). Currently all types of jobs run on all nodes, which causes some "fairness" problems because of the differences in job duration, even though we use a user-ticket policy.
So I decided to split my cluster into 3 blocks, each dedicated to a specific kind of job, and made a queue for each block. This way, even when lots of "slow" jobs are running, they won't affect the "fast" ones. (Users tend to launch from 100 to 1,000 jobs at the same time, but usually of only one kind, so there shouldn't be disturbances among projects.) But this causes a problem: if there are no jobs of a given type, the nodes of that block sit idle, even if there are thousands of jobs waiting in the other queues.
So I decided to create a system where jobs run "primarily" in their specific block of nodes. If that block has all its slots filled, jobs are sent to the other nodes if those aren't busy, but only filling up to 6 of the 8 possible slots. This way the system keeps at least 2 slots free for the preferred kind of jobs to run on their "priority" hosts (and once those 2 slots are filled, the queue's load_threshold prevents other "non-priority" jobs from being scheduled there).
In order to do that I created several consumables which "count" how many jobs of each kind I have running in that host:
num_jobs = total number of jobs running on the host (I want up to 8 jobs running on any host).
fast_jobs = number of "fast" jobs running on the host (8 in the "fast-priority" hosts). Used for load threshold to stop queues when this is 2 or more.
med_jobs = number of "medium" jobs running on the host (8 in the "medium-priority" hosts). Used for load threshold to stop queues when this is 2 or more.
slow_jobs = number of "slow" jobs running on the host (8 in the "slow-priority" hosts). Used for load threshold to stop queues when this is 2 or more.
fast_med = the sum of "fast" and "medium" (not "slow") jobs running on a host. Used for load threshold to stop queues when this is 6 or more in the "slow-priority" hosts.
fast_slow = the sum of "fast" and "slow" (not "medium") jobs running on a host. Used for load threshold to stop queues when this is 6 or more in the "medium-priority" hosts.
med_slow = the sum of "medium" and "slow" (not "fast") jobs running on a host. Used for load threshold to stop queues when this is 6 or more in the "fast-priority" hosts.
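One way to read the thresholds above is as a per-host admission rule. The sketch below models that reading in plain shell; it does not talk to Grid Engine, and the function name host_accepts and its argument order are made up for illustration. It assumes that a non-priority job is refused once the host runs 2 or more jobs of its priority kind or 6 or more jobs of the other two kinds, and that no host ever runs more than 8 jobs.

```shell
#!/bin/sh
# Toy model of the consumable thresholds (an interpretation, not SGE code).
# Usage: host_accepts <host priority kind> <job kind> <fast> <med> <slow>
# The three numbers are the jobs of each kind already running on the host.
# Returns 0 if one more job of <job kind> would be admitted, 1 otherwise.
host_accepts() {
    prio=$1 kind=$2 fast=$3 med=$4 slow=$5

    # num_jobs: never more than 8 jobs on a host.
    total=$((fast + med + slow))
    if [ "$total" -ge 8 ]; then
        return 1
    fi

    # Jobs of the host's own kind may use every free slot.
    if [ "$kind" = "$prio" ]; then
        return 0
    fi

    # Count the priority-kind jobs on this host.
    case "$prio" in
        fast) own=$fast ;;
        med)  own=$med ;;
        slow) own=$slow ;;
        *)    return 2 ;;   # unknown host kind
    esac
    others=$((total - own))

    # Non-priority jobs stop at 2 priority jobs or 6 non-priority jobs.
    [ "$own" -lt 2 ] && [ "$others" -lt 6 ]
}

# Try a slow job on a fast-priority host with different fast/med/slow counts.
for counts in "1 3 2" "0 4 2" "2 1 1"; do
    if host_accepts fast slow $counts; then
        echo "fast/med/slow = $counts: slow job accepted"
    else
        echo "fast/med/slow = $counts: slow job refused"
    fi
done
```

In the real setup the counting is done by the consumables themselves; the model only makes it easier to check which combinations the thresholds are meant to allow.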
So, in order to run my jobs I have to type this:
fast jobs --> qsub -q fast -l num_jobs=1 -l fast_jobs=1 -l fast_med=1 -l fast_slow=1 ...
medium jobs --> qsub -q med -l num_jobs=1 -l med_jobs=1 -l fast_med=1 -l med_slow=1 ...
slow jobs --> qsub -q slow -l num_jobs=1 -l slow_jobs=1 -l fast_slow=1 -l med_slow=1 ...
If I type this manually it works, and everything is OK. But those parameters are mandatory for the system to work, and omitting even one of them could end up blocking other people's jobs. Plus, I do NOT trust my users (they are all biologists). The less they have to do, the better for the system's stability.
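The three command lines above differ only in the queue name and in which consumables they request, so the pattern can be written down once. The following is a minimal sketch of a hypothetical helper (the name job_class_opts and the placeholder job.sh are not part of Grid Engine or of my setup) that prints the options for a job class. It only echoes the command as a dry run and submits nothing.

```shell
#!/bin/sh
# Hypothetical helper: print the qsub options for one job class.
# Queue and consumable names are the ones listed in this message;
# the function name and its single argument are assumptions.
job_class_opts() {
    case "$1" in
        fast) echo "-q fast -l num_jobs=1 -l fast_jobs=1 -l fast_med=1 -l fast_slow=1" ;;
        med)  echo "-q med -l num_jobs=1 -l med_jobs=1 -l fast_med=1 -l med_slow=1" ;;
        slow) echo "-q slow -l num_jobs=1 -l slow_jobs=1 -l fast_slow=1 -l med_slow=1" ;;
        *)    echo "unknown job class: $1" >&2
              return 1 ;;
    esac
}

# Dry run: show the command line instead of submitting it.
# job.sh is a placeholder for the user's job script.
echo qsub $(job_class_opts fast) job.sh
```

A wrapper like this still relies on users calling it instead of plain qsub, so it reduces typing mistakes but does not enforce anything.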
> The idea behind SGE is to select a queue for you according to the
> given resource requests (in contrast to other queuing systems where
> you submit into a queue).
> The resource requests will be used by SGE to schedule it to a queue
> which fulfills the request - hence they must be known. Requesting a
> queue and resources at the same time may be redundant. Do you want
> some values in the jobscript having certain values, which the scripts
> should use?
> -- Reuti
> PS: Nevertheless, when I get you right: give all queues all
> parameters, and the not appropriate one set to a high value (assuming
> they are consumable), e.g.:
> q1: param1=3,param2=9,param3=27,param4=9999,param5=9999,param6=9999
> q2: param1=9999,param2=9999,param3=9999,param4=16,param5=8,param6=4
> qsub -l param1=3,param2=9,param3=27,param4=16,param5=8,param6=4
> should run in both queues.
Yes, this would work, but the problem is the same as before: it's too complicated to trust my users to use it properly.
> > I suppose that there won't be any solution, but I have to try ;)
> > PS: I'm using 6.1u4
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?
> > dsForumId=38&dsMessageId=217346
> > To unsubscribe from this discussion, e-mail: [users-
> > unsubscribe at gridengine.sunsource.net].