[GE users] Dynamic queue -- sge schedule policy
reuti at staff.uni-marburg.de
Thu Aug 24 11:45:14 BST 2006
On 24.08.2006 at 04:46, Eric Zhang wrote:
> Hi, users:
> I have used SGE for a while and have run into a problem.
> I have a 32-node cluster. Some nodes in the cluster have 4G of memory
> and some nodes have 16G. Some jobs need at least 16G of memory and
> the other jobs do not have this limitation. In order to run the "16G"
> jobs correctly, I created a queue (call it queue A) which
> contains all the 16G nodes, and another queue (call it
> queue B) which contains the remaining nodes. Here comes the problem:
> When I submit a job that does not need 16G of memory, it can only run in
> queue B even if queue A has no load -- obviously this is a
> waste. On the other hand, if I don't define queue A and queue B, the
> jobs that do not need 16G will run on the 16G nodes, so that
> the jobs that do need 16G cannot run there and have to wait.
> I am not sure whether SGE has a dynamic scheduling policy, such that:
> if queue A has no load, jobs that do not need 16G can run
> on it;
> if a job that needs 16G is submitted, the jobs that do not
> need it are paused and migrated to queue B or, more
> simply, restarted/rescheduled.
You can get this behavior with the following setup:
- two hostgroups, @mem16 and @mem4, with the appropriate machines attached
- one queue for the low-memory jobs, containing both hostgroups
- a sequence number specified per hostgroup in this queue, so
that the 4G nodes will be used first
- the scheduler changed to sort by sequence number
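
As a rough sketch of the list above, and not a tested configuration: the host names (node01 to node04), the queue name low.q and the file names are made-up placeholders. Loading several queue attributes from a file with qconf -Rattr is an assumption about the most convenient route; editing the queue with qconf -mq low.q should give the same result. The sketch assumes low.q already exists (for example created with qconf -aq low.q).

```shell
#!/bin/sh
# Sketch only. node01..node04, low.q and all file names are placeholders.

# Hostgroups, in the format printed by "qconf -shgrp <group>".
cat > mem16.hgrp <<'EOF'
group_name @mem16
hostlist node01 node02
EOF

cat > mem4.hgrp <<'EOF'
group_name @mem4
hostlist node03 node04
EOF

# Attributes for the low-memory queue: it spans both hostgroups, and the
# lower seq_no on @mem4 puts those hosts first once the scheduler sorts
# by sequence number.
cat > low.q.attrs <<'EOF'
hostlist @mem4 @mem16
seq_no 0,[@mem4=10],[@mem16=20]
EOF

# Apply only where the SGE tools are installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ahgrp mem16.hgrp
    qconf -Ahgrp mem4.hgrp
    qconf -Rattr queue low.q.attrs low.q   # assumes low.q already exists
fi

# The scheduler change is made in the editor opened by "qconf -msconf":
#   queue_sort_method   seqno
```

The exact sequence numbers do not matter, only that the value for @mem4 is lower than the one for @mem16.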
For the jobs with a 16G request, use a second queue, but attach only
@mem16 to it, and subordinate the first queue to it. As only jobs on the
same host are suspended, it should be sufficient to simply subordinate the
complete other queue.
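
A minimal sketch of the second queue, under the same caveats: high.q and low.q are placeholder names, high.q is assumed to exist already, and using qconf -Rattr with an attribute file is an assumption (setting the same two lines in qconf -mq high.q is the manual equivalent).

```shell
#!/bin/sh
# Sketch only. high.q and low.q are placeholder queue names.

# The high-memory queue lives on @mem16 only. "low.q=1" asks for low.q
# to be suspended on a host as soon as one slot of high.q is in use there.
cat > high.q.attrs <<'EOF'
hostlist @mem16
subordinate_list low.q=1
EOF

# Apply only where the SGE tools are installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -Rattr queue high.q.attrs high.q   # assumes high.q already exists
fi
```

Without the "=1" threshold, low.q would only be suspended on a host once all slots of high.q on that host are filled.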
Okay, now the reschedule part: for this you will have to use a
checkpointing interface for the low-memory jobs, because with it a
suspend (caused by the subordination) will trigger a reschedule/migration.
Whether your jobs can be migrated or will have to be restarted
depends on your application. You can find a Howto for this here:
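
Separately from that Howto, and only as an illustration of the idea, a checkpointing environment for the low-memory jobs might look roughly like this. The names lowmem_ckpt, low.q and job.sh are placeholders, and all commands are left at "none", which assumes jobs that simply start over when rescheduled.

```shell
#!/bin/sh
# Sketch only. lowmem_ckpt, low.q and job.sh are placeholder names.

# A checkpointing environment without real checkpoint commands:
# the "x" in "when" is what turns a suspend into a reschedule.
cat > lowmem.ckpt <<'EOF'
ckpt_name        lowmem_ckpt
interface        userdefined
ckpt_command     none
migr_command     none
restart_command  none
clean_command    none
ckpt_dir         /tmp
signal           none
when             xsr
EOF

# Apply only where the SGE tools are installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ackpt lowmem.ckpt
    # Allow the low-memory queue to run jobs under this environment.
    qconf -aattr queue ckpt_list lowmem_ckpt low.q
fi

# Low-memory jobs would then be submitted with:
#   qsub -ckpt lowmem_ckpt job.sh
```

An application that can save and restore its own state would hook in through the command entries or through its own restart logic instead of starting from scratch.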
Whether you specify the queue in each qsub for your job, or
use a forced complex for the high-memory queue and request it only
for the 16G jobs, is a matter of personal taste.
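
The two variants could be sketched as follows. This is a configuration fragment rather than a script; the complex name highmem (shortcut hm), the queue name high.q and the job script bigjob.sh are placeholders.

```shell
# Sketch only. high.q, highmem/hm and bigjob.sh are placeholder names.

# Variant 1: name the queue in each submission.
#   qsub -q high.q bigjob.sh

# Variant 2: forced complex.
# a) Line added in the editor opened by "qconf -mc":
#   #name    shortcut  type  relop  requestable  consumable  default  urgency
#   highmem  hm        BOOL  ==     FORCED       NO          0        0
# b) Line set in the high-memory queue via "qconf -mq high.q":
#   complex_values   highmem=TRUE
# c) Only the 16G jobs request it:
#   qsub -l highmem=TRUE bigjob.sh
```

With the forced complex, jobs that do not request highmem cannot land in high.q, so the low-memory jobs need no extra option at submission.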