[GE users] SGE resources and job queues.

Chris Dagdigian dag at sonsorol.org
Wed May 11 22:27:01 BST 2005

Grid Engine 6.x has the concept of "hostgroups" which may be easier  
to set up if you want to group your compute resources by rack  
location. Otherwise you are dead on with the resource idea -- you can  
attach arbitrary resources to nodes that your users can make hard  
requests on.
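As a rough sketch, a hostgroup and a rack-level resource might look something like this (names such as @rack_a, rack_a, and node001 are placeholders for illustration, not anything from your actual site):

```
# Define a hostgroup for one rack (SGE 6.x):
#   qconf -ahgrp @rack_a
# then in the editor:
#   group_name @rack_a
#   hostlist   node001 node002 ... node032

# Or attach a boolean complex and assign it to the rack's exec hosts:
#   qconf -mc            # add a line in the complex definition:
#   rack_a   rack_a   BOOL   ==   YES   NO   FALSE   0
#   qconf -me node001    # set: complex_values rack_a=TRUE

# Users then make a hard request for the rack:
#   qsub -l rack_a=TRUE myjob.sh
```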

The big issue for you is where you mention "...means moving the other  
jobs that users submitted to other nodes...."

This is not easy to make happen. By default Grid Engine will never  
mess with a running job --
the way Grid Engine applies its policy-based resource allocation is  
by manipulating the order of items waiting in the pending list.  It  
will not screw around with running jobs that have already been  
dispatched to nodes. { unless you explicitly configure it to do so ... }

So by default there is nothing in SGE that will "move jobs to  
different nodes" -- you'll have to make that happen yourself, and  
how it actually happens cleanly tends to be application-specific.  
There are clear mechanisms for doing this (job migration /  
checkpoint / restart) but none of it is implicit, easy or automatic.

If you have the source code to these applications and you can  
implement checkpoint/restart features then you may be able to easily  
use the SGE migration features to bounce jobs from node to node. This  
would certainly give you the freedom you need but relatively few  
people are in a position where 100% of their cluster jobs are  
checkpoint-able and subject to seamless migration.
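If your applications can checkpoint themselves, the SGE side of this is a checkpointing environment plus a migration-enabled submit. The script paths and environment name below are entirely hypothetical -- this is just the shape of the configuration:

```
# Create a checkpoint environment:
#   qconf -ackpt rack_ckpt
# then set, for example:
#   ckpt_name        rack_ckpt
#   interface        APPLICATION-LEVEL
#   ckpt_command     /site/bin/do_checkpoint.sh $job_id   # site-specific script
#   migr_command     /site/bin/migrate.sh $job_id         # site-specific script
#   restart_command  none
#   when             xmr    # checkpoint on shutdown, migration, reschedule

# Jobs opt in at submit time:
#   qsub -ckpt rack_ckpt myjob.sh
```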

So you may be in for some difficulties when running jobs have  
already been dispatched to the "big" resources (such as a rack of  
nodes), but you do have some opportunities for making these sorts  
of things happen with jobs that are still waiting for dispatch.

I'll mention some possibilities below that could be worth  
investigating but they fall well outside the realm of "what I've  
actually implemented myself" so take them with a grain of salt!

(1) you may be able to use the Grid Engine resource reservation and  
backfill mechanisms as a way to reserve entire racks for a set of  
jobs. This approach works best in areas where users are able to  
accurately predict the runtime their jobs need so that the backfill  
works efficiently.  The concept of resource reservation was invented  
(I think) to cover exactly the sort of situation you are describing.
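In outline, reservation plus backfill would look roughly like the following. The rack_a resource, runtimes, and PE name are placeholders; check your own scheduler config before copying any of this:

```
# Enable resource reservation in the scheduler config:
#   qconf -msconf        # set, e.g.: max_reservation 32

# The big job requests a reservation and states its runtime:
#   qsub -R y -l rack_a=TRUE,h_rt=12:00:00 big_job.sh

# Small jobs that declare short runtimes can be backfilled into
# the reserved slots without delaying the big job's start:
#   qsub -l h_rt=0:30:00 small_job.sh
```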

(2) Another option may be to investigate the urgency sub-policy.  
There is a way to attach urgency values to resources such as  
"Rack_A" so that jobs requesting the resource get a higher  
entitlement share. The pending list is then reorganized to boost  
those jobs higher up, giving them first crack at Rack_A job slots  
as running jobs drain out.
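Concretely, the urgency value lives in the last column of the complex definition; the rack_a name and the value 1000 below are made-up examples:

```
# In qconf -mc, the final column of a complex is its urgency value.
# Giving rack_a a large urgency boosts pending jobs that request it:
#   rack_a   rack_a   BOOL   ==   YES   NO   FALSE   1000

# The resulting urgency contributions can be inspected with:
#   qstat -urg
```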

Also you may want to read the official SGE 6.0x documentation  
available at this URL:

The various resource allocation policies are covered in far greater  
detail than the resource.html doc you referenced.


On May 11, 2005, at 4:57 PM, Jon Savian wrote:

> Hi Reuti,
> Thanks for your prompt response.  Users usually run scientific
> programs and request whatever resources they need for the job.  So
> yes, they specify runtime, memory, and number of slots needed.
> Users have expressed interest in running larger jobs that require 32
> nodes, containing 2 slots, and 2GB of memory each.  However they would
> like jobs to be run on nodes contained in the same rack, instead of
> using nodes across multiple racks.  We have multiple racks of 32
> nodes.  Hard requests will be needed I believe.
> So the first step I took was to specify a resource for one of the 32
> node racks.  So when a user does a "qsub -l resource_name....." it
> will run under the 32 nodes specified by it.  However other users
> might have already submitted jobs that are queued to run on some of
> the nodes we will need for our larger 32 node single rack job.  So
> ideally, I think we would want to find a way to make the
> single rack available so that the larger 32 node single rack job can
> run ASAP, which means moving the other jobs that users submitted to
> other nodes.  This may happen on a regular basis, so any kind of
> permanent setting for this would be great.
> I should also mention that I am making all modifications via qmon.
> Thanks.
> Jon
> They will be running a job on 32 nodes, each having 2GB memory, 2  
> slots/node.
> On 5/11/05, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi Jon,
>> can you give more details: what exactly do you mean with small and  
>> large jobs?
>> The runtime, the memory request, the number of slots?
>> And: is resource2 a hard request for the small jobs?
>> Anyway: Two possibilities to look at are soft-requests (for  
>> resource1 for the
>> small jobs), or putting a sequence number on the nodes, so that  
>> resource1 nodes
>> are filled first.
>> Cheers - Reuti
>> Quoting Jon Savian <worknit at gmail.com>:
>>> Hi Everyone,
>>> I am trying to allocate resources on a cluster, so I followed the
>>> steps here:
>>> http://gridengine.sunsource.net/project/gridengine/howto/resource.html
>>>  Let's say I created two resources, we'll call them resource1 and
>>>  resource2.  I want to be able to run a large job using resource2,  
>>> but if
>>> there are a lot of smaller jobs queued to run on resource2 then the
>>> larger job will have to wait until the smaller ones execute.  Is  
>>> there
>>> any way to move smaller jobs from the nodes on resource2 and put  
>>> them
>>> on resource1 (or any other non-resource2 nodes for that matter) so
>>> that the larger job may run on resource2 ASAP?  Or even better, are
>>> there any priorities that can be set with the larger job that  
>>> will put
>>> it before the smaller ones?
>>> Thanks.
>>> Jon
