[GE users] SGE resources and job queues.

Chris Dagdigian dag at sonsorol.org
Wed May 11 22:46:25 BST 2005

to add on to my last email ...

I realized I forgot one other sub policy / resource allocation
approach which may also work well for the "I need a full rack of
nodes for my job" issue.

There is a sub policy that can explicitly increase the relative
entitlement of a job based on how long it has been stuck in the
pending queue awaiting its turn for dispatch (or, more often, waiting
for a difficult set of hard resource requests to be matched).

It was explicitly invented to avoid "job starvation" whereby large  
parallel jobs requesting many slots were pending forever because  
little non-parallel short jobs were zipping in and out of the queues  
leaving the scheduler with little time to find and hold on to the  
number of slots the big pending parallel jobs needed.

For the life of me I can't remember the specific name of this policy  
(it will be in the N1GE docs) but I think it is part of the Urgency  
policy mechanism and the parameter you adjust is something like  
"weight_wait_time" or something similar. You will need to test and  
experiment with the value to find the proper settings as these sorts  
of "algorithm adjustments" are not well covered in any SGE docs that  
I've been able to find. Trial and error does work though.
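
To make the idea a bit more concrete, here is a toy Python sketch of how
a waiting-time weight could grow a pending job's urgency over time. It
is only an illustration of the concept, not Grid Engine's scheduler
code: the function names, the job fields and the weight values are all
made up, and the formula (a static resource part plus seconds pending
multiplied by a weight) is an assumption about how the urgency policy
behaves. The actual parameter in sched_conf(5) appears to be spelled
weight_waiting_time; "qconf -ssconf" shows the current scheduler
settings on your own installation.

```python
def urgency(resource_contribution, seconds_pending, weight_waiting_time):
    """Toy urgency: a static resource part plus a part that grows while pending."""
    return resource_contribution + seconds_pending * weight_waiting_time


def pending_order(jobs, weight_waiting_time):
    """Return job names sorted from highest to lowest toy urgency."""
    ranked = sorted(
        jobs,
        key=lambda j: urgency(j["resource"], j["pending_s"], weight_waiting_time),
        reverse=True,
    )
    return [j["name"] for j in ranked]


# Hypothetical jobs. Both get the same static contribution so that only
# the waiting-time effect is visible.
big_parallel = {"name": "big_parallel", "resource": 1000.0, "pending_s": 6 * 3600}
small_serial = {"name": "small_serial", "resource": 1000.0, "pending_s": 60}

# Try a few made-up weights and see when the long-waiting job moves ahead.
for w in (0.0, 0.01, 0.1):
    print(w, pending_order([small_serial, big_parallel], w))
```

With a weight of zero the two jobs tie and keep their submission order;
any positive weight moves the job that has waited six hours to the
front. How large the weight has to be on a real cluster depends on the
other policies in play, which is where the trial and error comes in.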

This type of approach is something that you would configure to  
complement an existing set of policies. By itself it won't do much  
but it will serve to provide an extra entitlement "boost" to the jobs  
with complicated resource requests that may be languishing a bit in  
the pending queue to the endless angst of the end user.
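
A rough sketch of what "complementing an existing set of policies" might
look like numerically. It assumes the pending list is ordered by a
weighted sum of normalized urgency and ticket values, which is my
understanding of the description in sge_priority(5); the job names,
urgency numbers, ticket counts and weights below are invented for
illustration and are not taken from a real cluster.

```python
def normalize(values):
    """Scale values linearly into [0, 1]; all-equal inputs map to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(jobs, w_urgency, w_ticket):
    """Order job names by a weighted sum of normalized urgency and tickets."""
    urg = normalize([j["urgency"] for j in jobs])
    tix = normalize([j["tickets"] for j in jobs])
    scored = [
        (w_urgency * u + w_ticket * t, j["name"])
        for j, u, t in zip(jobs, urg, tix)
    ]
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    return [name for _, name in ranked]


# Hypothetical pending list: the rack job has built up urgency by
# waiting, while the short jobs hold more tickets from another policy.
jobs = [
    {"name": "rack_job", "urgency": 3160.0, "tickets": 100},
    {"name": "short_a", "urgency": 1006.0, "tickets": 500},
    {"name": "short_b", "urgency": 1001.0, "tickets": 400},
]

print(rank(jobs, w_urgency=0.0, w_ticket=1.0))  # ticket policy only
print(rank(jobs, w_urgency=2.0, w_ticket=1.0))  # urgency boost added on top
```

With the urgency weight at zero the rack job sits at the bottom of the
list; once urgency is weighted in alongside the tickets, the accumulated
waiting time is enough to lift it above the short jobs.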


On May 11, 2005, at 5:36 PM, Jon Savian wrote:

> Luckily I won't need to disrupt already running jobs, just ones that
> are waiting to run.
> Thanks.
> On 5/11/05, Chris Dagdigian <dag at sonsorol.org> wrote:
>> Grid Engine 6.x has the concept of "hostgroups" which may be easier
>> to set up if you want to group your compute resources by rack
>> location. Otherwise you are dead on with the resource idea -- you can
>> attach arbitrary resources to nodes that your users can make hard
>> requests on.
>> The big issue for you is where you mention "...means moving the other
>> jobs that users submitted to other nodes...."
>> This is not easy to make happen. By default Grid Engine will never
>> mess with a running job --
>> the way Grid Engine makes policy based resource allocation happen is
>> by manipulating the order of items waiting in the pending list.  It
>> will not screw around with running jobs that have already been
>> dispatched to nodes. { unless you explicitly configure it to do  
>> so ... }
>> So by default there is nothing in SGE that will "move jobs to
>> different nodes" -- you'll have to make that happen yourself and it
>> tends to be application specific in how this actually happens
>> cleanly.  There are clear mechanisms for doing this (job migration /
>> checkpoint / restart) but this is not something that is implicit,
>> easy or automatic.
>> If you have the source code to these applications and you can
>> implement checkpoint/restart features then you may be able to easily
>> use the SGE migration features to bounce jobs from node to node. This
>> would certainly give you the freedom you need but relatively few
>> people are in a position where 100% of their cluster jobs are
>> checkpoint-able and subject to seamless migration.
>> So you may be in for some difficulties when you are in a situation
>> where there are running jobs already dispatched to the "big"
>> resources (such as a rack of nodes) but you do have some
>> opportunities for making these sorts of things happen with jobs that
>> are still waiting for dispatch.
>> I'll mention some possibilities below that could be worth
>> investigating but they fall well outside the realm of "what I've
>> actually implemented myself" so take them with a grain of salt!
>> (1) you may be able to use the Grid Engine resource reservation and
>> backfill mechanisms as a way to reserve entire racks for a set of
>> jobs. This approach works best in areas where users are able to
>> accurately predict the runtime their jobs need so that the backfill
>> works efficiently.  The concept of resource reservation was invented
>> (I think) to cover exactly these sorts of situations you are  
>> describing.
>> (2) Another option may be to investigate the urgency sub policy --
>> there is a way to attach urgency values to resources such as "Rack_A"
>> so that jobs requesting the resource end up getting a higher
>> entitlement share. The pending list would then be reorganized to
>> boost those jobs higher in the list, so they would get first crack
>> at Rack_A job slots as running jobs drained out.
>> Also you may want to read the official SGE 6.0x documentation
>> available at this URL:
>> http://docs.sun.com/app/docs/coll/1017.3?q=N1GE
>> The various resource allocation policies are covered in far greater
>> detail than the resource.html doc you referenced.
>> Regards,
>> Chris
>> On May 11, 2005, at 4:57 PM, Jon Savian wrote:
>>> Hi Reuti,
>>> Thanks for your prompt response.  Users usually run scientific
>>> programs and request whatever resources they need for the job.  So
>>> yes, they specify runtime, memory, and number of slots needed.
>>> Users have expressed interest in running larger jobs that require 32
>>> nodes, each with 2 slots and 2GB of memory.  However, they would
>>> like jobs to be run on nodes contained in the same rack, instead of
>>> using nodes across multiple racks.  We have multiple racks of 32
>>> nodes.  Hard requests will be needed, I believe.
>>> So the first step I took was to specify a resource for one of the 32
>>> node racks.  So when a user does a "qsub -l resource_name....." it
>>> will run on the 32 nodes specified by it.  However, other users
>>> might have already submitted jobs that are queued to run on some of
>>> the nodes we will need for our larger 32 node single rack job.  So
>>> ideally, I think we would want to find a way to make the
>>> single rack available so that the larger 32 node single rack job can
>>> run ASAP, which means moving the other jobs that users submitted to
>>> other nodes.  This may happen on a regular basis, so any kind of
>>> permanent setting for this would be great.
>>> I should also mention that I am making all modifications via qmon.
>>> Thanks.
>>> Jon
>>> They will be running a job on 32 nodes, each having 2GB memory, 2
>>> slots/node.
>>> On 5/11/05, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>> Hi Jon,
>>>> can you give more details: what exactly do you mean by small and
>>>> large jobs?
>>>> The runtime, the memory request, the number of slots?
>>>> And: is resource2 a hard request for the small jobs?
>>>> Anyway: Two possibilities to look at are soft-requests (for
>>>> resource1 for the
>>>> small jobs), or putting a sequence number on the nodes, so that
>>>> resource1 nodes
>>>> are filled first.
>>>> Cheers - Reuti
>>>> Quoting Jon Savian <worknit at gmail.com>:
>>>>> Hi Everyone,
>>>>> I am trying to allocate resources on a cluster, so I followed the
>>>>> steps here:
>>>>> http://gridengine.sunsource.net/project/gridengine/howto/resource.html.
>>>>> Let's say I created two resources, we'll call them resource1 and
>>>>> resource2.  I want to be able to run a large job using resource2,
>>>>> but if there are a lot of smaller jobs queued to run on resource2
>>>>> then the larger job will have to wait until the smaller ones
>>>>> execute.  Is there any way to move smaller jobs from the nodes on
>>>>> resource2 and put them on resource1 (or any other non-resource2
>>>>> nodes for that matter) so that the larger job may run on resource2
>>>>> ASAP?  Or even better, are there any priorities that can be set
>>>>> with the larger job that will put it before the smaller ones?
>>>>> Thanks.
