[GE users] SGE resources and job queues.

Charu Chaubal Charu.Chaubal at Sun.COM
Wed May 11 23:20:06 BST 2005

Chris Dagdigian wrote:
> to add on to my last email ...
> I realized I forgot one other sub policy / resource allocation  
> approach which may also work well for the "I need a full rack of  
> nodes for my job" issue.
> There is a sub policy that explicitly can increase the relative  
> entitlement of a job based on how long it has been stuck in the  
> pending queue awaiting its turn for dispatch (or more often waiting  
> for a difficult set of hard resource requests to be matched)
> It was explicitly invented to avoid "job starvation" whereby large  
> parallel jobs requesting many slots were pending forever because  
> little non-parallel short jobs were zipping in and out of the queues  
> leaving the scheduler with little time to find and hold on to the  
> number of slots the big pending parallel jobs needed.
> For the life of me I can't remember the specific name of this policy  
> (it will be in the N1GE docs) but I think it is part of the Urgency  
> policy mechanism and the parameter you adjust is something like  
> "weight_wait_time" or something similar. You will need to test and  
> experiment with the value to find the proper settings as these sorts  
> of "algorithm adjustments" are not well covered in any SGE docs that  
> I've been able to find. Trial and error does work though.

This is referred to as "weight_waiting_time" in the sched_conf man page
--- that is the actual name of the parameter.

Unfortunately, the sge_priority man page refers to it as
"waiting_weight" --- the two names denote one and the same parameter.


> This type of approach is something that you would configure to  
> complement an existing set of policies. By itself it won't do much  
> but it will serve to provide an extra entitlement "boost" to the jobs  
> with complicated resource requests that may be languishing a bit in  
> the pending queue to the endless angst of the end user.
> -Chris
> On May 11, 2005, at 5:36 PM, Jon Savian wrote:
>>Luckily I won't need to disrupt already running jobs, just ones that
>>are waiting to run.
>>On 5/11/05, Chris Dagdigian <dag at sonsorol.org> wrote:
>>>Grid Engine 6.x has the concept of "hostgroups" which may be easier
>>>to set up if you want to group your compute resources by rack
>>>location. Otherwise you are dead on with the resource idea -- you can
>>>attach arbitrary resources to nodes that your users can make hard
>>>requests on.
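A minimal sketch of the hostgroup-plus-resource approach, assuming hypothetical node names and a complex called rack_a (neither is from the original thread):

```shell
# Create a hostgroup for the nodes in one rack (opens an editor):
#
#   group_name  @rack_a
#   hostlist    node001 node002 node003
qconf -ahgrp @rack_a

# Add a requestable boolean complex "rack_a" via "qconf -mc", attach
# it to the rack's hosts, and then users can make a hard request for
# it at submission time:
qsub -l rack_a=true myjob.sh
```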
>>>The big issue for you is where you mention "...means moving the other
>>>jobs that users submitted to other nodes...."
>>>This is not easy to make happen. By default Grid Engine will never
>>>mess with a running job --
>>>the way Grid Engine makes policy based resource allocation happen is
>>>by manipulating the order of items waiting in the pending list.  It
>>>will not screw around with running jobs that have already been
>>>dispatched to nodes. { unless you explicitly configure it to do  
>>>so ... }
>>>So by default there is nothing in SGE that will "move jobs to
>>>different nodes" -- you'll have to make that happen yourself, and
>>>how this actually happens cleanly tends to be application
>>>specific.  There are clear mechanisms for doing this (job migration /
>>>checkpoint / restart) but this is not something that is implicit,
>>>easy or automatic.
>>>If you have the source code to these applications and you can
>>>implement checkpoint/restart features then you may be able to easily
>>>use the SGE migration features to bounce jobs from node to node. This
>>>would certainly give you the freedom you need but relatively few
>>>people are in a position where 100% of their cluster jobs are
>>>checkpoint-able and subject to seamless migration.
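As a rough sketch of that migration machinery (the interface scripts named below are hypothetical placeholders you would have to write yourself, and the "when" flags are illustrative):

```shell
# Define a checkpointing environment (qconf -ackpt opens an editor):
#
#   ckpt_name          my_ckpt
#   interface          userdefined
#   ckpt_command       /opt/ckpt/checkpoint.sh
#   migr_command       /opt/ckpt/migrate.sh
#   restart_command    /opt/ckpt/restart.sh
#   when               xs
qconf -ackpt my_ckpt

# Jobs submitted against it become candidates for migration:
qsub -ckpt my_ckpt myjob.sh
```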
>>>So you may be in for some difficulties when you are in a situation
>>>where there are running jobs already dispatched to the "big"
>>>resources (such as a rack of nodes) but you do have some
>>>opportunities for making these sorts of things happen with jobs that
>>>are still waiting for dispatch.
>>>I'll mention some possibilities below that could be worth
>>>investigating but they fall well outside the realm of "what I've
>>>actually implemented myself" so take them with a grain of salt!
>>>(1) you may be able to use the Grid Engine resource reservation and
>>>backfill mechanisms as a way to reserve entire racks for a set of
>>>jobs. This approach works best in areas where users are able to
>>>accurately predict the runtime their jobs need so that the backfill
>>>works efficiently.  The concept of resource reservation was invented
>>>(I think) to cover exactly these sorts of situations you are describing.
>>>(2) Another option may be to investigate the urgency sub policy --
>>>there is a way to attach urgency values to resources such as "Rack_A"
>>>such that jobs requesting the resource end up getting a higher
>>>entitlement share which means that the pending list would be
>>>reorganized to boost the job higher in the list which means they
>>>would get first crack at Rack_A job slots as running jobs drained away.
>>>Also you may want to read the official SGE 6.0x documentation
>>>available at this URL:
>>>The various resource allocation policies are covered in far greater
>>>detail than the resource.html doc you referenced.
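Rough sketches of both options follow; the resource name, limits, runtimes, and the "mpi" parallel environment are illustrative assumptions, not details from the thread:

```shell
# Option (1): resource reservation + backfill.
# Allow reservations in the scheduler configuration (qconf -msconf):
#
#   max_reservation   20
#   default_duration  8:00:00
#
# Then submit the big job with a reservation (-R y) and an honest
# runtime estimate so backfilling of small jobs stays efficient:
qsub -R y -l h_rt=4:00:00 -pe mpi 64 big_job.sh

# Option (2): urgency attached to a resource.
# In the complex configuration (qconf -mc), give the rack resource a
# non-zero urgency value (last column):
#
#   #name   shortcut  type  relop  requestable  consumable  default  urgency
#   rack_a  ra        BOOL  ==     YES          NO          0        1000
#
# Jobs hard-requesting rack_a then sort higher in the pending list:
qsub -l rack_a=true big_job.sh
```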
>>>On May 11, 2005, at 4:57 PM, Jon Savian wrote:
>>>>Hi Reuti,
>>>>Thanks for your prompt response.  Users usually run scientific
>>>>programs and request whatever resources they need for the job.  So
>>>>yes, they specify runtime, memory, and number of slots needed.
>>>>Users have expressed interest in running larger jobs that require 32
>>>>nodes, containing 2 slots, and 2GB of memory each.  However they would
>>>>like jobs to be run on nodes contained in the same rack, instead of
>>>>using nodes across multiple racks.  We have multiple racks of 32
>>>>nodes.  Hard requests will be needed, I believe.
>>>>So the first step I took was to specify a resource for one of the 32
>>>>node racks.  So when a user does a "qsub -l resource_name....." It
>>>>will run under the 32 nodes specified by it.  However other users
>>>>might have already submitted jobs that are queued to run on some of
>>>>the nodes we will need for our larger 32 node single rack job.  So
>>>>ideally, I think we would want to find a way to make the
>>>>single rack available so that the larger 32 node single rack job can
>>>>run ASAP, which means moving the other jobs that users submitted to
>>>>other nodes.  This may happen on a usual basis, so any kind of
>>>>permanent setting for this would be great.
>>>>I should also mention that I am making all modifications via qmon.
>>>>They will be running a job on 32 nodes, each having 2GB memory and 2 slots.
>>>>On 5/11/05, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>Hi Jon,
>>>>>can you give more details: what exactly do you mean with small and
>>>>>large jobs?
>>>>>The runtime, the memory request, the number of slots?
>>>>>And: is resource2 a hard request for the small jobs?
>>>>>Anyway: Two possibilities to look at are soft-requests (for
>>>>>resource1 for the
>>>>>small jobs), or putting a sequence number on the nodes, so that
>>>>>resource1 nodes
>>>>>are filled first.
>>>>>Cheers - Reuti
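A brief sketch of Reuti's two suggestions (the queue and resource names are placeholders):

```shell
# (a) Soft request: prefer resource1, but run elsewhere if it is busy:
qsub -soft -l resource1=true small_job.sh

# (b) Sequence numbers: set seq_no in the queue configuration
# (qconf -mq all.q) for the resource1 nodes, and have the scheduler
# sort queue instances by it (qconf -msconf):
#
#   queue_sort_method   seqno
```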
>>>>>Quoting Jon Savian <worknit at gmail.com>:
>>>>>>Hi Everyone,
>>>>>>I am trying to allocate resources on a cluster, so I followed the
>>>>>>steps here:
>>>>>> Let's say I created two resources; we'll call them resource1 and
>>>>>> resource2.  I want to be able to run a large job using resource2,
>>>>>>but if there are a lot of smaller jobs queued to run on resource2
>>>>>>then the larger job will have to wait until the smaller ones
>>>>>>execute.  Is there any way to move smaller jobs from the nodes on
>>>>>>resource2 and put them on resource1 (or any other non-resource2
>>>>>>nodes for that matter) so that the larger job may run on resource2
>>>>>>ASAP?  Or even better, are there any priorities that can be set
>>>>>>with the larger job that will put it before the smaller ones?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

# Charu V. Chaubal              # Phone: (650) 786-7672 (x87672)   #
# Grid Computing Technologist   # Fax:   (650) 786-4591            #
# Sun Microsystems, Inc.        # Email: charu.chaubal at sun.com     #
