[GE users] SGE resources and job queues.

Charu Chaubal Charu.Chaubal at Sun.COM
Wed May 11 23:20:06 BST 2005



Chris Dagdigian wrote:
> to add on to my last email ...
> 
> I realized I forgot one other sub policy / resource allocation  
> approach which may also work well for the "I need a full rack of  
> nodes for my job" issue.
> 
> There is a sub policy that can explicitly increase the relative  
> entitlement of a job based on how long it has been stuck in the  
> pending queue awaiting its turn for dispatch (or, more often, waiting  
> for a difficult set of hard resource requests to be matched).
> 
> It was explicitly invented to avoid "job starvation" whereby large  
> parallel jobs requesting many slots were pending forever because  
> little non-parallel short jobs were zipping in and out of the queues  
> leaving the scheduler with little time to find and hold on to the  
> number of slots the big pending parallel jobs needed.
> 
> For the life of me I can't remember the specific name of this policy  
> (it will be in the N1GE docs) but I think it is part of the Urgency  
> policy mechanism and the parameter you adjust is something like  
> "weight_wait_time" or something similar. You will need to test and  
> experiment with the value to find the proper settings as these sorts  
> of "algorithm adjustments" are not well covered in any SGE docs that  
> I've been able to find. Trial and error does work though.
> 

This is referred to as "weight_waiting_time" in the sched_conf man page
--- this is the actual name of the parameter.
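
To see or change it, you would edit the scheduler configuration with
qconf; a minimal sketch (the value below is purely illustrative, not a
recommendation):

    # show the current setting
    qconf -ssconf | grep weight_waiting_time

    # edit the scheduler configuration and raise it from its default
    # (0.000000 on a stock 6.0 install, as far as I remember)
    qconf -msconf
        weight_waiting_time    0.010000

The waiting-time contribution feeds into a job's urgency value, which
is in turn scaled by weight_urgency in the overall priority formula
(see sge_priority(5)), so weight_urgency also needs to be non-zero for
the boost to have any effect.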

However, in the sge_priority man page, unfortunately it's referred to as
"waiting_weight" --- these are actually one and the same.

Regards,
	Charu


> This type of approach is something that you would configure to  
> complement an existing set of policies. By itself it won't do much  
> but it will serve to provide an extra entitlement "boost" to the jobs  
> with complicated resource requests that may be languishing a bit in  
> the pending queue to the endless angst of the end user.
> 
> -Chris
> 
> 
> 
> On May 11, 2005, at 5:36 PM, Jon Savian wrote:
> 
> 
>>Luckily I won't need to disrupt already running jobs, just ones that
>>are waiting to run.
>>
>>Thanks.
>>
>>On 5/11/05, Chris Dagdigian <dag at sonsorol.org> wrote:
>>
>>
>>>Grid Engine 6.x has the concept of "hostgroups" which may be easier
>>>to set up if you want to group your compute resources by rack
>>>location. Otherwise you are dead on with the resource idea -- you can
>>>attach arbitrary resources to nodes that your users can make hard
>>>requests on.
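>>>
>>>For example, roughly (untested; "rack_a" / "@rack_a" are placeholder
>>>names and the host list is yours to fill in):
>>>
>>>    # group the rack's hosts into a hostgroup (opens an editor)
>>>    qconf -ahgrp @rack_a
>>>        group_name @rack_a
>>>        hostlist   node001 node002 ... node032
>>>
>>>    # add a boolean complex "rack_a" (qconf -mc, one extra line)
>>>    #name   shortcut  type  relop  requestable  consumable  default  urgency
>>>    rack_a  rka       BOOL  ==     YES          NO          FALSE    0
>>>
>>>    # attach it to each host in the rack (or use qconf -me <host>)
>>>    qconf -aattr exechost complex_values rack_a=true node001
>>>
>>>    # users then make a hard request for it
>>>    qsub -hard -l rack_a=true job.sh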
>>>
>>>The big issue for you is where you mention "...means moving the other
>>>jobs that users submitted to other nodes...."
>>>
>>>This is not easy to make happen. By default Grid Engine will never
>>>mess with a running job --
>>>the way Grid Engine makes policy based resource allocation happen is
>>>by manipulating the order of items waiting in the pending list.  It
>>>will not screw around with running jobs that have already been
>>>dispatched to nodes. { unless you explicitly configure it to do  
>>>so ... }
>>>
>>>So by default there is nothing in SGE that will "move jobs to
>>>different nodes" -- you'll have to make that happen yourself and it
>>>tends to be application specific in how this actually happens
>>>cleanly.  There are clear mechanisms for doing this (job migration /
>>>checkpoint / restart) but this is not something that is implicit,
>>>easy or automatic.
>>>
>>>If you have the source code to these applications and you can
>>>implement checkpoint/restart features then you may be able to easily
>>>use the SGE migration features to bounce jobs from node to node. This
>>>would certainly give you the freedom you need but relatively few
>>>people are in a position where 100% of their cluster jobs are
>>>checkpoint-able and subject to seamless migration.
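>>>
>>>A sketch of that path, in case it helps (the environment name and the
>>>commands inside it are entirely things you would supply yourself):
>>>
>>>    # define a checkpointing environment (opens an editor); its
>>>    # ckpt_command / migr_command / restart_command fields point at
>>>    # your own scripts -- see the checkpoint(5) man page
>>>    qconf -ackpt my_ckpt
>>>
>>>    # attach it to the relevant queues via their ckpt_list, then
>>>    # submit jobs as rerunnable with that environment
>>>    qsub -ckpt my_ckpt -r y job.sh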
>>>
>>>So you may be in for some difficulties when you are in a situation
>>>where there are running jobs already dispatched to the "big"
>>>resources (such as a rack of nodes) but you do have some
>>>opportunities for making these sorts of things happen with jobs that
>>>are still waiting for dispatch.
>>>
>>>I'll mention some possibilities below that could be worth
>>>investigating but they fall well outside the realm of "what I've
>>>actually implemented myself" so take them with a grain of salt!
>>>
>>>(1) you may be able to use the Grid Engine resource reservation and
>>>backfill mechanisms as a way to reserve entire racks for a set of
>>>jobs. This approach works best in areas where users are able to
>>>accurately predict the runtime their jobs need so that the backfill
>>>works efficiently.  The concept of resource reservation was invented
>>>(I think) to cover exactly these sorts of situations you are  
>>>describing.
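>>>
>>>Roughly, and untested (the runtimes and PE name are placeholders):
>>>
>>>    # scheduler configuration (qconf -msconf): allow some reservations
>>>    max_reservation     32
>>>    default_duration    8:00:00
>>>
>>>    # the big job asks for a reservation and states its runtime
>>>    qsub -R y -pe mpi 64 -l h_rt=24:00:00 big_job.sh
>>>
>>>    # small jobs state honest runtimes so they can be backfilled
>>>    # around the reservation
>>>    qsub -l h_rt=0:30:00 small_job.sh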
>>>
>>>(2) Another option may be to investigate the urgency sub policy --
>>>there is a way to attach urgency values to resources such as "Rack_A"
>>>such that jobs requesting the resource end up getting a higher
>>>entitlement share.  The pending list would then be reorganized to
>>>boost those jobs higher, which means they would get first crack at
>>>Rack_A job slots as running jobs drained out.
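>>>
>>>If I remember right, that value lives in the "urgency" column of the
>>>complex definition (qconf -mc) -- e.g. giving the placeholder rack_a
>>>resource from the earlier sketch a non-zero urgency:
>>>
>>>    #name   shortcut  type  relop  requestable  consumable  default  urgency
>>>    rack_a  rka       BOOL  ==     YES          NO          FALSE    1000
>>>
>>>Jobs submitted with "-l rack_a=true" then pick up that urgency
>>>contribution (scaled by weight_urgency in the scheduler config) and
>>>move up the pending list.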
>>>
>>>Also you may want to read the official SGE 6.0x documentation
>>>available at this URL:
>>>http://docs.sun.com/app/docs/coll/1017.3?q=N1GE
>>>
>>>The various resource allocation policies are covered in far greater
>>>detail than the resource.html doc you referenced.
>>>
>>>Regards,
>>>Chris
>>>
>>>
>>>On May 11, 2005, at 4:57 PM, Jon Savian wrote:
>>>
>>>
>>>
>>>>Hi Reuti,
>>>>
>>>>Thanks for your prompt response.  Users usually run scientific
>>>>programs and request whatever resources they need for the job.  So
>>>>yes, they specify runtime, memory, and number of slots needed.
>>>>
>>>>Users have expressed interest in running larger jobs that require 32
>>>>nodes, each with 2 slots and 2GB of memory.  However, they would
>>>>like the jobs to run on nodes contained in the same rack, instead of
>>>>using nodes across multiple racks.  We have multiple racks of 32
>>>>nodes.  Hard requests will be needed, I believe.
>>>>
>>>>So the first step I took was to specify a resource for one of the
>>>>32-node racks, so that when a user does a "qsub -l resource_name....."
>>>>the job will run on the 32 nodes covered by it.  However, other users
>>>>might have already submitted jobs that are queued to run on some of
>>>>the nodes we will need for our larger 32-node single-rack job.  So
>>>>ideally, I think we would want to find a way to make the single rack
>>>>available so that the larger 32-node single-rack job can run ASAP,
>>>>which means moving the other jobs that users submitted to other
>>>>nodes.  This may happen on a regular basis, so any kind of permanent
>>>>setting for this would be great.
>>>>
>>>>I should also mention that I am making all modifications via qmon.
>>>>
>>>>Thanks.
>>>>
>>>>Jon
>>>>
>>>>
>>>>They will be running a job on 32 nodes, each having 2GB memory, 2
>>>>slots/node.
>>>>
>>>>On 5/11/05, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>
>>>>
>>>>
>>>>>Hi Jon,
>>>>>
>>>>>can you give more details: what exactly do you mean by small and
>>>>>large jobs?  The runtime, the memory request, the number of slots?
>>>>>
>>>>>And: is resource2 a hard request for the small jobs?
>>>>>
>>>>>Anyway: two possibilities to look at are soft requests (on resource1
>>>>>for the small jobs), or putting a sequence number on the nodes, so
>>>>>that resource1 nodes are filled first.
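>>>>>
>>>>>For example, something like this (a sketch; the resource, queue and
>>>>>hostgroup names are placeholders):
>>>>>
>>>>>    # soft request: prefer resource1, but run elsewhere if it's busy
>>>>>    qsub -soft -l resource1=true small_job.sh
>>>>>
>>>>>    # or sort queues by sequence number instead of load:
>>>>>    # in the scheduler configuration (qconf -msconf)
>>>>>    queue_sort_method    seqno
>>>>>    # and in the cluster queue (qconf -mq all.q), lower fills first
>>>>>    seq_no               100,[@resource1_hosts=10]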
>>>>>
>>>>>Cheers - Reuti
>>>>>
>>>>>
>>>>>Quoting Jon Savian <worknit at gmail.com>:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Hi Everyone,
>>>>>>
>>>>>>I am trying to allocate resources on a cluster, so I followed the
>>>>>>steps here:
>>>>>>http://gridengine.sunsource.net/project/gridengine/howto/resource.html
>>>>>>Let's say I created two resources; we'll call them resource1 and
>>>>>>resource2.  I want to be able to run a large job using resource2,
>>>>>>but if there are a lot of smaller jobs queued to run on resource2
>>>>>>then the larger job will have to wait until the smaller ones
>>>>>>execute.  Is there any way to move smaller jobs from the nodes on
>>>>>>resource2 and put them on resource1 (or any other non-resource2
>>>>>>nodes for that matter) so that the larger job may run on resource2
>>>>>>ASAP?  Or even better, are there any priorities that can be set on
>>>>>>the larger job that will put it before the smaller ones?
>>>>>>
>>>>>>Thanks.
>>>>>>
>>
> 

-- 
####################################################################
# Charu V. Chaubal              # Phone: (650) 786-7672 (x87672)   #
# Grid Computing Technologist   # Fax:   (650) 786-4591            #
# Sun Microsystems, Inc.        # Email: charu.chaubal at sun.com     #
####################################################################


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



