[GE users] Re: Immediate Job Suspension Experiment

Clements, Brent M (SAIC) clembm at bp.com
Thu Oct 6 19:21:55 BST 2005


You can accomplish what you want by using a more dynamic and flexible
scheduler such as Maui (www.clusterresources.com) in conjunction with
SGE.

You would just use SGE as the execution system, while the scheduling
part would be handled by Maui.
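
A rough sketch of the Maui side, assuming Maui is built with its SGE
interface enabled; the group names and targets are only examples, and the
directives should be checked against the Maui admin guide:

    # maui.cfg fragment (illustrative only)
    RMCFG[base]       TYPE=SGE

    # fairshare targets per department (hypothetical groups)
    FSPOLICY          DEDICATEDPS
    GROUPCFG[depta]   FSTARGET=60
    GROUPCFG[deptb]   FSTARGET=25

    # suspend lower-priority work instead of waiting for it to drain
    PREEMPTPOLICY     SUSPEND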

Brent
 

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Thursday, October 06, 2005 4:31 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Re: Immediate Job Suspension Experiment

Hi,

Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
> Hi Mike,
> 
> thanks for the assistance. Though, it is just about convenience. :-) 
> Sure, it would be an RFE. Would you file one with a good description of
> the use case and how you would like to have it implemented?

there is already an RFE to suspend slots instead of the whole queue:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1245

In some way it's related. If I understood it correctly, the new feature
would be something like a slot-share policy with immediate effect. - Reuti
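
For reference, queue_conf(5) already accepts a slot threshold on a
subordinate entry; it still suspends the whole queue instance on that host
once the threshold is reached, not individual slots, which is what the RFE
asks for. The queue names here are just examples:

    # in the superordinate (group) queue, e.g. qconf -mq depta.q
    subordinate_list   all.q=2   # suspend all.q on a host once 2 depta.q slots are occupied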


> Thank you very much,
> Stephan
> 
> Mike Brown wrote:
> 
>> I realize this is a couple of weeks back, but I'd like to add to this
>> issue.  What Steve is asking for is exactly what we try to accomplish
>> at our site.  Users want immediate use of "their" resources.
>> Stephan, to answer your question, I think he is saying he needs to 
>> configure a host group with specific hosts (less convenient) instead 
>> of using a shorthand to say 'give N hosts to group A' (more 
>> convenient).  In my configuration, and possibly his, the hosts have no
>> meaningful differences, and thus no need to be grouped by name.
>>
>> At our site, I set up group and common queues, where the group queues
>> always suspend the common queues.  But I need to manually name the
>> machines that belong to each group.
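
For concreteness, that kind of manual setup looks roughly like the
following; the host, group and queue names are made up:

    # named host group for group A
    qconf -ahgrp @groupa
       group_name  @groupa
       hostlist    node01 node02 node03 node04

    # group A's queue runs on those hosts and suspends the common queue there
    qconf -mq groupa.q
       hostlist          @groupa
       subordinate_list  common.q
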
>>
>> There are benefits to what he is suggesting.  Imagine 2 group queues 
>> with a subordinate common queue.  Currently, if the common queue is 
>> 50% full, and group A wants to submit jobs to their group queue, some
>> portion of the jobs in the common queue will be suspended, because
>> group A's queue is defined in terms of fixed machines.  Using his
>> suggestion, group A would have N/2 machines, and avoid suspending any
>> jobs in the common queue.
>> I realize it may be difficult to implement, but I think it is 
>> desirable behavior.  In LSF, I think you could define the number of jobs
>> per group per queue.  Could this be accomplished as an enhancement?
>> Mike
>>
>>
>> Steve Pittard wrote:
>>
>>> Hi, I posted the question (see below) about getting SGE to immediately
>>> suspend jobs in favor of jobs from another user who was entitled to
>>> a predetermined share of the cluster. Based on the responses I don't
>>> think this is possible, though I did experiment with the following
>>> scenario that sort of emulates it. I'm using SGE 6 in this example.
>>>
>>> So I have the all.q which is a cluster queue containing all nodes
>>> and anyone can submit to this. I created a depta.q cluster queue
>>> which "overlayed" some of the nodes specified in the all.q. That
>>> is depta.q contains some of the nodes that all.q does. I then made
>>> all.q a subordinate queue to depta.q. So I kicked off lots of jobs
>>> into all.q and let them run for a while. Then I submitted some jobs
to
>>> depta.q and it suspended the jobs running on all.q hosts that were
>>> also contained in depta.q leaving other all.q jobs running. So in
some
>>> sense this is close to what I want though I still would have to make
>>> reference to specific nodes when allocating resources for a specific
>>> department.
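
In qconf terms that experiment boils down to something like the following
(host names are illustrative; by default the all.q instance on a host is
suspended once the depta.q slots there are filled):

    # overlay queue for dept A on a subset of the all.q hosts
    qconf -aq depta.q
       hostlist          node01 node02 node03 node04
       subordinate_list  all.q
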
>>
>>
>>
>> Sorry, I am a bit lost. What do you need to specify? Could you give me
>> some more details?
>>
>> You can use access-lists to restrict the use of depta.q to the depta 
>> users.
>> You can use queue sorting by sequence number to ensure that depta.q is
>> always used before the all.q.
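
Roughly, with hypothetical user and access-list names:

    # access list for dept A, attached to their queue
    qconf -au usera1,usera2 deptausers
    qconf -mq depta.q
       user_lists   deptausers
       seq_no       0

    # sort queues by sequence number so depta.q is filled before all.q
    qconf -msconf
       queue_sort_method   seqno
    qconf -mq all.q
       seq_no       10
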
>>
>>
>> What else is missing?
>>
>> Kind Regards,
>> Stephan
>>
>>>
>>>
>>>
>>> On Wednesday, September 14, 2005, at 06:49 PM, Steve Pittard wrote:
>>>
>>>>>
>>>>>> I've read and implemented some of the concepts described
>>>>>> relative to sharing and functional policies that permit
>>>>>> proportional/weighted use of the cluster in situations where
>>>>>> departments share the cluster in a predetermined way
>>>>>> (e.g. dept A gets 60%, dept B gets 25%, and 15% is up for
>>>>>> grabs by common users (users who didn't buy into the cluster).
>>>>>> And anyone gets 100% if the other department is not using it).
>>>>>>
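
For the proportional part (though not the immediate suspension), SGE's
functional policy can carry such targets; the project names below are
hypothetical:

    # functional shares roughly matching the 60/25/15 split
    qconf -aprj depta    # set fshare to 60 in the editor
    qconf -aprj deptb    # set fshare to 25
    qconf -aprj common   # set fshare to 15
    # and give weight_tickets_functional a non-zero value via qconf -msconf
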
>>>>>> Okay so the big concern my users have is that the functional
>>>>>> and share mechanisms guarantee the proportion "over time" whereas
>>>>>> users (at least mine) want to see immediate suspension of the other
>>>>>> department's jobs if those jobs are occupying more than their
>>>>>> guaranteed proportion.
>>>>>>
>>>>>> In some cases this is legitimate, as some of our jobs run
>>>>>> for weeks, and it is of little consolation that your jobs float to the
>>>>>> top of the pending list since you still have to wait for the
>>>>>> running jobs to complete, at least from what I've seen.
>>>>>>
>>>>>> The subordinate queue mechanism does seem to have some promise,
>>>>>> though. What is necessary to implement subordinate queues that
>>>>>> ensure predetermined proportional use and provide for
>>>>>> immediate suspension of jobs running on someone's slice of
>>>>>> the cluster?
>>>>>>
>>>>>> So in my case let's say Dept A expects to get 60%. Let's assume
>>>>>> no one is running jobs, so Dept A is getting all the nodes. A day
>>>>>> later Dept B shows up, submits some jobs and expects to get their
>>>>>> 25% slice immediately, but doesn't want to wait for the Dept A jobs
>>>>>> to drain before that happens. So they would expect to see some
>>>>>> number of Dept A jobs suspend. And it should work the other way
>>>>>> also. If Dept B jumps on the cluster and is using all resources,
>>>>>> then Dept A shows up and wants their 60% *immediately*.  Are there
>>>>>> ways to do this? I've used LSF like this previously.
>>>>>>
>>>>>> Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list