[GE users] Newbie: @group and fractional usage

David Kulp dkulp+sge at cs.umass.edu
Wed May 17 14:33:41 BST 2006


On May 17, 2006, at 7:07 AM, Reuti wrote:

> Hi David,
>
> Am 17.05.2006 um 07:03 schrieb David Kulp:
>
>> I have two questions as a new grid engine user.
>>
>> First, I'm running on Linux, and my attempts to create a userset
>> with the @unixgroup notation don't seem to work.  qmon accepts it,
>> but subsequent commands don't recognize it.  For example, I added
>> "@group" to the deadlineusers userset, but when I try to submit
>> deadline jobs I get the error 'job rejected: the user "dkulp" is no
>> deadline initiation user'.  Deadline job submission only works
>> when I add my username explicitly to the deadlineusers userset, and
>> I don't want to do that for every new user.
>>
>>
>> Second, I would like to implement a usage policy that removes  
>> (reschedules/migrates) a user's jobs from running queues if the  
>> user is
>
> By default this policy isn't implemented in SGE. Even if there
> were such a policy, it would be hard to decide which of a user's
> running jobs to kill.
>
>> currently exceeding his fractional share and there is a demand for  
>> resources.  I've set up a share tree, which works well when all  
>> running jobs are short.  However, we want a policy that preempts  
>> running programs according to that share tree policy.
>>
>> I would think that our scenario is common, but I haven't found  
>> anything on this.  Our compute cluster is fractionally owned by  
>> multiple groups; that is, different groups have contributed  
>> nodes.  Usage is bursty, but jobs can sometimes run for days.
>> Suppose Alice and Bob each own 50% of the cluster.  Initially the  
>> cluster is idle, so when Alice submits her jobs they fill up all  
>> the queues for 100% utilization.  Then Bob wants to run his jobs.   
>> If Alice's jobs are short, then the share tree policy would  
>> quickly balance out the resource usage to 50-50.  But if Alice's  
>> jobs run for days, then Bob is stuck waiting.  Alice and Bob would
>> both prefer that Alice's jobs simply be terminated (or checkpointed)
>> and rescheduled.
>
> What you can try is a script that runs periodically, parses the
> qstat output, and suspends some jobs if it discovers that some
> other user's jobs should run. With checkpointing defined, the
> suspend will checkpoint and reschedule the job; alternatively, you
> could reschedule the jobs directly yourself.

Yes.  I had considered making a custom load sensor that would use
qstat and qconf to determine whether a user with an inadequate
current share had jobs waiting for a long while and, if so,
reschedule some running jobs (suspending under checkpointing causes
a reschedule).  But that would be too blunt an approach, since the
queue wouldn't be able to determine which jobs to kill, as you say
above.
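
The detection half of such a script might look roughly like the
sketch below.  It is only an illustration in Python, not anything
that ships with SGE.  It assumes the default plain-text column layout
of qstat (job ID, priority, name, user, state, date, time, then a
queue column for running jobs only, then slots), assumes job names
contain no whitespace, looks only at the states "r" and "qw", and
takes the share fractions from a hand-written dict instead of reading
the share tree via qconf.  The function names and the sample listing
are made up for the example.

```python
from collections import Counter

# Hypothetical listing in the assumed qstat column layout.
SAMPLE = """\
job-ID  prior   name       user    state submit/start at     queue          slots
---------------------------------------------------------------------------------
    231 0.55500 sim.sh     alice   r     05/17/2006 09:12:01 all.q@node01       1
    232 0.55500 sim.sh     alice   r     05/17/2006 09:12:01 all.q@node02       2
    240 0.50000 fit.sh     bob     qw    05/17/2006 09:15:44                    1
"""


def summarize_qstat(text):
    """Return (running slots per user, set of users with pending jobs)."""
    running = Counter()
    pending = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        user, state = fields[3], fields[4]
        if state == "r" and len(fields) >= 9:
            # Running jobs carry a queue column before the slot count.
            running[user] += int(fields[8])
        elif state == "qw":
            pending.add(user)
    return running, pending


def starved_users(running, pending, shares):
    """Users with pending jobs who hold less than their share of busy slots."""
    total = sum(running.values())
    if total == 0:
        return []  # nothing is running, so nobody is being crowded out
    return [user for user in sorted(pending)
            if running[user] / total < shares.get(user, 0.0)]


running, pending = summarize_qstat(SAMPLE)
print(starved_users(running, pending, {"alice": 0.5, "bob": 0.5}))
```

This only answers "is somebody being starved?"; it says nothing yet
about which running jobs should give way.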

It would seem like a nice extension to SGE to have the functionality  
of a script like you describe.  Ideally, jobs would be selected for  
rescheduling based on the user's actual usage compared to their  
allocated share as well as the time that the jobs had been running.   
The interval would be adjustable in the same way as current load  
thresholds.
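
To make the selection rule concrete, here is a minimal sketch in
Python.  Everything in it is an assumption made for illustration:
the RunningJob record, the pick_jobs_to_reschedule name, the choice
to sacrifice the most recently started jobs first (so the least work
is lost), and the min_runtime_s grace period.  It also measures each
user's share against the slots currently in use, which only makes
sense when the cluster is full.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: int
    user: str
    slots: int
    runtime_s: float  # how long the job has been running, in seconds


def pick_jobs_to_reschedule(jobs, shares, min_runtime_s=0.0):
    """Return IDs of jobs to reschedule so over-share users drop to their share.

    shares maps user -> entitled fraction of the busy slots (0.0 to 1.0).
    """
    total = sum(job.slots for job in jobs)
    used = {}
    for job in jobs:
        used[job.user] = used.get(job.user, 0) + job.slots
    # Slots each user holds beyond their entitlement.
    excess = {user: used[user] - shares.get(user, 0.0) * total for user in used}

    picked = []
    # Youngest jobs first: rescheduling them throws away the least work.
    for job in sorted(jobs, key=lambda j: j.runtime_s):
        if job.runtime_s < min_runtime_s:
            continue  # still inside the grace period
        if excess[job.user] >= job.slots:
            picked.append(job.job_id)
            excess[job.user] -= job.slots
    return picked


jobs = [RunningJob(1, "alice", 1, 100.0), RunningJob(2, "alice", 1, 200.0),
        RunningJob(3, "alice", 1, 300.0), RunningJob(4, "alice", 1, 400.0)]
print(pick_jobs_to_reschedule(jobs, {"alice": 0.5, "bob": 0.5}))
```

In the Alice and Bob scenario this frees half of the slots Alice
holds; the selected job IDs would then be handed to whatever
mechanism actually reschedules them.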

This problem all boils down to long-running jobs.  If all jobs were
short, then the share tree solution would be fine.

-d

>
> HTH - Reuti
>
>
>>
>> The only solution that I can think of is to create two queues for  
>> every host, one queue for Alice and one for Bob.  On 50% of the  
>> hosts the Alice queue will be subordinate to Bob.  Vice versa on  
>> the other half.  But this requires a lot of manual queue  
>> configuration as the number of cluster owners increases.  It would  
>> be nice if there were some more general scheme like the share  
>> tree.  In other words, I would like the share tree to effect  
>> preemption policy.  Any ideas?
>>
>> Thanks in advance.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




