[GE users] how to throttle jobs into a queue

Daniel Templeton Dan.Templeton at Sun.COM
Fri Aug 24 18:42:34 BST 2007


David,

Are you using 6.1?  I just tried the same thing with my 6.1 cluster, and 
it also had no effect.  I tried the same thing with my 6.0u10 cluster 
and it worked.  I'm now downloading the latest 6.1u2 binaries, to try it 
there as well.  I don't see an issue listed for the problem, but it may 
have been fixed in an update release nonetheless.

Daniel

david zanella wrote:
> I agree that this will probably work, but it isn't exactly what I'm looking 
> for. 
>
> In my case, the users are submitting several thousand jobs at a time. They 
> cannot predict (or don't want to take the time to) how much memory a job will 
> use. If they flag each job as using 2G of memory, then the consumable resource 
> will run out at 15 or 16 jobs. Using my current load thresholds, I'm getting 
> 22-27 jobs on each server. I lose a lot of throughput if I do this. 
>
> Using qconf -msconf and changing job_load_adjustments from
> np_load_avg=0.5 to np_load_avg=2.0 with a load_adjustment_decay_time of
> 15 minutes *SHOULD* do it (man sched_conf)...but it doesn't seem to be
> having any effect. That is, upon each job submission, it should
> artificially increase the np_load_avg to 2.0 (alarm is set at 1.75) and
> then decay that setting down for 15 minutes. That should give the job
> enough time to ramp up and start using memory and trip my memory and
> swap triggers. 
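
A rough sketch of that arithmetic, assuming the linear decay described in
sched_conf(5) and assuming the correction is added to np_load_avg unscaled
(whether the scheduler scales np_ values per processor is not checked here).
The function names are hypothetical, made up for illustration only.

```python
def decayed_adjustment(initial, elapsed_s, decay_time_s):
    """Load correction still in force elapsed_s seconds after dispatch.

    Assumes a linear decay from `initial` at dispatch down to 0 at decay_time_s.
    """
    if elapsed_s >= decay_time_s:
        return 0.0
    return initial * (1.0 - elapsed_s / decay_time_s)


def seconds_above_threshold(initial, decay_time_s, threshold, real_load=0.0):
    """Seconds for which real_load + correction stays at or above threshold."""
    headroom = threshold - real_load
    if headroom <= 0:
        return float("inf")   # the real load alone already trips the alarm
    if initial <= headroom:
        return 0.0            # the correction never reaches the alarm
    return decay_time_s * (1.0 - headroom / initial)


# Values from the message: adjustment 2.0, decay time 15 minutes, alarm at 1.75.
print(decayed_adjustment(2.0, 450, 900))         # 1.0 halfway through the decay
print(seconds_above_threshold(2.0, 900, 1.75))   # 112.5
```

Under those assumptions, on an otherwise idle host the 2.0 correction stays
at or above the 1.75 alarm for only about two minutes of the 15-minute decay
window.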
>
>
>
>
> ------------- Begin Forwarded Message -------------
>
> From: "Kogan, Felix" <Felix-Kogan at deshaw.com>
> To: <users at gridengine.sunsource.net>
> Subject: RE: [GE users] how to throttle jobs into a queue
>
> I've had the same problem and came up with the following solution (still
> in testing phase):
>
> o Make mem_free a requestable and consumable attribute
>
>     $ qconf -sc
>     #name      shortcut  type    relop  requestable  consumable  default  urgency
>     #-----------------------------------------------------------------------------
>     ...
>     mem_free   mf        MEMORY  <=     YES          YES         0        0
>     ...
>
> o Set the resource value to the real amount of RAM for each node
>  
>     qconf -mattr exechost complex_values mem_free=32G hostname.foo.bar.com
>
> Once this is done, users can use "-l mem_free=2G" to really reserve 2GB
> of RAM. The mem_free reading of the host where the job executes will
> show 2GB less. If the job in fact consumes 2.5GB, mem_free will reflect
> that, i.e. SGE uses the smaller of two values: the one calculated from
> internal accounting and the one received from the load sensor. This works
> for all other standard or custom requestable and consumable attributes,
> as long as a custom load sensor is set up for these (e.g. you can set
> this up for /var/tmp space).
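
The "smaller of two values" rule described above can be sketched as follows.
This is only an illustration of the bookkeeping as Felix describes it, not
SGE's actual code; the function names and the sample sensor readings are
assumptions.

```python
def effective_mem_free(capacity_gb, requests_gb, sensor_gb):
    """mem_free as the scheduler would see it, per the description above.

    capacity_gb: complex_values mem_free configured on the exec host
    requests_gb: "-l mem_free=..." requests of the jobs running on that host
    sensor_gb:   mem_free value reported by the load sensor
    """
    booked = capacity_gb - sum(requests_gb)   # internal accounting
    return min(booked, sensor_gb)             # the smaller value wins


def max_jobs_by_request(capacity_gb, request_gb):
    """Number of jobs that fit if every job requests request_gb."""
    return int(capacity_gb // request_gb)


# One job requested 2G but actually uses 2.5G: the sensor value is lower.
print(effective_mem_free(32, [2], 29.5))       # 29.5
# Fifteen jobs requested 2G each but are still small: accounting is lower.
print(effective_mem_free(32, [2] * 15, 10))    # 2
# With a blanket 2G request, a 32G host fills up at 16 jobs.
print(max_jobs_by_request(32, 2))              # 16
```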
>
>
> Hope that helps.
>
> --
> Felix Kogan
>
> -----Original Message-----
> From: david zanella [mailto:zanella at mayo.edu] 
> Sent: Friday, August 24, 2007 11:46 AM
> To: users at gridengine.sunsource.net
> Subject: [GE users] how to throttle jobs into a queue
>
>
> I have a group of users that are submitting jobs to my grid.  The jobs
> do some sort of pedigree/chromosome calculations. It is impossible for
> the user to predict or control the amount of memory for each job.
> Consequently, some jobs will start out small, grow to about 2G in
> size, and run for weeks; other jobs can be as small as a few hundred meg
> and finish up in an hour.
>
> I have set up load thresholds that will suspend job submission if the
> available mem_free < 2G or swap_used > 6G.  For the most part, this
> works well.  I have 7 T2000's for execute hosts.
>
> Here's the problem:
>
> My T2000's have 32G of memory and I have 30 slots for each. With the
> load thresholds in place, say the server is only running 20 jobs. A job
> completes and the server is now below its load threshold. The qmaster
> sees this and immediately shoves 11 jobs at the server.  Pretty soon,
> the jobs grow, I run out of memory and swap, and jobs start crashing.
>
> What I need is some way to throttle the acceptance rate to the server.
> To tell the server to accept one job, then re-evaluate in, say, 15 or
> 30 minutes. If the load thresholds give a green light, it'll accept
> another job.
>
> I've looked at sched_conf, and it has what appears to be what I need.
> I've made various adjustments to job_load_adjustments and
> load_adjustment_decay_time, but these have not had any effect.
>
> Am I missing something? Is there a better way to accomplish what I'm
> trying to do?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
> ------------- End Forwarded Message -------------
>   
