[GE users] asking about speeding up load reporting -- johnny layne

Reuti reuti at staff.uni-marburg.de
Fri Sep 21 11:02:45 BST 2007


Hi,

On 20.09.2007 at 23:56, Johnny Layne wrote:

>     I hope that I can get some info from folks experienced with  
> this.  On a cluster with 116 nodes, we're trying to set things so  
> that users don't oversubscribe memory.  I don't want to depend on  
> users adding "-l" options to their qsub scripts, they just won't do  
> it reliably.

If you make the complex h_vmem consumable and forced, they have to do
it. And if they request too little, the job will be killed. You would
just have to set h_vmem for each exec host to a sensible value. I
choose roughly the installed memory minus 100M for old nodes with only
1GB, and the full physically installed memory for current nodes with
8GB upwards. In the worst case a little swapping won't hurt too much.
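
A minimal sketch of that per-host rule, in case it helps to script it
for many nodes. The host names, the 8GB cut-off used for nodes between
1GB and 8GB, and the helper names are assumptions for illustration; only
the "minus 100M" headroom and the `qconf -mattr exechost complex_values`
call are taken as given. It only prints the commands, it does not run
them.

```python
# Sketch: derive a per-host h_vmem value and the qconf call that would set it.
# Host names and the cut-off below are hypothetical, not from the thread.

SMALL_NODE_RESERVE_MB = 100     # headroom held back on small nodes
LARGE_NODE_MIN_MB = 8 * 1024    # assumed cut-off for "8GB upwards"


def h_vmem_for_host(installed_mb: int) -> str:
    """Return an h_vmem value such as '924M' for a node with installed_mb of RAM."""
    if installed_mb >= LARGE_NODE_MIN_MB:
        usable_mb = installed_mb                      # large node: full memory
    else:
        usable_mb = installed_mb - SMALL_NODE_RESERVE_MB
    return f"{usable_mb}M"


def qconf_command(host: str, installed_mb: int) -> str:
    """Build the command that sets the consumable's capacity on one exec host."""
    value = h_vmem_for_host(installed_mb)
    return f"qconf -mattr exechost complex_values h_vmem={value} {host}"


if __name__ == "__main__":
    hosts = {"node001": 1024, "node050": 8192}        # hypothetical inventory (MB)
    for host, installed_mb in hosts.items():
        print(qconf_command(host, installed_mb))
```

Reviewing the printed commands before running them keeps the actual
change to the exec hosts a deliberate step.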

If you would rather not force it, setting a default value for the
complex would be another option. If that turns out not to be enough
memory for a job, the users will request more the next time.
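
Both variants come down to one line in the complex configuration (the
table edited with `qconf -mc`). A small sketch that builds that line,
assuming the usual column order name, shortcut, type, relop,
requestable, consumable, default, urgency; the helper name and the 1G
default are made up for illustration.

```python
# Sketch: build the h_vmem line of the complex configuration for both variants.
# Column order assumed: name shortcut type relop requestable consumable default urgency

def h_vmem_complex_line(forced: bool, default: str = "0") -> str:
    """Return the h_vmem complex line, either forced or with a default value."""
    requestable = "FORCED" if forced else "YES"   # FORCED: every job must request it
    fields = ("h_vmem", "h_vmem", "MEMORY", "<=", requestable, "YES", default, "0")
    return " ".join(fields)


if __name__ == "__main__":
    print(h_vmem_complex_line(forced=True))                  # users must request h_vmem
    print(h_vmem_complex_line(forced=False, default="1G"))   # hypothetical 1G default
```

With the forced variant the default column does not matter, since no job
can be submitted without its own request.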

> So I've set aside a node just for me to play with and I've been  
> running some memory-intensive jobs & watching what happens as I  
> adjust & play with mem_free and other values for suspend  
> thresholds.  It works really great to set the mem_free value in  
> "Suspend Thresholds" to a value such that when my jobs get to using  
> too much memory the suspend script I wrote kicks in.

The default suspend isn't working for you?

>   All of that is fine.

But in this case more and more memory will be swapped to disk anyway
(okay, only once, to get the suspended job out of the way).

-- Reuti


>     Now what I'm wondering, how is it working for people trying to  
> speed up the reporting time of the load values?  I just changed  
> load_report_time for this 1-node queue to 20 seconds from the  
> default 40, and watched my jobs using top & other methods.  The  
> newest memory hog got suspended nicely & restarted OK as usual,  
> this time it happened (the suspend-restart-run process) very  
> cleanly and quickly compared to the 40 second report time.  That's  
> really nice!  However, I'm sure as I double the number of load  
> reports that occur, communication costs are getting a lot worse on  
> the cluster.  Have any of you experimented with this sort of  
> thing?  Do you have suggestions about how to test reporting time  
> values to find an optimal one beyond the trial & error method I'm  
> using?  Do you have any other suggestions about this sort of  
> thing?  Thanks a lot for any information!
>
>     Oh the cluster uses gigabit networking.
>     johnny
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
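
As a rough way to reason about the communication cost asked about above,
the load-report rate can simply be computed from the node count and
load_report_time. This is a back-of-the-envelope sketch, not a
measurement: the function names are made up, and the size of a single
load report is not given in the thread, so it is left as a parameter to
be measured on the real cluster. It also says nothing about the
processing cost inside the qmaster.

```python
# Sketch: estimate how many load reports per second reach the qmaster.

def reports_per_second(nodes: int, load_report_time_s: float) -> float:
    """Each exec host sends one load report per load_report_time interval."""
    return nodes / load_report_time_s


def link_fraction(nodes: int, load_report_time_s: float,
                  report_bytes: int, link_bits_per_s: float = 1e9) -> float:
    """Fraction of one link (gigabit by default) used by load reports.

    report_bytes is an assumption the caller must supply from measurement.
    """
    bits_per_s = reports_per_second(nodes, load_report_time_s) * report_bytes * 8
    return bits_per_s / link_bits_per_s


if __name__ == "__main__":
    for interval in (40, 20):
        rate = reports_per_second(116, interval)
        print(f"load_report_time={interval}s: {rate} reports/s")
```

For the 116 nodes mentioned, this gives 2.9 reports per second at the
default of 40 seconds and 5.8 at 20 seconds.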




