[GE users] asking about speeding up load reporting -- johnny layne

Johnny Layne laynejg at vcu.edu
Tue Sep 25 22:07:13 BST 2007



hi all,
    This will probably be my last update on this so I can quit spamming 
the list, but I did want to share results.

    I re-wrote my memory hog job (it's an image processing code I wrote 
just to play with bitmaps) to make it childishly parallel (half the 
array of bitmaps to one node for "processing", half to the other).  I 
added another node to my memq and ran the job, afraid that when memory 
usage got too large on one node, the other would merrily keep on 
chugging somehow.  Nope!  I'm delighted to say it worked just as well 
under these much more severe circumstances, and it is really neat 
sitting and watching these jobs run.  The pattern was this: launch the 
job, watch memory usage on both nodes via top, watch the job go into 
the suspended state via qstat, watch the memory used on the nodes drop 
(well, if it doesn't for some reason you're stuck...), then watch the 
jobs resume once enough memory is available.  Perfect.
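
     In case it helps anyone trying the same thing, the knobs behind 
all this are just the suspend-related lines in the queue configuration 
(queue_conf(5)).  Roughly what my memq looks like -- take the exact 
values as illustrative, not a recommendation:

     # qconf -sq memq      (only the suspend-related lines shown)
     suspend_thresholds   mem_free=1.5G   # suspend jobs on a host once its free memory drops below this
     nsuspend             1               # jobs suspended per interval while still over threshold
     suspend_interval     00:01:00        # how often another nsuspend jobs get suspended (default is 5 min)
     suspend_method       NONE            # NONE = plain SIGSTOP, with SIGCONT on resume

     While a hog is running you can watch the queue and job states with 
qstat -f, and the reported load values with qhost -F mem_free.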

     Now I'm sure that under some extreme circumstances a job could 
gobble memory so fast it can't be caught in time, but that's going to 
be really, really rare for us and for what's generally done here, and 
there's always some way for something to break anyway.  So, to finally 
sum up, this seems like a really "good enough" solution to the 
high-memory jobs that have been worrying us, even if parallel ones come 
along.
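
     The one knob that matters for that "gobbles memory too fast" case 
is how often the execds report their load values.  That's 
load_report_time in the global cluster configuration (sge_conf(5)); a 
shorter interval narrows the window where a job can blow past the 
threshold before the qmaster notices, at the cost of a little more 
reporting traffic.  Just as a pointer to where the setting lives (the 
default is 40 seconds):

     # qconf -mconf         (global configuration; edit this line)
     load_report_time     00:00:40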
    johnny

> hi all,
> Just an update on my experiences playing with mem_free and the 
> load_report_time value. I changed my job so that the memory hogging 
> occurs more gradually, so that after about 5 minutes the node is 
> really getting hammered by 3 or 4 of these jobs. Using mem_free as a 
> Suspend Threshold with a value of 1.5G (well I used K units actually) 
> on my node with 4G RAM, the jobs were managed beautifully; this worked 
> even better than when I ran the jobs so that the memory grabbing 
> occurred at the beginning. This is much more like the real jobs we run, 
> so this really made me happy. In fact I'm going to set the 
> load_report_time back to 
> 40s and see how I like that.
>
> Yes Reuti, the default suspend method worked great; everything worked 
> great, in fact. I'm going to keep running a few more tests, then grab 
> some more nodes for my testing and see how that goes before we hit 
> the whole cluster with this, but I'm already pretty sure that this is 
> what we were looking for.
> johnny
>




