[GE users] A good spec for a GridEngine 6.1 QMaster

Andy Schwierskott andy.schwierskott at sun.com
Fri Jul 18 14:52:57 BST 2008


> I've been asked to buy a new server specifically to run QMaster on to make
> it as stable as possible and I was wondering if anyone could recommend
> hardware / operating system combinations.
> We've recently decided to migrate to Grid Engine v6.1 and our v6.1 tests so
> far show that it's a lot more stable than the v5 versions we've been using
> up until now.  We have approximately 60 execution hosts, each with 2x 3GHz
> Intel Xeon CPUs and 4GB of RAM.  We tend to run 4 jobs on an execution host
> at once resulting in 4 x 60 nodes = 240 nodes.  We are looking to expand
> this to approximately 75 execution hosts in the very near future. 300 nodes.
> Most jobs last hours, but we have a few jobs that last only 15 minutes.
> Also the grid can remain empty for days without being used, but at other
> times the grid can be maxed out with a queue of perhaps 1000 jobs or more.
> We used to run v5 on a RedHat8 OS and it used to lock up and crash
> regularly. Initially I did our v6.1 testing on an old SUN Netra T1 105
> 440MHz with 512MB RAM, running Solaris 10, but it proved to be too slow.  I
> was running it with all the execution hosts writing their logs back to that
> machine over an NFS share, which perhaps is what caused the performance
> problems?
> I then turned an execution host into the QMaster (2x 3GHz Intel Xeon CPUs
> and 4GB of RAM), running openSuse10.0 which is perhaps way more power than
> required.  Around this time I was advised on this mailing list to store the
> execution host logs locally on each execution host.  This setup performed
> without any problems.

For your workload this config is certainly more than sufficient. Grid
Engine's memory requirements are not exactly low, but 4GB for a few
thousand jobs should be a no-brainer. Just make sure that other apps don't
eat up the memory: as soon as SGE processes start swapping, you practically
no longer have a reliable SGE master/scheduler.
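
To keep an eye on swapping on a Linux qmaster host, something like the
following minimal sketch could be used. It is only an illustration: it
assumes the Linux /proc/meminfo format, the function names are made up for
this example, and host-wide swap usage is only a coarse signal (it does not
show whether the SGE processes themselves are swapped out).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Return swap currently in use (kB) for the whole host."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def is_swapping(meminfo, threshold_kb=0):
    """True if host-wide swap usage exceeds threshold_kb (hypothetical knob)."""
    return swap_used_kb(meminfo) > threshold_kb


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the kernel's memory summary (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On the master host one could call is_swapping(read_meminfo()) from a cron
job and raise an alert when it returns True. On Solaris a different data
source would be needed, since /proc/meminfo is Linux specific.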

> I've been told in the past that SUN hardware and the Solaris OS are more
> stable than Intel hardware and Linux, but is this still the case?  Does the
> Grid Engine QMaster run any better on Solaris than Linux, or are there
> preferred distros of Linux?  Also I'm told that rack mount servers are
> generally more stable than towers and that HP gear is more reliable than
> Dell.

Since Sun made the agreement with Intel to build Intel based servers last
year, our colleagues in the Solaris organization started lots of efforts to
improve Solaris for Intel x86 hardware (so far we had just AMD Opteron based
servers from the x86 world). Impressive improvements have been made since
then; however, I believe that SGE itself on Solaris 10 will run equally
well on AMD or Intel hardware.

A surprising thing for us is that SGE shows visibly (i.e. measurably) more
scalable behavior on Solaris 10 (x64) than on Linux. Reasons are, for
example, the horribly bad malloc library on Linux (from which SGE suffers
especially with SGE 6.2) or worse behavior of the network stack together
with the SGE communication library. Anyone who knows our source code will
certainly agree that we've made no specific Solaris optimizations (or Linux
de-optimizations) in our code to give SGE a boost on Solaris :-)

For your use case Linux would certainly perform well. But it's definitely
a safe bet to go with SGE on Solaris 10 on an x64 Sun server. In our lab we
use Sun X4100 servers on Solaris and Linux to do our performance benchmarks.


> Far too many permutations to test / evaluate so I was wondering if anyone
> has had any success (or horror) stories to help me choose.
> Thanks in advance for any help you may be able to offer.
> Regards
> Neil
