[GE users] A good spec for a GridEngine 6.1 QMaster

Joe Landman landman at scalableinformatics.com
Fri Jul 18 14:29:33 BST 2008

Neil Baker wrote:
> Hi Grid users,
> I've been asked to buy a new server specifically to run QMaster on to 
> make it as stable as possible and I was wondering if anyone could 
> recommend hardware / operating system combinations. 

For simplicity's (and sanity's) sake, try to keep the OS on the QMaster 
the same as on the execution hosts if possible.  One less boundary to 
deal with.


> We used to run v5 on a RedHat8 OS and it used to lock up and crash 
> regularly. Initially I did our v6.1 testing on an old SUN Netra T1 105 

That was a feature ... er ... bug ... of RH8

> 440MHz with 512MB RAM, running Solaris 10, but it proved to be too 
> slow.  I was running it with all the execution hosts writing their logs 
> back to that machine over an NFS share, which perhaps is what caused the 
> performance problems?

Likely.  NFS can be incredibly intensive, and every execution host 
writing its logs back to a 440MHz, 512MB machine is a lot of load.

> I then turned an execution host into the QMaster (2x 3GHz Intel Xeon 
> CPUs and 4GB of RAM), running openSUSE 10.0 which is perhaps way more 
> power than required.  Around this time I was advised on this mailing 

We have run large clusters from relatively light qmasters.  The usual 
limit is memory first, and then network bandwidth.

> list to store the execution host logs locally on each execution host.  
> This setup performed without any problems.
> I've been told in the past that SUN hardware and the Solaris OS are more 
> stable than Intel hardware and Linux, but is this still the case?  Does 

No, it never really was the case.  We have seen as many problems on 
RISC-based systems as we have on CISC systems.  The OSes also all crash.  
We have Linux units up for 550+ days, and Solaris boxes that can't deal 
with a bonnie++ run and crash hard.  Others may have different mixes and 
experiences, but the net of it is that all units have bugs and problems.

To simplify management, it makes sense to have the QMaster run the same 
OS as the execution systems if possible.  If everything is Linux, stay 
with the same version of Linux.  If everything is IRIX, Solaris, HPUX, 
AIX, ...  stay with that.  It complicates administration to do otherwise.
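
A minimal sketch of that consistency check, for illustration only.  The 
host names and OS strings are a made-up inventory, and the helper names 
majority_os and os_mismatches are hypothetical, not part of Grid Engine.  
It just picks the OS most of the execution hosts run and lists the hosts 
that differ from it:

```python
from collections import Counter


def majority_os(exec_host_os):
    """Return the OS string reported by the most execution hosts."""
    return Counter(exec_host_os.values()).most_common(1)[0][0]


def os_mismatches(reference_os, exec_host_os):
    """Return the sorted names of hosts whose OS differs from reference_os."""
    return sorted(h for h, name in exec_host_os.items() if name != reference_os)


# Hypothetical inventory; in practice you would fill this from your own records.
exec_hosts = {
    "node01": "openSUSE 10.0",
    "node02": "openSUSE 10.0",
    "node03": "Solaris 10",
}

target = majority_os(exec_hosts)
print("Suggested QMaster OS:", target)
print("Hosts that differ:", os_mismatches(target, exec_hosts))
```

Anything that shows up in the second list is a boundary you will have 
to administer separately.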

What I would suggest is a single-socket system with lots of RAM.  RAM is 
inexpensive these days; get the ECC flavor.  Get a quad core CPU (it's 
hard to find single core units, and dual core and quad core have similar 
costs).

> the Grid Engine QMaster run any better on Solaris than Linux, or are 
> there preferred distros of Linux?  Also I'm told that rack mount servers 
> are generally more stable than towers and that HP gear is more reliable 
> than Dell.

Hmmm... I hear lots of biases being fed to you, and not really much 
data.  IMO Solaris isn't better than Linux for this, and HP gear and 
Dell gear are quite similar (for a number of reasons) in terms of 
reliability.  Since the same companies build the Dell and HP (and IBM, 
and ...) gear, it should be no surprise.

Rack mounts more stable than towers?  If the configs are identical, I 
think stability issues may be attributable to location, quality of 
power/cooling, and the occasional "server tipping" that happens when you 
are in a tower ...

> Far too many permutations to test / evaluate so I was wondering if 
> anyone has had any success (or horror) stories to help me choose.

Keep your hardware and software consistent.  If you buy Dell, then buy 
Dell.  If you get Sun, or HP, or others, then get their stuff.  I might 
suggest that you speak with some folks in the UK HPC vendor scene.  We 
work with Streamline and they are quite good, and very helpful to their 
customers.  Lots of GE expertise.  Fire me a note if you need a contact 
email of a technical person.


> Thanks in advance for any help you may be able to offer.
> Regards
> Neil

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
