[GE users] Some interesting results involving Nehalem & Hyperthreading -- comments appreciated

elrad elrad at brandeis.edu
Thu May 14 23:01:54 BST 2009



Hi all,

I've been adding new compute nodes to our HPCC based on Intel's new Nehalem architecture and, in the process of testing, came up with some interesting results about the effects of hyperthreading (HT) on various job types and, most importantly to this list, the implications for how SGE should deal with HT-enabled computational resources. All the results are from the same Dell R610 machine with 2x Xeon E5540 @ 2.53 GHz (4 cores each).
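
For what it's worth, the logical-to-physical CPU mapping can be inspected on Linux through sysfs, and HT siblings can be taken offline without a reboot as a rough approximation of disabling HT in the BIOS. A minimal sketch, assuming the standard sysfs layout (the sibling numbering below is only an example -- check your own topology first):

# Show which logical CPUs share a physical core.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$(basename $c): siblings $(cat $c/topology/thread_siblings_list)"
done

# Take the HT sibling of each core offline (requires root).
# On many 2x quad-core Nehalem boxes the siblings are cpu8-cpu15,
# but that is NOT guaranteed -- adjust to the listing above.
for n in 8 9 10 11 12 13 14 15; do
  echo 0 > /sys/devices/system/cpu/cpu$n/online
done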

First up is NAMD 2.7b1 -- compiled with g++ (I'll do an icc build at some point; I'm having build issues) for x86_64, running the apoa1 benchmark (http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz) suggested in the release notes. The results are taken from the "WallClock" information at the end of the simulation output. NAMD was launched with "charmrun ++local +p $NUM_T `which namd2` apoa1.namd".
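
A wrapper along these lines reproduces the runs above (a sketch, not the script actually used; it assumes charmrun and namd2 are on the PATH and apoa1.namd is in the working directory, and it pulls the WallClock line NAMD prints at the end of the run):

#!/bin/sh
# Run the apoa1 benchmark at each thread count and record WallClock.
for NUM_T in 8 16; do
  charmrun ++local +p $NUM_T `which namd2` apoa1.namd > apoa1_${NUM_T}.log
  echo "NUM_T=$NUM_T: `grep 'WallClock:' apoa1_${NUM_T}.log | tail -1`"
done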

HT     NUM_T    WALLCLOCK (s)
-----------------------------
ON       8        144
ON      16        128
OFF      8        144
OFF     16        158

With HT on, running 16 threads gives a ~12% speedup over 8 threads (128 s vs 144 s), and leaving HT enabled costs nothing in the 8-thread case (144 s either way).

Next up, some home-rolled multithreaded simulation code built with Intel Threading Building Blocks (TBB). All the details (number of threads, the partitioner algorithm, ...) were left to the wisdom of TBB. Note that the code requires no network I/O and fully buffers all disk writes to RAM, so your results may differ if your workload is I/O-bound.
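
As an aside, if rebooting to toggle HT is inconvenient, pinning the process to one logical CPU per physical core is a reasonable stand-in for the HT-off case. A sketch using taskset and time -- "./sim" is a placeholder for the simulation binary, and the CPU ranges assume cpu0-cpu7 sit on distinct physical cores (verify with the topology listing above):

# HT effectively "off": one logical CPU per physical core.
/usr/bin/time -p taskset -c 0-7 ./sim

# HT "on": let TBB's default scheduler use all 16 logical CPUs.
/usr/bin/time -p taskset -c 0-15 ./sim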

HT     WALLCLOCK
----------------
ON        781
OFF       902

Again, a clear win from HT for a multithreaded app: 781 vs 902, roughly a 13% reduction in wallclock.

Finally, a single-threaded version of the same code, running NUM_P concurrent processes.
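
A launcher along these lines reproduces this setup (a sketch; "./sim_st" is a placeholder for the single-threaded binary, and the reported numbers are the average of the per-job "real" times):

#!/bin/sh
# Start NUM_P copies of the single-threaded binary concurrently and
# capture each job's elapsed time; average the "real" lines afterwards.
NUM_P=${1:-8}
i=1
while [ $i -le $NUM_P ]; do
  ( /usr/bin/time -p ./sim_st > /dev/null ) 2> job_$i.time &
  i=`expr $i + 1`
done
wait
grep '^real' job_*.time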

HT     NUM_P    AVG_WALLCLOCK (over NUM_P jobs)
-----------------------------------------------
ON       8         82
ON      16        154
OFF      8         80
OFF     16        162

Interestingly, even for a single-threaded app that I thought was 100% CPU-bound, running one process per logical processor gives a ~6% throughput bump (16/154 vs 8/82 jobs per unit time), at the price of waiting nearly twice as long for the first result. Also interesting: there is a 2.5% penalty in the 8-process case with HT on (82 vs 80), probably due to inefficient scheduling. This might be mitigated by longer jobs if the inefficiency "settles out" after some time.

Implications for SGE scheduling:

Part of the reason I posted these results is the question of how to configure the number of slots that SGE assigns to these nodes. By default, SGE (5.1) assigns 16 (one per logical processor), which gives the best overall throughput but disappoints users who need timely feedback to configure their future jobs. For the time being, we are forcing SGE to assign 8 slots to these nodes and instructing users that they may spawn 16 threads if they hold all 8 slots on a particular node.
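
Concretely, the slot cap looks something like the following (a sketch -- "all.q" and "node01" are placeholder names, "smp" is a placeholder parallel environment, and qconf syntax can vary a bit between SGE releases):

# Override the autodetected slot count (16) with 8 on the Nehalem node.
qconf -aattr queue slots "[node01=8]" all.q

# Check the resulting slots list in the queue configuration.
qconf -sq all.q | grep slots

# A user who grabs all 8 slots on one node (e.g. via a fill-up PE)
# is then free to spawn 16 threads inside the job:
qsub -pe smp 8 run_job.sh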

Any suggestions, comments or criticisms of test methodology are very welcome!

----------------
Oren Elrad
Dept of Physics
Brandeis University


