High Performance Computing

HPC Cluster Hardware

Design Rationale

If you are reading this on your own desktop PC, there is a good chance that you are running several other programs at the same time as the web browser: for example a word processor, an email client and perhaps a music player in the background. This ability to run many applications at once (called multi-tasking) is a feature of modern operating systems such as Windows 7/8 and has been substantially aided by the development of modern multi-core processors such as the Intel Core i3. Most of the time everything runs smoothly, but if you attempt to run a program that does a lot of computational work in the background (for example a MATLAB or R script, or your own Fortran program), the PC may become less responsive. Attempt to run a few copies of the computational program at the same time and it may become impossible to use the PC for anything else.

In a cluster, dedicated hardware known as compute nodes is assigned to running computationally intensive programs. Users, in general, do not (and indeed should not) access these directly but instead interact with them via a special host called the head node (sometimes also called the login node). Compute nodes run only the intensive programs, and therefore nearly all of their resources (including memory and processor(s)) can be devoted to those programs, allowing them to run as quickly as possible. Another benefit is that users are unhindered by the computational tasks and can carry on with other work on the head node (or indeed on another system). The system architecture (organisation) of a typical cluster is shown below:

Compute Nodes

Compute nodes are, to an extent, similar to top-of-the-range PCs and employ commodity processors from familiar names such as Intel and AMD. Where they differ is that each compute node can contain multiple processors, and each processor can contain multiple computing elements called cores. High-end systems can have as many as 32 or 64 cores per node, although eight or twelve is a more usual figure. Another difference is that a compute node will generally contain far more memory than a PC, so that programs with very large data requirements can be accommodated. In some cases there may be as much as 128 GB of memory in a single compute node, but 4 GB per core is a more usual figure.
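As a quick illustration, the number of cores visible to a program can be queried from Python's standard library, and the memory-per-core arithmetic follows directly from the figures above (the node memory and core count below are example values, not measurements):

```python
import os

# Number of cores visible to this process; on a compute node this
# would typically report somewhere between 8 and 64.
cores = os.cpu_count() or 1
print(f"cores visible: {cores}")

# Illustrative figures from the text: a node with 128 GB of memory
# and 32 cores works out at 4 GB per core.
node_memory_gb = 128   # assumed example value
cores_per_node = 32    # assumed example value
print(f"memory per core: {node_memory_gb / cores_per_node} GB")
```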

The Head Node

The head node does not need to have as high a specification as the compute nodes and will typically be a medium-performance server similar to a web server. Importantly, the head node has a network connection to the outside world so that users can log in to it and run programs on the compute nodes. The head node also runs a batch system, such as Sun Grid Engine or Slurm, which allows users to submit their jobs to the compute nodes.
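To give a flavour of how the batch system is used, the sketch below generates a minimal Slurm job script; the job name, resource requests and program name (my_program) are placeholder assumptions rather than values from the text:

```python
# A minimal sketch of a Slurm batch script, generated from Python.
# --nodes and --ntasks-per-node request compute resources, and
# --time sets the wall-clock limit (hh:mm:ss).
script = """\
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

srun ./my_program
"""

# Write the script out; on a real cluster the user would then submit
# it from the head node with: sbatch job.sh
with open("job.sh", "w") as f:
    f.write(script)
```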

Network Infrastructure

As indicated earlier, the whole raison d'être of clusters is to help speed up the solution of computational problems. This is usually tackled by dividing the problem into smaller parts and processing these parts at the same time - in other words concurrently, or "in parallel". This is essentially a divide-and-conquer strategy. For some problems, it may be possible to process each part independently of the others (so-called embarrassingly parallel problems), but more often the calculations depend on each other and information needs to be exchanged between the processing parts (for a naive time-stepping simulation code, this could be needed at each new time step). Ultimately, it is the time taken to communicate this information which limits the overall speed-up achievable by any particular program. Mindful of this, cluster manufacturers go to great lengths to link compute nodes by a high-speed network (interconnect) to reduce communication overheads. Not only does this provide high bandwidth (up to 40 Gb/s), it also exhibits low message set-up times (called latency).
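A simple, illustrative model shows why both bandwidth and latency matter: the time to send a message is roughly the fixed set-up latency plus the message size divided by the bandwidth. The 2 microsecond latency below is an assumed, typical interconnect figure, and 5 GB/s corresponds to a 40 Gb/s link:

```python
def transfer_time(size_bytes, latency_s=2e-6, bandwidth_bytes_s=5e9):
    """Estimated time to send one message: fixed set-up latency
    plus the time to push the payload down the link."""
    return latency_s + size_bytes / bandwidth_bytes_s

# A tiny 8-byte message is almost entirely latency...
print(transfer_time(8))        # ~2.0e-06 s
# ...while a 100 MB message is dominated by bandwidth.
print(transfer_time(100e6))    # ~0.02 s
```

This is why programs that exchange many small messages are limited by latency, while those that move large blocks of data are limited by bandwidth.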

The interconnect actually comprises a number of components, namely the physical network cables and a piece of hardware called a switch. The role of the switch is to connect the compute nodes and move data between them efficiently. In some ways, the switch performs a similar role to that of a telephone exchange. If, for example, one compute node has data it needs to send to another, the switch will set up a connection between them and clear the connection when all of the data has been transferred. You can think of this in the same way as a phone call - in fact, the term switch originates in telephony. The performance of the switch is crucial to that of the whole cluster, and it often forms a significant part of the system cost.

In addition to the interconnect, other networks may link the various components of the system. A fairly low-specification Ethernet connection is usually provided between the head node and the compute nodes, allowing users to log in to the compute nodes if necessary. This network can also be used to transfer data between the nodes and the parallel filestore (see below), although in high-end systems a separate, dedicated high-speed network is often used instead.

File Storage

Application software run on clusters typically needs to access large amounts of data stored in files, and equally may create large amounts of data which need to be stored as output files. Add in the extra requirement that different programs running on the compute nodes need (near-)simultaneous access to data files, and the task soon becomes too much for an ordinary shared disk storage system to cope with. For this reason, clusters generally use a parallel filestore to hold data files, which usually supports both fast simultaneous access and a large storage capacity. Users' home filestore will usually sit on the parallel filestore, which is accessible from the head node and all of the compute nodes.