High Performance Computing

Using MATLAB on the chadwick HPC cluster

Cliff Addison and Ian C. Smith (ARC)

1. A word before diving in ...
2. Parallel computing using MATLAB built-in functions
3. Condor-style parallel computing on chadwick
4. Running MATLAB Cluster Jobs

1. A word before diving in ...

If you are looking to run MATLAB applications on the chadwick cluster (or on the Condor service for that matter), the chances are that your main motivation is to reduce the run time of the MATLAB code (although there might be other reasons such as larger memory or greater disk storage). If this is the case, then it is worth spending some time optimising your code first in order to reduce its (serial) execution time on a single core. By carefully examining your code, it may also be possible to determine which parts of it are most suited to execution in parallel on multiple cores and possibly multiple nodes.

Most programs used in science and engineering applications follow what is loosely called a "90/10" rule (sometimes "95/5") in that 90% of the run time is accounted for by just 10% of the code. The key to speeding up programs is therefore to identify the 10% of the code that is taking the majority of the run time rather than wasting attention on the other 90% that isn't. Fortunately, there is a tool called a profiler that comes with MATLAB that makes this easy.

The profiler can provide a summary of the time spent in each program function and it is possible to "drill down" into functions and their "child" functions until we eventually see the time taken by individual statements. This is much easier to see in practice than to describe in writing, so start up MATLAB, open your main script file in the editor and click the Run and Time button. Once the script has completed you will be presented with a results summary showing how many times functions are executed and how long they took to run in total.

The blue underlined text acts like web links and by clicking on these, you can "drill down" and see the analysis in greater detail. Note that the total run time when using the profiler will always be longer than without it as some overhead is introduced by the profiling software itself. This should not alter the overall picture but it is worth checking the effect that any optimisations have both with and without the profiler. For more information on how to use the MATLAB profiler, refer to the built-in MATLAB help system.
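The profiler can also be driven from the command line, which is handy in a non-graphical session. The following is a minimal sketch rather than a recommended workflow: the matrix size and the two operations being profiled are stand-ins (assumptions) for your own script or function.

```matlab
% Profile a small stand-in workload from the command line.
n = 300;                      % illustrative problem size (assumption)
A = rand(n);
B = rand(n);

profile on                    % start collecting timing data
C = A*B;                      % stand-in for your own script or function
e = eig(C + C');              % a second operation to give the profiler more to record
profile off                   % stop collecting

stats = profile('info');      % structure holding the collected results
fprintf('Profiler recorded %d function entries\n', numel(stats.FunctionTable));

% profile viewer             % uncomment in a graphical session to open the report
```

The structure returned by profile('info') holds the same data that the graphical report displays, so it can be saved to a file and examined later.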

From the profiler analysis, you should be able to see where your program is spending most of its time and therefore which code is worth optimising in order to reduce the overall execution time. There are many ways of doing this and the profiler itself will provide you with some hints (you can also email the Advanced Research Computing team at arc-support<@>liverpool<.>ac<.>uk). The most important general rule is to replace element-by-element operations in loops with matrix and vector operations wherever possible. This is because MATLAB was designed from the outset to perform matrix/vector operations efficiently and quickly (hence MatLab - Matrix Laboratory). If your code only operates on one element at a time then you will not get the full benefit of MATLAB and may as well be using "old school" languages such as C and FORTRAN 77.

It is probably easier to see this in the form of a quick example so consider this code fragment:

for i=1:n
  for j=1:n
    for k=1:n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end

where A, B and C are dense square matrices of order n x n (with C initially zero). If you have an eye for linear algebra, you may see that C is actually the matrix product of A and B, i.e. C = AB. This may not be so apparent where this fragment is buried in a large amount of more complicated code.

With n=10000, the execution time for the fragment on one of the chadwick Westmere nodes (with exclusive access) was 39,370 s (almost 11 hours). We can improve on this by recognising that the inner loop actually represents a dot product viz: A(i,:) * B(:,j) which leads to this implementation:

for i=1:n
  for j=1:n
    C(i,j) = dot(A(i,:),B(:,j));
  end
end

The run time for this example (again with n=10000) was 29,930 s (about 8 hours 20 minutes), which is a useful speed up of 1.32 (a reduction in run time of around 24%). If we recognise that this is actually a matrix product and use MATLAB's built-in functions then the coding becomes trivial:

C = A*B;

and the speed up is dramatic. This required just 188 s (~ 3 minutes) of run time using a single core (but exclusive node access) - a speed up of 209! Using all 12 cores of a Westmere node, the run time was reduced to just 17.5 s - a speed up of around 2,250 over the original.
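The three variants discussed above can be compared on a much smaller problem before committing to a long run. This is a sketch only: n = 200 is an arbitrary size chosen so that it finishes in seconds, and the timings it prints will not resemble the chadwick figures quoted here.

```matlab
% Compare the triple loop, the dot-product form and the built-in product.
n = 200;                      % small illustrative size (assumption)
A = rand(n);
B = rand(n);

C1 = zeros(n);
tic
for i = 1:n
  for j = 1:n
    for k = 1:n
      C1(i,j) = C1(i,j) + A(i,k) * B(k,j);
    end
  end
end
tLoop = toc;

C2 = zeros(n);
tic
for i = 1:n
  for j = 1:n
    C2(i,j) = dot(A(i,:), B(:,j));
  end
end
tDot = toc;

tic
C3 = A*B;
tMat = toc;

% All three should agree to within rounding error.
maxDiff = max(norm(C1 - C3, 'fro'), norm(C2 - C3, 'fro'));
fprintf('loop %.3f s, dot %.3f s, A*B %.5f s, max difference %.2e\n', ...
        tLoop, tDot, tMat, maxDiff);
```

Checking that the results agree before comparing the timings guards against an optimisation that is fast but wrong.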

Clearly this is a slightly artificial example. However, the MATLAB parfor construct can also be used to speed up execution in cases where it is not possible to optimise code using MATLAB's built-in linear algebra functions. In a parfor loop, iterations can be executed concurrently (at the same time) on different processor cores. Using parfor involves a very significant overhead so it is important to include as much computational work as possible in each parfor loop (i.e. coarse-grained parallelism). In the above case, the best place for a parfor is at the outermost loop i.e.:

parfor i=1:n
  for j=1:n
    for k=1:n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end

Using a Westmere node with 12 cores, this gave an execution time of 2,407 s - a speed up of 16.3 over the original serial version. There is a huge drop off in performance when the parfor is used in the middle loop:

for i=1:n
  parfor j=1:n
    for k=1:n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end

This required an estimated 120,000 s of run time (a slow down of roughly 3 compared with the original serial version) and it is likely that replacing the innermost (k) loop with a parfor would have meant the code ran pretty much indefinitely. It is worth noting that parfor can only be used for loops where the results of one iteration do not depend on the results of another. Another important consideration is that parfor loops cannot be nested (i.e. a parfor cannot appear inside another parfor loop).
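The independence requirement can be illustrated with a small sketch. The two loops below are invented for illustration (they are not taken from the chadwick examples): the first has iterations that do not depend on one another and so can use parfor, whereas the second is a recurrence and has to remain an ordinary for loop.

```matlab
n = 1000;

% Independent iterations: each y(i) depends only on i, so parfor is allowed.
y = zeros(1, n);
parfor i = 1:n
  y(i) = sqrt(i) * sin(i);
end

% Dependent iterations: z(i) needs z(i-1), so this must stay a serial loop.
z = zeros(1, n);
z(1) = y(1);
for i = 2:n
  z(i) = 0.5 * z(i-1) + y(i);
end

fprintf('y(n) = %.6f, z(n) = %.6f\n', y(n), z(n));
```

If no parallel pool is available, MATLAB simply runs the parfor loop serially, so the sketch can be tried on a desktop first.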

There is also another MATLAB parallel construct called spmd which can be used to implement a style of parallel programming called Single Program Multiple Data, where each worker runs the same block of code on a different part of the data, determined by its worker index. Full details of parfor and spmd are available in the MATLAB built-in help system and there are a couple of examples in a later section.

A summary of the timings obtained with different types of parallelism is given below:

Serial original
code          time/s   speed up
original      39,370   1
dot product   29,930   1.32

parfor (element-wise)
cores   time/s   speed up over serial*   relative speed up
2       15,641   2.52                    1
4       7,575    5.20                    2.06
8       3,624    10.86                   4.32
12      2,407    16.36                   6.50
(* it is likely that these "super-linear" speed ups are down to faster overall memory access when running on multiple cores).

parfor (dot product)
cores   time/s   speed up over serial   relative speed up
2       9,495    4.15                   1
4       4,444    8.86                   2.13
8       2,078    18.95                  4.57
12      1,629    24.17                  5.82

MATLAB matrix multiply
cores   time/s   speed up over serial   relative speed up
1       188      209                    1
2       97       406                    1.94
4       48       820                    3.92
8       25       1575                   7.52
12      17.5     2249                   10.7
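The relative speed up column is simply the run time on the smallest core count divided by the run time on the core count in question; dividing that by the number of cores gives the parallel efficiency. The short sketch below repeats the calculation for the built-in matrix multiply timings above (the efficiency figures are derived here and do not appear in the tables).

```matlab
% Timings for the built-in matrix multiply, copied from the table above.
cores = [1 2 4 8 12];
timeS = [188 97 48 25 17.5];

relSpeedUp = timeS(1) ./ timeS;      % speed up relative to one core
efficiency = relSpeedUp ./ cores;    % fraction of ideal (linear) speed up

for k = 1:numel(cores)
  fprintf('%2d cores: speed up %5.2f, efficiency %3.0f%%\n', ...
          cores(k), relSpeedUp(k), 100 * efficiency(k));
end
```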

The amount of parallelism (and therefore potential speed up) available through parfor and spmd is limited by the number of cores on a node since our MATLAB installation has no mechanism for distributing parfor/spmd iterations across multiple nodes. On the Westmere nodes, there are a total of twelve cores and on Sandybridge sixteen cores. There are also 128 cores available on the large memory node. It is possible, however, to get more parallelism by splitting jobs up in a similar way to that used on the Condor service (subject to there being enough spare cores available). This is described more fully in a later section.

2. Parallel computing using MATLAB built-in functions

Applications that make extensive use of predefined matrix operations (e.g. matrix factorisation or eigenvalue computations) can exploit parallelism effectively via MATLAB's built-in multi-threaded support simply by requesting more than one core per job. MATLAB detects the number of cores available to it automatically and defines its multi-threaded support accordingly.
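A quick way to see how many computational threads MATLAB believes it can use is the maxNumCompThreads function. The sketch below queries it and times an eigenvalue computation; the matrix size is an arbitrary choice, and how the thread count relates to the cores requested for a job on chadwick is not shown here.

```matlab
% Query the thread limit and time a multi-threaded built-in operation.
nThreads = maxNumCompThreads;         % current upper limit on computational threads
fprintf('MATLAB may use up to %d computational thread(s)\n', nThreads);

n = 500;                              % illustrative size (assumption)
A = rand(n);
A = A + A';                           % make the matrix symmetric

tic
e = eig(A);                           % built-in operation that can use several threads
tEig = toc;
fprintf('eig on a %d x %d matrix took %.3f s\n', n, n, tEig);
```

Running the same fragment in jobs that request different numbers of cores gives a rough idea of how well a particular built-in operation scales.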

In addition, MATLAB provides two other mechanisms for coarser grain parallel operations: the parfor and spmd constructs.

It is beyond the scope of this note to fully describe these features and more details can be found in the MATLAB documentation on the Parallel Computing Toolbox. Nevertheless, some simple test programs which demonstrate a few features of parfor and spmd are available on chadwick in /opt/software/local/MATLAB_Examples/parallel-runs. These examples are based upon the "classical" MPICH demonstration program of estimating π by numerically integrating 4.0/(1.0 + x*x) over [0,1] using the mid-point rule. The relevant MATLAB code fragment to do this serially is:

h = 1/n;
sum = 0.0;
for i=1:n
  x = h*(i - 0.5);
  sum = sum + 4.0/(1.0 + x*x);
end
approx1 = h*sum;

Note that the loop to perform the sum can be broken into C sub-loops using C cores; each sub-loop can then be computed in parallel, with a global reduction operation at the end to produce the overall sum. This example is purely for illustration purposes as there is far too little work done inside the loop for a parfor/spmd replacement to be used efficiently in practice.

The parfor construct detects the need for local-loop operations followed by global reduction operations automatically, so the above code fragment becomes a parallel loop merely by changing for to parfor, viz:

h = 1/n;
sum = 0.0;
parfor i=1:n
  x = h*(i - 0.5);
  sum = sum + 4.0/(1.0 + x*x);
end
approx1 = h*sum;

The spmd construct also provides a means to make this loop parallel as shown in the following code fragment:

h = 1/n;
p = gcp; % Get current parallel pool information
spmd
  sum = 0.0;
  for i=labindex:numlabs:n
    x = h*(i - 0.5);
    sum = sum + 4.0/(1.0 + x*x);
  end
  myapprox = h*sum;            % this worker's share of the integral
  piApprox = gplus(myapprox);  % global sum over all workers
end
approx1 = piApprox{1};   % 1st element holds value on worker 1.

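Notice there are some predefined variables and operations. Each process knows the total number of processes in the parallel pool via numlabs and its own process number (starting from 1) via labindex. The global reduction gplus adds together the values from all of the processes and returns the total on each of them, so that outside the spmd block piApprox holds one copy of the result per worker (hence the {1} index used to pick out the value from worker 1).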
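The way the spmd loop shares out the work can be checked without a parallel pool. The sketch below emulates it serially: numWorkers and w are hypothetical stand-ins for numlabs and labindex, and the final call to the built-in sum plays the role of the global reduction. It must be run in a workspace that does not contain a variable called sum.

```matlab
% Serial emulation of the cyclic work split used in the spmd fragment.
n = 100000;                   % number of mid-point intervals (assumption)
numWorkers = 4;               % hypothetical pool size, stands in for numlabs
h = 1/n;

partial = zeros(1, numWorkers);
for w = 1:numWorkers          % w stands in for labindex
  s = 0.0;
  for i = w:numWorkers:n      % worker w takes every numWorkers-th interval
    x = h*(i - 0.5);
    s = s + 4.0/(1.0 + x*x);
  end
  partial(w) = h*s;           % this worker's share of the integral
end

piApprox = sum(partial);      % what the global reduction would return
fprintf('pi is approximately %.12f (error %.2e)\n', piApprox, abs(piApprox - pi));
```

Each interval is visited by exactly one of the emulated workers, which is why the partial results add up to the same estimate as the serial loop.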

Whilst demonstrating the parallel operations, this example makes poor use of MATLAB because built-in functions tend to be more efficient than loops. A better approach (and one recommended by MathWorks in their sample code) generally is to use the integral function as shown in the following code fragment:

p = gcp;
integrand = @(x) 1./(1+x.*x);
spmd
    a = (labindex - 1)/numlabs;     % lower limit of this worker's sub-interval
    b = labindex/numlabs;           % upper limit of this worker's sub-interval
    myIntegral = integral(integrand, a, b);
    piApprox = gplus(myIntegral);
end
approx1 = 4.0 * piApprox{1};   % 1st element holds value on worker 1.

In this case, MATLAB also has a predefined function for 4.0/(1.0 + x*x).
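For comparison, the same estimate can be written serially without any explicit loop. This sketch shows a vectorised mid-point rule next to a single call to integral over the whole interval; the value of n is an arbitrary choice. It uses the built-in sum function, so it should not be run in a workspace where a variable called sum (as in the loop fragments above) still exists.

```matlab
n = 100000;                                   % number of mid-point intervals (assumption)
h = 1/n;

% Vectorised mid-point rule: evaluate the integrand at all mid-points at once.
x = h*((1:n) - 0.5);
approxVec = h * sum(4.0 ./ (1.0 + x.^2));

% Adaptive quadrature over the whole interval with the built-in integral function.
approxInt = integral(@(t) 4.0 ./ (1.0 + t.^2), 0, 1);

fprintf('mid-point: %.12f, integral: %.12f, pi: %.12f\n', approxVec, approxInt, pi);
```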

Scripts for exploring parallelism are available in the parallel-runs example directory on the chadwick cluster: /opt/software/local/MATLAB_Examples/parallel-runs

These are also available as MATLAB functions here: /opt/software/local/MATLAB_Examples/parallel-runs/functions

These scripts can be run directly as M-files using matlab_submit, or the function equivalents can be compiled using matlab_build and the resulting executables run via matlab_submit. It may be worth trying both to see if there is any performance gain in using the compiled version (we have not seen any in our tests though!).

The matlab_submit command has three options which determine where and how the job will run. The --cores_per_job option controls how many cores will be used to run the job on a single node (up to a maximum of 12 or 16 on the Westmere and Sandybridge nodes respectively and 128 on the large memory node). MATLAB seems not to work well when sharing a node with other jobs so it is a good idea to take up all the cores on a node.

The --memory_gb option can be used to set the maximum available memory per core that a job will require. If you have a serial (single core) job it is a good idea to set the memory limit explicitly to avoid the job being killed by the batch scheduler for taking too much memory. To gain exclusive node access, specify all the memory for an entire node (48GB for the Westmere and 64GB for the Sandybridge nodes). Finally the --runtime option specifies the maximum run time for the job after which it will be killed automatically (default value 8 hours). The following example shows a job which requires 8 cores, 4 GB of memory per core (i.e. 32 GB in total) and a maximum run time of 16 hours:

$ matlab_submit --M_file=faster_pi.m --cores_per_job=8 --memory_gb=4 --runtime=16h

Note that a 'd' suffix can be used to specify run times in days. To use all of the default values a shortened command can be employed:

$ matlab_submit --M_file=faster_pi.m
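When choosing a value for --memory_gb, a rough estimate of the memory taken by the main arrays is a useful starting point. The sketch below does this for three dense double precision matrices of order 10000 (as in the matrix product example) shared across 8 cores; it ignores MATLAB's own overhead and any temporary arrays, and rounding up to a whole number of gigabytes is an assumption rather than a documented requirement of matlab_submit.

```matlab
% Rough memory estimate for three dense n x n double precision matrices.
n = 10000;                    % matrix order, as in the matrix product example
bytesPerDouble = 8;
nMatrices = 3;                % A, B and C
coresPerJob = 8;              % value to be passed to --cores_per_job (assumption)

totalGB = nMatrices * n^2 * bytesPerDouble / 1e9;
perCoreGB = totalGB / coresPerJob;
memoryGbOption = ceil(perCoreGB);   % round up to a whole number of GB (assumption)

fprintf('total %.1f GB, %.2f GB per core, so request at least %d GB per core\n', ...
        totalGB, perCoreGB, memoryGbOption);
```

In practice it is sensible to leave generous headroom above such an estimate, since temporaries created during a calculation can easily double the requirement.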

You can monitor the progress of the job by using the qstat command. In the state (fifth) column, qw means that the job is queued and waiting, r that it is running and E that an error has occurred. When the job disappears from the qstat output, the job has finished.

3. Condor-style parallel computing on chadwick

3.1 Introduction

The ARC Condor pool provides a very useful means of running large numbers of MATLAB jobs in parallel leading to impressive speed ups in time to completion. However, since jobs are run on standard PCs which are shared with ordinary logged in users, there are applications which cannot be run effectively on Condor. This could be due to large memory or storage requirements or long run times. In each of these cases though, the chadwick cluster can provide a useful alternative. The same overall process applies to running Condor-style jobs on chadwick as on the actual Condor pool (with a few small but important changes) and is described fully here.

The application will need to perform three main steps:

  1. Create the initial input data and store it to files which can be processed in parallel.
  2. Process the input data in parallel using the cluster and write the outputs to corresponding files.
  3. Collate the data in the output files to generate the overall results.

All of the scripts for the examples below can be found on chadwick in /opt/software/local/MATLAB_Examples/array-job and equivalent functions in /opt/software/local/MATLAB_Examples/array-job/functions.

Since very large numbers of jobs can potentially run concurrently on the ARC Condor pool, it is necessary to use standalone executables rather than M-files on the pool to avoid overburdening the MATLAB license server. There is little chance of this happening on chadwick and hence it is possible to run M-files directly without needing to compile them first. It may however be worth trying the standalone executable version to see if it is any faster and for this reason the instructions for this have been included below (note that our initial tests have not produced any speed up though!).

3.2 Creating the M-files

As a trivial example, to see how the cluster can be used to run MATLAB jobs in parallel, consider the case where we wish to form the sum of p matrix-matrix products, i.e. calculate C where:

C = \sum_{i=1}^{p} A_i B_i

and A_i, B_i and C are square matrices of order n. It is easy to see that the p matrix products could be calculated independently and therefore potentially in parallel.

In the first step we need to store A and B to MATLAB data files which can be distributed by the job scheduler for processing. Each job will then need to read A and B from file, form the product AB and write the output to another data file. The final step will sum all of the partial sums read from the output files to form the complete sum C.

The first step can be accomplished using a M-file such as the one below (initialise.m) (this just fills the input matrices with random values but in a more realistic application this data would come from elsewhere):

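function initialise(n)
  for index=0:9
    A = rand(n); B = rand(n);   % random n x n input matrices
    filename = strcat( 'input', int2str( index ), '.mat' );
    save( filename, 'A', 'B');
  end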

The elements of A and B are given random initial values and are saved to files using the MATLAB 'save' command. Condor needs the input files to be indexed from 0:p-1 so to maintain consistency, the above code generates ten input files named input0.mat, input1.mat .. input9.mat. Once copied into a directory that you own, this M-file can be run on the cluster using:

$ qrsh
$ matlab_run
to start a non-graphical, interactive session on a compute core. Simply invoke the initialise function with your choice of n. If you wish to skip this step, there are ten predefined mat files generated with n=10 available in the examples directory /opt/software/local/MATLAB_Examples/array-job on chadwick. The file initialise10.m can be used to create a stand-alone executable to create ten 10 by 10 matrices named as above.

The second script will need to form the matrix-matrix products and will eventually be run as a standalone application on the cluster. A suitable M-file (product.m) is:

  load input.mat;
  C = A*B;
  save( 'output.mat', 'C' );

(There is nothing special about the filenames input.mat and output.mat - they could be called anything but it's a good idea to stick to the MATLAB standard of using .mat as an extension).

The M-file does not need to manipulate the filenames to give them unique indexes since this will be taken care of by the job submission tools.

The final step is to collect all of the output files together to form the sum. This can be achieved using another M-file (collect.m) such as this:

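function collect(n)
  S = zeros(n);
  for index=0:9
    filename = strcat( 'output', int2str( index ), '.mat' );
    load( filename, 'C' );
    S = S + C;
  end
  % Show the top left portion of S (the block size here is an illustrative choice)
  m = min( n, 5 );
  disp( S(1:m,1:m) );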

This loads each output file in turn and forms the sum of the matrix-matrix products in the S variable. Again there is a file collect10.m that can be used to create a stand-alone executable that collects ten 10 by 10 matrices into a final sum.
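Before submitting anything to the cluster, the three steps can be rehearsed serially in a scratch directory to confirm that the files are named and combined as expected. The following sketch does this under some assumptions: it uses a temporary directory, small matrices (n = 10) and p = 10 jobs, and it runs the "jobs" one after another in a loop rather than through the job submission tools.

```matlab
% Serial rehearsal of the initialise / product / collect workflow.
n = 10;                                % matrix order (assumption)
p = 10;                                % number of jobs (assumption)
workDir = tempname;                    % scratch directory for the test files
mkdir(workDir);

% Step 1: create the indexed input files input0.mat .. input9.mat.
for index = 0:p-1
  A = rand(n);
  B = rand(n);
  save(fullfile(workDir, sprintf('input%d.mat', index)), 'A', 'B');
end

% Step 2: process each input file independently (the part the cluster runs in parallel).
for index = 0:p-1
  in = load(fullfile(workDir, sprintf('input%d.mat', index)));
  C = in.A * in.B;
  save(fullfile(workDir, sprintf('output%d.mat', index)), 'C');
end

% Step 3: collect the outputs and compare with a direct calculation.
S = zeros(n);
Sdirect = zeros(n);
for index = 0:p-1
  out = load(fullfile(workDir, sprintf('output%d.mat', index)), 'C');
  S = S + out.C;
  in = load(fullfile(workDir, sprintf('input%d.mat', index)));
  Sdirect = Sdirect + in.A * in.B;
end
maxErr = max(abs(S(:) - Sdirect(:)));
fprintf('largest difference from the direct sum: %.2e\n', maxErr);

rmdir(workDir, 's');                   % tidy up the scratch directory
```

Using the functional form of load (in = load(...)) avoids variables appearing in the workspace implicitly, which makes this kind of test easier to follow.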

3.3 Creating a standalone application (optional)

As indicated earlier, it is not necessary to compile the M-file (or files) into a standalone executable to run on chadwick although you may want to try this to see if the standalone executable version is quicker (experience suggests not). You can create a standalone executable by using the following command which will submit a job that compiles the M-file on one of the compute cluster cores:

$ matlab_build product.m

Note that the product.m file needs to be a function rather than a script. You can find the code for this in /opt/software/local/MATLAB_Examples/array-job/functions. The command will return a JOB_ID and on completion the standalone executable product should have been created. You can confirm that your job has started execution using the qstat command (see below) and, once it is running, you can monitor the progress by examining the contents of the log file matlab_build.log.o<JOB_ID>.


To create a standalone executable from multiple M-Files, first place the "main" M-file in a directory on the login node and create another directory called dependencies below it. Then place all of the other M-files (i.e. the ones containing functions used by the main M-file) in the dependencies directory. Be careful not to include any other files in the dependencies directory or these will be "compiled-in" as well. Once this is in place run:

$ matlab_build <MyMainMfile>
in the directory containing the main M-file.

3.4 Creating the job submission file

Each array job requires a job submission file, which is similar to those used for Condor and contains at least some of the following attributes: M_file or executable, total_jobs, indexed_input_files, indexed_output_files, input_files, indexed_stdout, indexed_stderr and cores_per_node.

(Only the executable (or M_file) and total_jobs attributes are compulsory but realistic applications will at the very least need the input files specifying.) At present there is no default runtime. Certainly for initial runs, a run time of a few hours is a sensible check to avoid wasting resources. The maximum run time per job can be set in hours (e.g. 48h) or days (e.g. 2d).

A submission file such as the one below can be used for the sum of products example:

indexed_input_files = input.mat
indexed_output_files = output.mat
M_file = product.m
indexed_stdout = logfile
cores_per_node = 1
total_jobs = 10

(Save this file under the filename product.sub - an "extension" is helpful for humans to avoid confusion with script files or an executable, which have no special extension in Linux or UNIX).

The M-file is specified in the M_file line; if you want to use a standalone executable, this can be specified by using the executable attribute instead. The lines starting indexed_input_files and indexed_output_files specify the input and output files which differ for each individual job. (Similarly for the indexed_stdout and indexed_stderr specifications.) The total number of jobs is given in the total_jobs line. The underlying job submission processes will ensure that each individual job receives the correct input file (input0.mat .. input9.mat) and that the output files are indexed in a corresponding manner to the input files (e.g. output file output1.mat will correspond to input1.mat).
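It can save a failed submission to check that every indexed input file actually exists before submitting the array job. The sketch below builds the expected names from the base name given in indexed_input_files by inserting the index between the name and the extension; the variable names are illustrative, and the check itself is a suggestion rather than part of the job submission tools.

```matlab
% List the indexed input files an array job will expect and report any that are missing.
totalJobs = 10;                        % value of total_jobs in the submission file
inputBase = 'input.mat';               % value of indexed_input_files

[~, name, ext] = fileparts(inputBase); % split into 'input' and '.mat'
inputNames = cell(1, totalJobs);
for index = 0:totalJobs-1
  inputNames{index+1} = sprintf('%s%d%s', name, index, ext);
end

% exist(...,'file') returns 2 when a file of that name is found.
isMissing = cellfun(@(f) exist(f, 'file') ~= 2, inputNames);
fprintf('%d of %d indexed input files are missing\n', sum(isMissing), totalJobs);
```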

It is also possible to provide a list of non-indexed files (i.e. files that are common to all jobs), for example:

input_files = common1.mat,common2.mat,common3.mat
This is particularly useful if the common (non-indexed) files are relatively large.

There are a few important points to remember:

  1. Executables in Linux or UNIX do not have a .exe "extension" as in Windows. The standalone executable will instead have the same name as the M-file minus the .m part.

  2. The scheduler on the cluster (called Grid Engine) maintains a unique task ID for each individual job. By default, these are numbered 1..N rather than 0..N-1 as in Condor, however the MATLAB job submission tools ensure that the indexed filenames still have indices in the range 0..N-1 to ensure compatibility with Condor.

  3. For compatibility, the indexing scheme for both input and output files is the same as for Condor with the index inserted between filename and "extension" e.g. input0.mat, input1.mat, ... input<N-1>.mat. This is despite the fact that UNIX does not really employ filename extensions in the same way as Windows.

  4. The indexed_output_files attribute is optional as with Condor. If it is omitted, then all of the output files are returned to the job submission directory.

  5. The optional cores_per_node attribute is special to the cluster and allows users to specify how many processor cores will be assigned to each individual job. This is useful when running jobs with large memory requirements or jobs that make use of multi-threading parallelism. The value of cores_per_node (C) must currently be in the range 1 to 16, i.e. 1 ≤ C ≤ 16. The default is to use one core per job.

  6. For jobs that use M-files rather than a standalone executable, place all of the M-files in one directory and specify the main M-file using the M_file attribute in the job submission file (note the underscore rather than a hyphen).

3.5 Moving on to other applications

The files presented above can be used as templates in order to get your own MATLAB applications to run on the cluster. The following series of steps is suggested as a way of tackling the problem.

  1. Determine which part of the MATLAB application is taking the majority of the compute time (the MATLAB profiler is useful here) and place this code into a well defined function/script.

  2. Create an M-file for the function above (say process.m). Create two other M-files for the code to be executed both before process.m (say initialise.m) and after (say collect.m).

  3. Configure the M-files so that process.m reads its input variables from file and writes its output to file. It should now be possible to run the three M-files independently (initialise.m followed by process.m followed by collect.m).

  4. Run the M-file on the cluster using:
    $ matlab_submit m_file_job_description_file
  5. (Optional) Build the standalone application using:
    $ matlab_build M-file
  6. Submit the standalone jobs using
    $ matlab_submit job_description_file

For more information on how to run multiple MATLAB jobs see http://arc.liv.ac.uk/chadwick/array.html.