This discusses running a sane variety of Linux ‘container’ under SGE. It covers the simple single-node case but also running multi-node MPI. The latter isn’t as straightforward as the propaganda suggests.

Container Sanity, and Otherwise

“I want to run Docker containers under SGE.” No, you don’t — really; or at least your clued-up system manager doesn’t want that. It just isn’t sane on an HPC system, or probably more generally under resource managers similar to SGE.

The short story is that (at least currently) Docker uses a privileged system daemon, through which all interaction happens, and using it is essentially equivalent to having root on the system. Other systems, like rkt, at least make that explicit, e.g. starting the containers from systemd. Apart from security issues, consider the implications for launching, controlling, and monitoring containers[1] compared with just exec’ing a program as normal in an appropriate resource allocation, especially if distributed parallel execution is involved.

Apart from requiring privileges, most ‘container’ systems at least default to a degree of isolation inappropriate for HPC, and may have other operating constraints (e.g. running systemd ☹). What we really want is the moral equivalent of secure, unprivileged chrooting to an arbitrary root, with the relevant filesystem components and other resources visible.

Fortunately that exists.[2] The initial efforts along those lines were NERSC’s CHOS and Shifter, which require particular system configuration support and provide only a limited set of options for running things.[3] Now Singularity provides the chroot-y mechanism we want, though the documentation doesn’t put it like that. It allows executing user-prepared container images as normal exec’ed user processes.[4]

As an example, our RHEL6-ish system doesn’t have an OS package of Julia, but has a julia executable which is a Debian container with julia as the entry point. It’s usually easy enough to build such a container from scratch, but if you insist on using a Docker one — which at least doesn’t start services — you can convert it to the Singularity format reasonably faithfully. The Julia example was actually generated (with a running docker daemon) using

sudo singularity import -t docker -f julia /usr/local/bin/julia

Singularity Version

Note that the examples here use the 2.x branch of the Singularity fork from https://github.com/loveshack/singularity (rpms). It’s an older version than the latest at the time of writing, but with a collection of fixes (including for security) and enhancements relevant to this discussion (such as using $TMPDIR rather than /tmp) which unfortunately aren’t acceptable ‘upstream’. There are rpms under https://copr.fedorainfracloud.org/coprs/loveshack/livhpc/. Also, this assumes an RHEL6-like system as a reasonable lowest level of Linux features likely to be found in current HPC systems; newer kernels may allow other things to work. If you run Debian, SuSE, or another distribution, that’s probably perfectly fine, but you’ll need to translate references to package names, in particular. Of course, Singularity is Linux-specific and I don’t know whether any of this would carry over to, say, varieties of BSD or SunOS.

Running in SGE

Let’s turn to SGE specifics. Beware that this isn’t perfect; there are rough edges, particularly concerning what the container inherits from the host, but it’s usable with some fiddling. It demonstrates that the portability claimed for containers is really a myth in a general HPC context, if that isn’t obvious.[5]

Single-node

Since Singularity containers are executable, in simple cases like the julia one (with an appropriately-defined container entry point) you can just run them under SGE like normal programs, provided the Singularity runtime is installed on the execution host.[6] You can also run things inside them other than via the entry point:

qsub -b y singularity exec $(which julia) command args

Alternatively, we can avoid explicit mention of the container by specifying the shell SGE uses to start jobs, assuming the container has the shell as the entrypoint. That can be done by linking /bin/sh to /singularity (the execution entrypoint) in the container, either while building it or subsequently:

echo 'ln -sf /bin/sh /singularity' | singularity mount container /mnt
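The effect of that link can be illustrated outside any container: a symlink to /bin/sh is the shell for all practical purposes, so SGE can treat the container itself as the job shell. A toy demonstration (no container involved; the shepherd invokes the ‘shell’ for a binary job with -c just like this):

```shell
#!/bin/sh
# Make a symlink to /bin/sh and invoke it the way SGE's shepherd invokes
# a shell for a binary job, i.e. with -c.
tmp=$(mktemp -d)
ln -sf /bin/sh "$tmp/singularity"
"$tmp/singularity" -c 'echo hello via the entry point'   # → hello via the entry point
rm -rf "$tmp"
```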

Then we can use

qsub -b y -S container command

so that the shepherd launches container -c command. Using the container as a shell with SGE is the key to the method of running MPI programs from containers below. The -S could be derived from a dummy resource request by a JSV, or you can put it in a .sge_request in a working directory and see, for example

$ cat x.o$(qsub -cwd -terse -sync y -b y -N x lsb_release -d)
Description:    Scientific Linux release 6.8 (Carbon)
$ cd debian
$ cat x.o$(qsub -cwd -terse -sync y -b y -N x lsb_release -d)
Description:    Debian GNU/Linux 8.6 (jessie)

after doing (for a container called debian.img)

cd debian; echo "-S $(pwd)/debian.img" > .sge_request

Distributed MPI

That’s for serial — or at least non-distributed — cases. The distributed parallel case is more complicated. The discussion here is confined to MPI, but presumably other distributed systems with SGE tight integration would be similar.

There are clearly two ways to operate:

  1. Execute mpirun[7] outside the container, i.e. use the host’s launcher, treating the container as a normal executable;

  2. Run it from within the container, i.e. the container’s installation of it.

Host-launched

If the container has an MPI installation that is compatible with the host’s — which in the case of Open MPI typically means just the same version — then you can probably run it as normal under mpirun, at least if the container has the components necessary to use the MPI fabric, including device libraries like libmlx4. MPI compatibility is necessary because the library inside the container has to communicate with the orted process outside it in this mode.

Container-launched

Now consider launching within the container. You might think of installing Singularity in it and using something like

qsub -pe <pe> <slots> -b y singularity exec container mpirun container

but that won’t work. Without privileges you can’t nest Singularity invocations, since sexec, which launches a container, is setuid and setuid is disabled inside containers.

However, it is still possible to run a distributed job with an MPI implementation in the container independently of the host’s, similarly to the serial case. It’s trickier with more rough edges, but provides a vaguely portable solution if you can’t dictate the host MPI.

The container must be prepared for this. First, it must have enough of an SGE installation to run qrsh for launching on the slave nodes. That means not only having the binary, but also having SGE_ROOT etc. defined, with necessary shared directories mounted into it (or the necessary files copied in if you don’t use a shared SGE_ROOT). In the shared-everything system here using CSP security, they’re used like this in singularity.conf:

bind path maybe = /opt/sge/default
bind path maybe = /var/lib/sgeCA

where the maybe (in the Singularity fork) suppresses warnings for containers lacking the mount point.[8]

There are two aspects to starting MPI-related processes inside the container. The mpirun process is started as the SGE job, either by supplying the command line as arguments to the container executable or by invoking singularity exec. When mpirun then tries to start processes on remote hosts with qrsh from inside the container, we need a hook to start them in the container too. qrsh -inherit doesn’t use the shell — perhaps it should — but we can use QRSH_WRAPPER instead.
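The effect of QRSH_WRAPPER can be demonstrated without SGE at all: qrsh -inherit simply execs $QRSH_WRAPPER with the remote command as its arguments. A toy stand-in for the container entry point (the wrapper.sh name and orted arguments are illustrative):

```shell
#!/bin/sh
# Stand-in wrapper: just reports what it would exec inside the container
# (no SGE or Singularity required for this demonstration).
cat > wrapper.sh <<'EOF'
#!/bin/sh
echo "container would exec: $*"
EOF
chmod +x wrapper.sh

# What qrsh -inherit effectively does when QRSH_WRAPPER is set:
QRSH_WRAPPER=./wrapper.sh
"$QRSH_WRAPPER" orted --daemonize   # → container would exec: orted --daemonize
```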

There are two ways of setting this up. We could define a new parallel environment, but it’s possible to do it simply as a user. Rather than supplying the mpirun command line as an argument of the container, we can use qsub's -S switch to invoke the container — albeit with even more container preparation. The following diagram shows how it works.

[figure: singularity mpi]
Figure 1. Control flow for MPI launched within the container

Here’s a demonstration of running an openmpi-1.10 Hello World in a CentOS 7 container (c7) on Scientific Linux 6 hosts which have openmpi-1.6 as the system MPI (for historical reasons), where num_proc arranges two nodes on the heterogeneous system. I haven’t yet debugged the error message.

$ cat mpirun.o$(qsub -terse -sync y -b y -pe mpi 14 -l num_proc=12 -S ~/c7 mpirun mpi-hello-ompi1.10)
Hello, world, I am 1 of 2, (Open MPI v1.10, package: Open MPI mockbuild@ Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015, 121)
  [...]
Hello, world, I am 13 of 14, (Open MPI v1.10, package: Open MPI mockbuild@ Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015, 121)
ERROR  : Could not clear loop device /dev/loop0: (16) Device or resource busy

This is the sort of thing you need to install as /singularity in the container to make that work [download link].[9] Note that it won’t work the same for jobs run with qrsh, which doesn’t support -S, so you’d have to run the container explicitly under singularity and set QRSH_WRAPPER.

/singularity example
#!/bin/sh

# Set up for MPI.  An alternative to get "module" defined is to use
# "/bin/sh -l" if .profile is sane when used non-interactively, or
# to pass module, MODULESHOME, and MODULEPATH in the SGE environment.
. /etc/profile
module add mpi

# Assume submission with qsub -S <this container>; then we need
# QRSH_WRAPPER defined for starting remote tasks.  This wouldn't work if
# SHELL was a real system shell, but we can't check it actually is a
# container since it may not be in our namespace.
QRSH_WRAPPER=${QRSH_WRAPPER:-$SHELL}; export QRSH_WRAPPER

# We might be called as either "<cmd> -c ..." (for job start), or as
# "<cmd> ..." (via qrsh).
case $1 in
-c) exec /bin/sh "$@";;
*) exec "$@";;
esac

Of course, instead of fixing /singularity like that, you could use two slightly different shell wrappers around singularity exec for the shell and QRSH_WRAPPER, which would work for any container in which you can run /bin/sh, but you’d have to pass the file name of the container in the environment.
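A minimal sketch of that wrapper idea follows. The SGE_CONTAINER variable name and image path are assumptions, and the singularity invocation is wrapped in a dry-run function so the sketch can be exercised without a container runtime; in real use the function body would be the exec line shown in the comment. The QRSH_WRAPPER variant would drop the /bin/sh and exec the arguments directly.

```shell
#!/bin/sh
# Wrapper used as "qsub -S <this file>", with the image named in the
# environment rather than baked into /singularity as the entry point.
SGE_CONTAINER=${SGE_CONTAINER:-/containers/c7.img}   # hypothetical variable
QRSH_WRAPPER=${QRSH_WRAPPER:-$0}
export QRSH_WRAPPER SGE_CONTAINER

run_in_container () {
  # Real version:  exec singularity exec "$SGE_CONTAINER" /bin/sh "$@"
  # Dry-run form so the sketch runs without Singularity installed:
  echo "singularity exec $SGE_CONTAINER /bin/sh $*"
}
run_in_container -c 'mpirun mpi-hello'   # → singularity exec /containers/c7.img /bin/sh -c mpirun mpi-hello
```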

You could think of using a privileged PE start_proc_args to modify (a copy of?) the container on the fly. That would insert the SGE directories, /singularity, and, perhaps, a static qrsh; it doesn’t seem too clean, though.

There shouldn’t be any dependence on the specific MPI implementation used, providing it is tightly integrated. However, some release of Open MPI 2 has/will have some undocumented(?) knowledge of Singularity in the schizo framework. That messes with the search path and the Singularity session directory for containers it starts. In this case, neither seems useful. The runtime won’t detect Singularity anyhow if the container is invoked directly, rather than with singularity exec, so the session directory will be directly in the TMPDIR created by tight integration, rather than Open MPI’s sub-directory chosen by schizo.

Summary

To run MPI programs, the container has to be compatible with the host setup: either by using a compatible MPI installation and taking advantage of the host mpirun; by being prepared with support for running qrsh from within it and a suitable entrypoint to support the SGE -S and the QRSH_WRAPPER used to launch within it; or by using shell wrappers to avoid the entrypoint. Obviously you could use a JSV to modify the job parameters on the basis of a pseudo-complex, e.g. -l singularity=debian. As well as setting parameters, the JSV could do things like checking the container for the right mount points.
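The mapping step such a JSV would perform can be kept as simple as this sketch. The resource name and the /containers image directory are assumptions; a real JSV would wire this into the jsv_* callbacks provided by $SGE_ROOT/util/resources/jsv/jsv_include.sh rather than echoing.

```shell
#!/bin/sh
# Map a pseudo-resource request like "singularity=debian" to the image
# path a JSV would substitute as the -S argument.
map_container () {
  case $1 in
    singularity=*) echo "/containers/${1#singularity=}.img" ;;
    *) return 1 ;;
  esac
}
map_container singularity=debian   # → /containers/debian.img
```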

I don’t have operational experience to recommend an approach from the various possibilities.


1. ‘Container’ means different things to different people and historically. Originally it was an operating system concept, like Solaris zones or OpenVZ instances, to contain processes in the sense of ‘isolate’, rather than a sort of packaging of applications and the necessary support, or an instantiation of such packaging. There’s a lot more to say, with various other points of comparison.
2. In addition to the following, there’s proot. It doesn’t seem to be maintained, and it uses ptrace, which makes the syscalls slow and doesn’t have a very good security record on Linux, but it may still be useful. It also uses a directory à la chroot, rather than a filesystem image qua executable. However, it supports roots for different architectures via QEMU. It might allow building Singularity images without privileges, but I haven’t tried, and suspect that won’t work.
3. Shifter has been said to depend on SLURM, but it was clear how to provide SGE support when I started down that track.
4. Whoever builds the container does require root access somewhere (e.g. their desktop) to build it, and there’s a setuid program involved in the container startup procedure which handles the temporary privilege escalation necessary for mounts and namespace setup.
5. For instance, if you’re trying to run an MPI program in a container that doesn’t support the relevant interconnect — which might require a proprietary library — it clearly isn’t going to work. Consider also MPI-IO support, which is likely to require filesystem libraries, especially for a userspace filesystem like PVFS (OrangeFS).
6. It can’t be packaged into the container image file as well as the real payload, at least because of the setuid component since setuid is disabled in it.
7. Other launchers may be available.
8. The cluster on which that was run actually has the spool directory mounted on a separate filesystem linked into the cell directory, so has an extra bind mount for that.
9. Use singularity shell -w or singularity mount as root to modify the container contents.