[GE users] All OpenMPI processes run on same node

bwillems bart at atipa.com
Wed Oct 20 22:23:36 BST 2010

I'm having trouble with SGE/OpenMPI on a Rocks cluster: all processes of
a parallel job tend to run on the same node. I searched the forums, but none
of the past posts on this solve my problem. I compiled OpenMPI with

# ./configure --prefix=/share/apps/openmpi/gcc --enable-static \
    --with-libnuma --with-sge --without-tm --with-openib=/opt/ofed \
    CC=gcc CXX=g++ F77=gfortran FC=gfortran
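Since the behavior depends on whether SGE support actually made it into the
build, it may be worth checking the installed ompi_info output for the
gridengine components (a quick diagnostic, assuming the install prefix above):

```shell
# Confirm that this OpenMPI build includes SGE (gridengine) support;
# a gridengine ras component should appear in the listing.
/share/apps/openmpi/gcc/bin/ompi_info | grep gridengine
```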

The PE I'm using is

# qconf -sp mpi
pe_name            mpi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

My test program is a simple mpihello compiled as 

# /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c 
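(The source of mpihello.c is not included in the post; a minimal version
that also reports the host each rank runs on, which helps diagnose
placement, might look like this:)

```c
/* Hypothetical mpihello.c: each rank prints its rank, the total
 * number of ranks, and the host it is running on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```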

and submitted with

# run job from current working directory
#$ -cwd
# combine stdout and stderr of job
#$ -j y
# use this shell as the default shell
#$ -S /bin/bash

# "-l" specifies resource requirements of job.  In this case we are
# asking for 30 mins of computational time, as a hard requirement.
#$ -l h_cpu=00:30:00
# parallel environment and number of cores to use
#$ -pe mpi 16
# computational command to run
/share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./mpihello
exit 0

This leads to 16 processes running on a single node that has only 12 cores.
If I omit the -machinefile option to mpirun, I get the following error:

error: error: ending connection before all data received
error reading job context from "qlogin_starter"
A daemon (pid 16022) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
mpirun: clean termination accomplished

Pointing LD_LIBRARY_PATH at the OpenMPI libraries in the submission script
does not help either.

Any suggestions?



