[GE users] All OpenMPI process run on same node

bwillems bwi565 at gmail.com
Mon Oct 25 17:29:15 BST 2010


Hi All,

I'm having trouble with SGE/OpenMPI on a Rocks cluster: all processes
of a parallel job tend to run on the same node. I have searched the
forums, but past posts on this do not solve my problem. I compiled
OpenMPI with

# ./configure --prefix=/share/apps/openmpi/gcc --enable-static
--with-libnuma --with-sge --with-openib=/opt/ofed CC=gcc CXX=g++
F77=gfortran FC=gfortran

The PE I'm using is

# qconf -sp mpi
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
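One sanity check worth running first (assuming the ompi_info binary from
this same build is used): confirm that the OpenMPI installation actually
contains the gridengine components that --with-sge is supposed to enable.

```shell
# If this prints nothing (e.g. no "MCA ras: gridengine" line), the build
# lacks SGE support and mpirun cannot honor the SGE slot allocation.
/share/apps/openmpi/gcc/bin/ompi_info | grep -i gridengine
```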

My test program is a simple mpihello compiled as

# /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c

and submitted with

# run job from current working directory
#$ -cwd
# combine stdout and stderr of job
#$ -j y
# use this shell as the default shell
#$ -S /bin/bash

# "-l" specifies resource requirements of job. In this case we are
# asking for 30 mins of computational time, as a hard requirement.
#$ -l h_cpu=00:30:00
# parallel environment and number of cores to use
#$ -pe mpi 16
# computational command to run
/share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np
$NSLOTS ./mpihello
exit 0
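For comparison, with a working SGE-aware OpenMPI build the usual
approach is to rely on tight integration and let mpirun read the PE
allocation itself, rather than passing -machinefile. A minimal sketch
of such a submit script (same directives as above, untested on this
cluster):

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -l h_cpu=00:30:00
#$ -pe mpi 16
# An SGE-aware mpirun detects the granted slots and hosts from the
# Grid Engine environment, so neither -machinefile nor -np is needed.
/share/apps/openmpi/gcc/bin/mpirun ./mpihello
```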

This leads to all 16 processes running on a single node that has only
12 cores. If I omit the -machinefile option to mpirun, I get the
following error:
error: error: ending connection before all data received
error reading job context from "qlogin_starter"
A daemon (pid 16022) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
mpirun: clean termination accomplished

Pointing LD_LIBRARY_PATH to the libraries in the submission script
does not help either.
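Since the error message points at shared libraries on the remote nodes,
one alternative to exporting LD_LIBRARY_PATH is mpirun's own --prefix
option, which makes the remote daemons set their PATH and
LD_LIBRARY_PATH from the given installation root (invoking mpirun by
its absolute path has the same effect in Open MPI builds configured
with --enable-mpirun-prefix-by-default). A sketch, assuming the same
install prefix as above:

```shell
# --prefix tells the remote orted daemons where this OpenMPI install
# lives, so they no longer depend on the remote shell's startup files.
/share/apps/openmpi/gcc/bin/mpirun --prefix /share/apps/openmpi/gcc \
    -np $NSLOTS ./mpihello
```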

Any suggestions?


