[GE users] mpich2 configuration

markhewitt mh613 at york.ac.uk
Tue Nov 3 13:53:08 GMT 2009

I'm having problems getting my mvapich2 (an implementation of mpich2) 
working via SGE using tight integration.

I'm using the MPD method, and everything seems to work OK when I run 
things manually:
./mpdboot -n 18 -f hostfile
 <shows list of machines>
./mpirun -machinefile hostfile -n 18 ./cpi
This runs and shows the expected output, and I can see the processes 
running on each individual node.

I've set up a parallel environment for it as per Reuti's explanation page:
pe_name            mpich2_mpd
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/n1ge6/mpich2_mpd/startmpich2.sh -catch_rsh \
                  $pe_hostfile /wrg/software/SL4.x86-64/mvapich2
stop_proc_args     /opt/n1ge6/mpich2_mpd/stopmpich2.sh -catch_rsh \
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
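
For reference, the start script named in start_proc_args receives the path 
in $pe_hostfile and is expected to turn it into the machines file that the 
job later reads from $TMPDIR/machines. The sketch below shows roughly what 
that conversion looks like. It is not the actual startmpich2.sh; the 
function name pe_hostfile_to_machines is hypothetical, and it assumes each 
line of the hostfile has the form "<hostname> <slots> <queue> 
<processor-range>" and that "host:slots" is an acceptable machinefile entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or MPICH2): convert an SGE
# pe_hostfile into "host:slots" lines, one per allocated host.
# Assumed input line format: <hostname> <slots> <queue> <processor-range>
pe_hostfile_to_machines() {
    while read -r host nslots _rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        echo "${host}:${nslots}"
    done < "$1"
}
```

Inside a start script this could be called as 
pe_hostfile_to_machines "$1" > "$TMPDIR/machines", assuming the hostfile 
path is the first argument after any options.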

I've also compiled the helper applications with ./aimk; ./install.sh, 
which worked fine.

Now I come to submit a job to the SGE cluster. This is my submission file:

#$ -cwd
#$ -V
#$ -pe mpich2_mpd 1
# This is just because only some machines have working IB cards.
#$ -l mpi_htx2=true

export MPICH2_ROOT=/wrg/software/SL4.x86_64/mvapich2
export PATH=$MPICH2_ROOT/bin:$PATH

echo "Got $NSLOTS slots."
# The order of arguments is important. First global, then local options.
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS ~/cpilog
exit 0

I then submit the job using qsub mpich2_mpd.sh

I get back a .pe file with the following contents:
usage: start_mpich2 [-n <hostname>] mpich2-mpd-path [mpd-parameters ..]

where: 'hostname' gives the name of the target host
Host key verification failed.

Sometimes I don't get the usage error for some reason, but I always get 
"Host key verification failed". I've checked and SSH is enabled without 
password between all MPI hosts, plus the MVAPICH1 installation we have 
on SGE works.
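
One way to narrow down the host key failure might be to test 
non-interactive SSH against the exact host names SGE hands out, since 
known_hosts entries are matched by name and the names in the PE hostfile 
can differ from the ones used by hand. This is only a minimal sketch: the 
function name check_ssh_hosts and the SSH_CMD override are hypothetical, 
and it assumes the host name is the first column of each line.

```shell
#!/bin/sh
# Hypothetical diagnostic: attempt a non-interactive SSH login to every
# host named in the first column of a host file and report the result.
# SSH_CMD is an illustrative override so the loop can be dry-run.
check_ssh_hosts() {
    hostfile=$1
    failed=0
    while read -r host _rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        # BatchMode=yes makes ssh fail instead of prompting for a
        # password or asking to accept an unknown host key.
        # stdin is redirected so ssh does not consume the host list.
        if ${SSH_CMD:-ssh} -o BatchMode=yes "$host" true \
                </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}
```

Running check_ssh_hosts "$PE_HOSTFILE" from inside a submitted job, rather 
than from a login shell, would exercise the same user, environment and 
host names that the parallel environment sees.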

Does anyone have any ideas what configuration errors I have here? I'm 
fairly sure it's a configuration error with SGE rather than my 
MPI/MVAPICH installation as everything works ok when I run things 
outside of SGE.

Many thanks for your help.


