[GE users] SGE + MVAPICH2 Loose Integration

Sangamesh B forum.san at gmail.com
Fri Sep 5 13:51:08 BST 2008



Hi All,

The cluster has 33 nodes (dual-processor, quad-core) with Mellanox
InfiniBand hardware.

The compute nodes have the following IP addresses:

Ethernet port:

172.16.1.254    compute-0-0.local compute-0-0 c0-0
172.16.1.253    compute-0-1.local compute-0-1 c0-1
...
...
172.16.1.223    compute-0-31.local compute-0-31 c0-31

InfiniBand port:

173.16.1.254    ibc0
173.16.1.253    ibc1
...
...
173.16.1.223    ibc31

For the parallel job submission I used the mpich2 PE (which was set up for
MPICH2):

# qconf -sp mpich2
pe_name           mpich2
slots             9999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

The SGE script is as follows:

#!/bin/bash
#$ -N NAMD_SGE_PRL
#$ -q all.q
#$ -cwd
#$ -e Err.$JOB_NAME.$JOB_ID
#$ -o Out.$JOB_NAME.$JOB_ID
#$ -pe mpich2 16

/data/mvapich2_intel/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS \
    /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 \
    /home/user1/namd_bench/apoa1/apoa1.namd

It didn't run, and gave the following error:

$ cat Out.NAMD_SGE_PRL.22
/opt/gridengine/default/spool/compute-0-15/active_jobs/22.1/pe_hostfile
compute-0-15
compute-0-15
..
compute-0-31
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
mpiexec: unable to start all procs; may have invalid machine names
    remaining specified hosts:
        172.16.1.223 (compute-0-31.local)
        172.16.1.239 (compute-0-15.local)

The cause seems clear: MVAPICH2's mpdboot was done over the IB interface
IP addresses (the 173.16.1.x series), but since the PE is mpich2, the
startmpi.sh script prepares the machinefile from the Ethernet hostnames.
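
One possible direction, given only as a rough sketch: translate the Ethernet
hostnames in the pe_hostfile into the IB names before writing the
machinefile. It assumes that each pe_hostfile line starts with
"<hostname> <slots>" and that compute-0-N always corresponds to ibcN, as in
the host tables above. The function name pe_to_ib_machines is hypothetical;
it is not part of SGE or MVAPICH2.

```shell
# Sketch only: build an IB machinefile from an SGE pe_hostfile.
# Assumptions: each line begins with "<hostname> <slots>", and the node
# compute-0-N is reachable over InfiniBand as ibcN.
# pe_to_ib_machines is a hypothetical helper name.
pe_to_ib_machines() (
    # The subshell body keeps the loop variables out of the caller's scope.
    while read -r host slots rest; do
        [ -z "$host" ] && continue
        host=${host%%.*}      # drop a domain suffix such as ".local"
        n=${host##*-}         # trailing node index, e.g. 15
        i=0
        # Emit one line per slot granted on this host.
        while [ "$i" -lt "$slots" ]; do
            echo "ibc$n"
            i=$((i + 1))
        done
    done < "$1"
)
```

It could be called as pe_to_ib_machines "$PE_HOSTFILE" > "$TMPDIR/machines"
before mpirun, or from startmpi.sh with its pe_hostfile argument. Whether
that alone is enough for MVAPICH2 is not something I have verified.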

I then followed the document "MVAPICH Integration with SGE" at:

http://gridengine.sunsource.net/howto/mvapich/MVAPICH_Integration.html

But this document doesn't apply to MVAPICH2, since MVAPICH2 has no
mpirun_rsh and related tools.

Does anyone on the list have a solution for this? What needs to be changed
in the startmpi.sh script?

Thank you,
Sangamesh


