[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

k_clevenger kclevenger at coh.org
Wed Dec 16 19:16:43 GMT 2009


When a job is submitted through SGE, all of its tasks execute on a single node. If I run the same job with mpiexec from the command line, the tasks are distributed across the nodes correctly.

I have reviewed "OpenMPI job on stay on one node", "Using ssh with qrsh and qlogin", the SGE sections on the OpenMPI site, etc., without finding a solution.

Nodes: 16-core x86_64 blades
OS (all): CentOS 5.4 x86_64
SGE Version: 6_2u4
OpenMPI Version: 1.3.3 compiled with --with-sge
ompi_info: MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
IPTables off

PE:
pe_name            openmpi
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge-6_2u4/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge-6_2u4/mpi/stopmpi.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

SGE script:
#!/bin/sh
#$ -pe openmpi 22
#$ -N Para1
#$ -cwd
#$ -j y
#$ -V
#
mpiexec -np $NSLOTS -machinefile $TMPDIR/machines ./hello_c
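
One way to narrow this down is to check what SGE actually granted the job before mpiexec starts. The following is only a sketch and not part of the job script above. It assumes the standard SGE job environment variables $PE_HOSTFILE and $NSLOTS and the usual pe_hostfile layout (host name, slot count, queue, processor range on each line). The function name summarize_pe_hostfile is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "host slots" for each line of an SGE pe_hostfile,
# followed by the slot total, so the allocation can be compared with $NSLOTS.
# Assumes the host name is field 1 and the slot count is field 2.
summarize_pe_hostfile() {
    awk '{ print $1, $2; total += $2 } END { print "total", total + 0 }' "$1"
}

# Only report when running inside an SGE parallel job, where PE_HOSTFILE is set.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "NSLOTS=$NSLOTS"
    summarize_pe_hostfile "$PE_HOSTFILE"
fi
```

Placed ahead of the mpiexec line, this would show whether the 22 slots were granted on one host or on two. If only one host is listed, the allocation is the place to look; if both are listed, the launch step is the more likely suspect.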

Run via SGE
Hello, world, I am 0 of 22 running on sunnode00.coh.org
Hello, world, I am 1 of 22 running on sunnode00.coh.org
...
Hello, world, I am 20 of 22 running on sunnode00.coh.org
Hello, world, I am 21 of 22 running on sunnode00.coh.org

All 22 tasks run on sunnode00

Run via cmdline 'mpiexec -np 22 -machinefile $HOME/machines ./hello_c'
Hello, world, I am 0 of 22 running on sunnode00.coh.org
Hello, world, I am 1 of 22 running on sunnode01.coh.org
....
Hello, world, I am 20 of 22 running on sunnode00.coh.org
Hello, world, I am 21 of 22 running on sunnode01.coh.org

11 tasks run on sunnode00 and 11 tasks run on sunnode01
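
For longer runs, counting the lines by hand gets tedious. A small sketch for tallying ranks per host from hello_c-style output follows. It assumes every relevant line contains "running on" and ends with the host name, as in the output above; the function name count_ranks_per_host is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read hello_c-style output on stdin and print
# one "host count" pair per host, sorted by host name.
# Lines that do not contain "running on" are ignored.
count_ranks_per_host() {
    awk '/running on/ { n[$NF]++ } END { for (h in n) print h, n[h] }' | sort
}
```

The program output can be piped into the function, or the job's output file can be redirected to its stdin. For the two runs above it would report 22 ranks on sunnode00 in the SGE case, and 11 ranks on each of sunnode00 and sunnode01 in the command-line case.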

I also get all 22 tasks running on one node if I run something like 'qrsh -V -verbose -pe openmpi 22 mpirun -np 22 -machinefile $HOME/machines $HOME/test/hello'

qconf -sconf output is attached

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233785


    [ Part 2, "qconf.txt"  Text/PLAIN (Name: "qconf.txt") ~2.1 KB. ]
    [ Unable to print this part. ]


