[GE users] Problem with the mpich2 sge integration for an mpiblast run [Solved]

Matthias Neder matthias.neder at gmail.com
Fri Apr 11 15:53:40 BST 2008



Hi,

For the impatient reader: the problem is solved.

The long version:

OK, I reinstalled MPICH2 and deleted all the old files first.
My steps:
./configure --prefix=/opt/sge-root/mpich2_smpd --with-pm=smpd --with-pmi=smpd
make
make install

Then I added/modified the parallel environment mpich2_smpd_rsh:
#########################
pe_name           mpich2_smpd_rsh
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge-root/mpich2_smpd_rsh/startmpich2.sh -catch_rsh  \
                  $pe_hostfile
stop_proc_args    /opt/sge-root/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
#########################
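For completeness: after writing the PE configuration above to a file, it still has to be registered with SGE and attached to the queue. A sketch of those steps, assuming the config was saved as mpich2_smpd_rsh.pe (hypothetical filename) and the queue is all.q, as in the qstat output below:

```shell
# Register the PE from the saved configuration file:
qconf -Ap mpich2_smpd_rsh.pe

# Attach the PE to the cluster queue so jobs can request it with -pe:
qconf -aattr queue pe_list mpich2_smpd_rsh all.q

# Verify the PE configuration:
qconf -sp mpich2_smpd_rsh
```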

Copied the mpich2_smpd_rsh directory from mpich2-60.tgz to /opt/sge-root/,
copied mpich2_smpd_rsh.sh from the mpihello package into the same directory,
compiled mpihello, and then fixed the paths in the *.sh files.
Finally I ran:

# qrsh -pe mpich2_smpd_rsh 4 /opt/sge-root/mpich2_smpd_rsh/mpich2_smpd_rsh.sh

and found these nodes:
[15:10:56-root@HeadNode mpiblast]# qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID               state cpu      mem     io      stat failed
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
    306 0.55500 mpich2_smp root         r     04/11/2008 15:10:07 all.q@Node-192-168-60-161.inte SLAVE             1.Node-192-168-60-161 r     00:00:36 1.58433 0.00000
    306 0.55500 mpich2_smp root         r     04/11/2008 15:10:07 all.q@Node-192-168-60-165.inte SLAVE             1.Node-192-168-60-165 r     00:00:43 1.88270 0.00000
    306 0.55500 mpich2_smp root         r     04/11/2008 15:10:07 all.q@Node-192-168-60-166.inte MASTER                                  r     00:00:00 0.00035 0.00000
                                                                  all.q@Node-192-168-60-166.inte SLAVE             1.Node-192-168-60-166 r     00:00:38 1.63002 0.00000
    306 0.55500 mpich2_smp root         r     04/11/2008 15:10:07 all.q@Node-192-168-60-193.inte SLAVE             1.Node-192-168-60-193 r     00:00:23 1.02186 0.00000

and got this on the master node (166):
########################
 1458     1  1458 /opt/sge-root/bin/lx24-amd64/sge_execd
14489  1458 14489  \_ sge_shepherd-306 -bg
14528 14489 14528  |   \_ /opt/sge-root/utilbin/lx24-amd64/rshd -l
14529 14528 14529  |       \_ /opt/sge-root/utilbin/lx24-amd64/qrsh_starter /opt
14530 14529 14530  |           \_ /bin/sh /opt/sge-root/mpich2_smpd_rsh/mpich2_s
14531 14530 14530  |               \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/
14532 14531 14530  |                   \_ mpiexec -rsh -nopm -n 4 -machinefile /
14533 14531 14530  |                   \_ /opt/sge-root/bin/lx24-amd64/qrsh -inh
14565 14533 14530  |                   |   \_ /opt/sge-root/utilbin/lx24-amd64/r
14571 14565 14530  |                   |       \_ /opt/sge-root/utilbin/lx24-amd
14534 14531 14530  |                   \_ /opt/sge-root/bin/lx24-amd64/qrsh -inh
14564 14534 14530  |                   |   \_ /opt/sge-root/utilbin/lx24-amd64/r
14568 14564 14530  |                   |       \_ /opt/sge-root/utilbin/lx24-amd
14535 14531 14530  |                   \_ /opt/sge-root/bin/lx24-amd64/qrsh -inh
14566 14535 14530  |                   |   \_ /opt/sge-root/utilbin/lx24-amd64/r
14569 14566 14530  |                   |       \_ /opt/sge-root/utilbin/lx24-amd
14536 14531 14530  |                   \_ /opt/sge-root/bin/lx24-amd64/qrsh -inh
14563 14536 14530  |                       \_ /opt/sge-root/utilbin/lx24-amd64/r
14567 14563 14530  |                           \_ /opt/sge-root/utilbin/lx24-amd
14561  1458 14561  \_ sge_shepherd-306 -bg
14562 14561 14562      \_ /opt/sge-root/utilbin/lx24-amd64/rshd -l
14570 14562 14570          \_ /opt/sge-root/utilbin/lx24-amd64/qrsh_starter /opt
14572 14570 14572              \_ /opt/sge-root/mpich2_smpd_rsh/mpihello
########################
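The -machinefile /tmp/... visible in the tree is built by the PE's start_proc_args from SGE's $pe_hostfile. As a rough sketch of that transformation (the Howto's startmpich2.sh does more, e.g. setting up the rsh wrapper, and the function name here is made up):

```shell
#!/bin/sh
# Each $pe_hostfile line looks like: "<host> <slots> <queue> <processor-range>"
# Expand every host <slots> times into a flat machinefile, which is the
# format that "mpiexec -machinefile" expects.
make_machinefile() {
    # $1 = path to pe_hostfile, $2 = machinefile to write
    : > "$2"
    while read host slots queue rest; do
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host" >> "$2"
            i=$((i + 1))
        done
    done < "$1"
}
```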

Is this ok? I think so.

OK, now I tried cpi.
I compiled cpi from the examples folder of the source tree (mpicc -o cpi cpi.c),
copied cpi to the /opt/sge-root/mpich2_smpd_rsh/ folder,
changed mpich2_smpd_rsh.sh to run cpi, and ran the command:
########################################
[15:24:26-root@HeadNode mpich2_smpd_rsh]# qrsh -pe mpich2_smpd_rsh 4
/opt/sge-root/mpich2_smpd_rsh/mpich2_smpd_rsh_cpi.sh
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-171 env
PMI_RANK=1 PMI_SIZE=4 PMI_KVS=36CEA22E6779DE89541FB4AC529B0E41
PMI_ROOT_HOST=Node-192-168-60-170 PMI_ROOT_PORT=46074 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpich2_smpd_rsh/cpi
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-174 env
PMI_RANK=3 PMI_SIZE=4 PMI_KVS=36CEA22E6779DE89541FB4AC529B0E41
PMI_ROOT_HOST=Node-192-168-60-170 PMI_ROOT_PORT=46074 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpich2_smpd_rsh/cpi
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-170 env
PMI_RANK=0 PMI_SIZE=4 PMI_KVS=36CEA22E6779DE89541FB4AC529B0E41
PMI_ROOT_HOST=Node-192-168-60-170 PMI_ROOT_PORT=46074 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpich2_smpd_rsh/cpi
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-172 env
PMI_RANK=2 PMI_SIZE=4 PMI_KVS=36CEA22E6779DE89541FB4AC529B0E41
PMI_ROOT_HOST=Node-192-168-60-170 PMI_ROOT_PORT=46074 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpich2_smpd_rsh/cpi
Process 3 of 4 is on Node-192-168-60-174
Process 1 of 4 is on Node-192-168-60-171
Process 2 of 4 is on Node-192-168-60-172
Process 0 of 4 is on Node-192-168-60-170
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.001564
#############################################################
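The mpich2_smpd_rsh_cpi.sh wrapper itself is not shown above; modeled on the mpiblast wrapper further down, it presumably looks like this (a sketch, not the verbatim file):

```shell
#!/bin/sh

export MPIEXEC_RSH=rsh
export PATH=/opt/sge-root/mpich2_smpd/bin:$PATH

# SGE fills in $NSLOTS and creates $TMPDIR/machines via the PE's start_proc_args
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
    /opt/sge-root/mpich2_smpd_rsh/cpi

exit 0
```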

That looks great as well! I think.

So now I tried mpiBLAST:
######################
[15:33:53-root@HeadNode mpich2_smpd_rsh]# cat mpich2_smpd_rsh_mpiblast.sh
#!/bin/sh

export MPIEXEC_RSH=rsh
export PATH=/opt/sge-root/mpich2_smpd/bin:$PATH

mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
    /opt/sge-root/mpiblast/bin/mpiblast -d nr -i /mnt/san/mpiblast/KN-1143_QC.fas \
    -p blastx -o /mnt/san/mpiblast/KN-1143_QC-result2.txt

exit 0
[15:34:18-root@HeadNode mpich2_smpd_rsh]# which mpiexec
/opt/sge-root/mpich2_smpd/bin/mpiexec
[15:34:49-root@HeadNode mpich2_smpd_rsh]# qrsh -pe mpich2_smpd_rsh 6
/opt/sge-root/mpich2_smpd_rsh/mpich2_smpd_rsh_mpiblast.sh
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Sorry, mpiBLAST must be run on 3 or more nodes
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-178 env
PMI_RANK=1 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-160 env
PMI_RANK=0 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-193 env
PMI_RANK=5 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-191 env
PMI_RANK=3 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-192 env
PMI_RANK=4 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
/opt/sge-root/bin/lx24-amd64/qrsh -inherit Node-192-168-60-190 env
PMI_RANK=2 PMI_SIZE=6 PMI_KVS=6BCDD02C51FD97F87CD911A17EEE456C
PMI_ROOT_HOST=Node-192-168-60-160 PMI_ROOT_PORT=38275 PMI_ROOT_LOCAL=0
PMI_APPNUM=0 /opt/sge-root/mpiblast/bin/mpiblast -d nr -i
/mnt/san/mpiblast/KN-1143_QC.fas -p blastx -o
/mnt/san/mpiblast/KN-1143_QC-result2.txt
[15:35:05-root@HeadNode mpich2_smpd_rsh]#
#############################

So mpiBLAST is sent to all the nodes as well, but every process aborts with
"must be run on 3 or more nodes" even though 6 slots were requested: each rank
apparently saw a world size of 1, presumably because the binary was still linked
against the mpd-based MPICH2 rather than the smpd build.

Recompiling mpiBLAST against the smpd-built MPICH2 did the job:
############################

./configure --prefix=/opt/sge-root/mpiblast/ \
            --with-mpi=/opt/sge-root/mpich2_smpd

# if an error occurs during the NCBI part, run makedis.csh again:
./ncbi/make/makedis.csh
./ncbi/make/makedis.csh
make
make install
###########################
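With the recompiled binary, the same run can also go through qsub as a batch job instead of an interactive qrsh. A sketch, assuming the paths above (the job name and script are made up):

```shell
#!/bin/sh
#$ -N mpiblast
#$ -pe mpich2_smpd_rsh 6
#$ -cwd -j y

export MPIEXEC_RSH=rsh
export PATH=/opt/sge-root/mpich2_smpd/bin:$PATH

# SGE provides $NSLOTS and $TMPDIR/machines exactly as in the qrsh case
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
    /opt/sge-root/mpiblast/bin/mpiblast -d nr \
    -i /mnt/san/mpiblast/KN-1143_QC.fas -p blastx \
    -o /mnt/san/mpiblast/KN-1143_QC-result2.txt
```

Submitted with, e.g., qsub submit_mpiblast.sh.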


Now I have a fully SGE-integrated mpiBLAST!
Woohoo, thanks for all the tips.

Greetings Matthias





2008/4/11, Reuti <reuti at staff.uni-marburg.de>:
>
> Hi,
>
> Am 11.04.2008 um 12:29 schrieb Matthias Neder:
>
> > i still have some problems with the integration of mpich2 to sge.
> >
> > First what i installed:
> > -Installed sge 6.0
> > -Installed mpich2-1.0.7rc1.tar.gz as described here:
> > http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> > (the "Tight Integration of the daemonless smpd startup method")
> > [12:04:16-root@HeadNode mpiblast]# qrsh -pe mpich2_smpd_rsh 4  mpiexec
> > -n 4 /opt/sge-root/mpich2/examples/cpi
> >
>
> please have a look at the file mpich2-60/mpich2_smpd_rsh/mpich2.sh in the
> supplied archive in the Howto: you also have to specify the hostfile. But I
> doubt, that it's possible just on the commandline. The $TMPDIR would be
> evaluated too early; or not at all if you put it in single quotation marks.
>
> (BTW: can you please check, whether the mpiexec is the one from MPICH2 you
> compiled for daemonless startup? The type of startup must match binary of
> the program compiled with this startup type. You can't change the startup by
> using only another PE in SGE. You have to recompile your application and use
> the appropriate mpiexec.)
>
> > Process 0 of 1 is on Node-192-168-60-171
> > Process 0 of 1 is on Node-192-168-60-173
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000235
> > Process 0 of 1 is on Node-192-168-60-169
> > Process 0 of 1 is on Node-192-168-60-172
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000302
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000248
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000233
> > [12:04:53-root@HeadNode mpiblast]#
> >
> > ####################
> > So the job is send to all nodes, at the same time.
> >
> > I can ran the cpi with commands:
> > mpdboot --ncpus=2 -n 25 -v -f /opt/sge-root/mpiblast/allhostlist &&
> > /opt/sge-root/mpich2/bin/mpiexec -n 24 ../mpich2/examples/cpi &&mpdallexit
> > The output looks good. pi is calculated though all nodes.
> >
> > I also tried the Tight Integration of the daemon-based smpd startup
> > method
> > Version with the pe:
> > ######################
> > [12:23:27-root@HeadNode mpiblast]# qconf -sp mpich2_smpd
> > pe_name           mpich2_smpd
> > slots             999
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /opt/sge-root/mpich2_smpd/startmpich2.sh -catch_rsh \
> >                  $pe_hostfile /opt/sge-root/mpich2_smpd
> > stop_proc_args    /opt/sge-root/mpich2_smpd/stopmpich2.sh -catch_rsh \
> >                  /opt/sge-root/mpich2_smpd
> > allocation_rule   $round_robin
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > ######################
> > If I start the PE, I get this:
> > ##########################
> > [12:23:56-root@HeadNode mpiblast]# qrsh -pe mpich2_smpd 4
> >  /opt/sge-root/mpich2_smpd/bin/mpiexec -n 4
> > /opt/sge-root/mpich2/examples/cpi
> > op_connect error: socket connection failed, error stack:
> > MPIDU_Socki_handle_connect(791): connection failure
> > (set=1,sock=16777216,errno=111:Connection refused)
> >
>
> Which rsh/ssh are you using? With a Tight Integration a random port is
> chosen, as long as you set MPICH2 to use a plain rsh (without a path). For
> this to work, any firewall on the nodes must at least allow unrestricted
> connections from other nodes in the cluster. You also have to set a port and
> supply it to the mpiexec as outlined in: mpich2-60/mpich2_smpd/mpich2.sh
>
>
> > unable to connect mpiexec tree, socket connection failed, error stack:
> > MPIDU_Socki_handle_connect(791): connection failure
> > (set=1,sock=16777216,errno=111:Connection refused).
> > [12:24:30-root@HeadNode mpiblast]#
> > ###################
> >
> > So i am a bit confused.
> >
> > I get the mpd ring runnin and got the cpi runnin in the ring. But the
> > integration failed in both ways, one time it starts the cpi on every node,
> > the other time it failed completly.
> >
> > Someone got an idea for me? Or better which way of integration should i
> > use for the mpiblast? Which is the best one for the mpiblast?
> >
>
> Could it work with Open MPI? The integration is in this case more straight
> forward http://www.open-mpi.org
>
> -- Reuti
>
>
>
> > Thx in advance.
> > Matthias
> >
>
>
>


