[GE users] Problem with the mpich2 sge integration for an mpiblast run

Reuti reuti at staff.uni-marburg.de
Fri Apr 11 12:12:03 BST 2008


Hi,

On 11.04.2008, at 12:29, Matthias Neder wrote:
> I still have some problems with the integration of MPICH2 with SGE.
>
> First, what I installed:
> - Installed SGE 6.0
> - Installed mpich2-1.0.7rc1.tar.gz as described here:
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> following the "Tight Integration of the daemonless smpd startup method".
> [12:04:16-root at HeadNode mpiblast]# qrsh -pe mpich2_smpd_rsh 4 mpiexec -n 4 /opt/sge-root/mpich2/examples/cpi

Please have a look at the file mpich2-60/mpich2_smpd_rsh/mpich2.sh in
the archive supplied with the Howto: you also have to specify the
hostfile. I doubt that this is possible directly on the command line:
$TMPDIR would be evaluated too early, or not at all if you put it in
single quotation marks.
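
Something along these lines in a job script should work. This is only
a sketch from memory; the exact flags (-rsh, -nopm, -machinefile) and
paths should be checked against the mpich2.sh that ships with the Howto:

  #!/bin/sh
  # mpich2.sh (sketch) -- submit with: qsub -pe mpich2_smpd_rsh 4 mpich2.sh
  # $NSLOTS and $TMPDIR are set by SGE only at run time; the PE's start
  # script is expected to have written the machine file to $TMPDIR/machines.
  export PATH=/opt/sge-root/mpich2/bin:$PATH
  mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
      /opt/sge-root/mpich2/examples/cpi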

(BTW: can you please check whether this mpiexec is the one from the
MPICH2 you compiled for daemonless startup? The startup method must
match the one the application binary was compiled for; you can't
change it just by using a different PE in SGE. You have to recompile
your application and use the matching mpiexec.)
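
To verify, something like this on a node should tell you which mpiexec
is being picked up (the paths are only an example):

  which mpiexec
  /opt/sge-root/mpich2/bin/mpich2version   # if present, shows how this MPICH2 build was configured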

> Process 0 of 1 is on Node-192-168-60-171
> Process 0 of 1 is on Node-192-168-60-173
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000235
> Process 0 of 1 is on Node-192-168-60-169
> Process 0 of 1 is on Node-192-168-60-172
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000302
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000248
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000233
> [12:04:53-root at HeadNode mpiblast]#
>
> ####################
> So the job is sent to all nodes at the same time.
>
> I can run cpi with these commands:
> mpdboot --ncpus=2 -n 25 -v -f /opt/sge-root/mpiblast/allhostlist
> && /opt/sge-root/mpich2/bin/mpiexec -n 24 ../mpich2/examples/cpi
> && mpdallexit
> The output looks good; pi is calculated across all nodes.
>
> I also tried the Tight Integration of the daemon-based smpd startup
> method. The version with the PE:
> ######################
> [12:23:27-root at HeadNode mpiblast]# qconf -sp mpich2_smpd
> pe_name           mpich2_smpd
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/sge-root/mpich2_smpd/startmpich2.sh - 
> catch_rsh \
>                   $pe_hostfile /opt/sge-root/mpich2_smpd
> stop_proc_args    /opt/sge-root/mpich2_smpd/stopmpich2.sh -catch_rsh \
>                   /opt/sge-root/mpich2_smpd
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> ######################
> If I start a job with this PE, I get this:
> ##########################
> [12:23:56-root at HeadNode mpiblast]# qrsh -pe mpich2_smpd 4  /opt/sge- 
> root/mpich2_smpd/bin/mpiexec -n 4 /opt/sge-root/mpich2/examples/cpi
> op_connect error: socket connection failed, error stack:
> MPIDU_Socki_handle_connect(791): connection failure  
> (set=1,sock=16777216,errno=111:Connection refused)

Which rsh/ssh are you using? With a Tight Integration a random port
is chosen, as long as you set MPICH2 to use a plain rsh (without a
path). For this to work, any firewall on the nodes must at least
allow unrestricted connections from the other nodes in the cluster.
You also have to set a port and supply it to mpiexec, as outlined in
mpich2-60/mpich2_smpd/mpich2.sh.
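
As a sketch (the port arithmetic is only an example, and the -port
option must match what your smpd-based mpiexec accepts; again, see the
mpich2.sh from the Howto for the authoritative version):

  #!/bin/sh
  # mpich2.sh (sketch) for the daemon-based smpd method.
  # Use a per-job port so two parallel jobs on one node don't collide;
  # the PE's start/stop scripts must use the same port for the smpds.
  port=$((JOB_ID % 5000 + 20000))
  export PATH=/opt/sge-root/mpich2_smpd/bin:$PATH
  mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port \
      /opt/sge-root/mpich2/examples/cpi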


> unable to connect mpiexec tree, socket connection failed, error stack:
> MPIDU_Socki_handle_connect(791): connection failure  
> (set=1,sock=16777216,errno=111:Connection refused).
> [12:24:30-root at HeadNode mpiblast]#
> ###################
>
> So I am a bit confused.
>
> I get the mpd ring running and cpi runs fine in the ring. But the
> integration failed both ways: one time it starts cpi on every node,
> the other time it fails completely.
>
> Does anyone have an idea for me? Or rather: which integration method
> should I use for mpiBLAST? Which one is best for it?

Could it work with Open MPI? The integration is more straightforward
in that case: http://www.open-mpi.org
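
If you want to try it, the usual recipe (only a sketch, not from the
Howto above) is to build Open MPI with Grid Engine support and use an
almost empty PE; Open MPI then picks up the host list from SGE itself:

  # depending on the Open MPI version, SGE support may need to be
  # enabled explicitly at configure time
  ./configure --prefix=/opt/openmpi --with-sge
  make all install

  # example PE (names and values are only illustrative)
  pe_name           orte
  slots             999
  start_proc_args   /bin/true
  stop_proc_args    /bin/true
  allocation_rule   $round_robin
  control_slaves    TRUE
  job_is_first_task FALSE

In the job script a plain "mpirun -np $NSLOTS ./cpi" is then enough.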

-- Reuti


>
> Thx in advance.
> Matthias


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



