[GE users] TI of MPICH2 + SGE

Reuti reuti at staff.uni-marburg.de
Fri Oct 3 19:47:25 BST 2008


Hi,

On 01.10.2008 at 07:43, Sangamesh B wrote:

> On Fri, Jul 11, 2008 at 3:50 PM, Reuti <reuti at staff.uni-marburg.de>  
> wrote:
> Hiho,
>
> On 11.07.2008 at 07:00, Sangamesh B wrote:
>
>
>      I'm performing the tight integration of MPICH2 with Sun Grid
> Engine using the smpd process manager. I referred to the document by
> Reuti available at:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> The cluster has two nodes (1 master + 1 compute, AMD64 dual core),
> but the compute node is not working.
> So I'm testing the TI only on the master node. Is this ok?
>
> yes.
>
>
> Some environment setup is:
>
> SGE_ROOT=/opt/gridengine
>
> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>
> I've done all the steps mentioned in the document.
>
> But neither the daemonless nor the daemon-based tight integration is
> working.
>
> With the daemon-based method, the error is:
>
> The thing to realize is how the calling chain of the tools should
> work:
>
> - MPICH2 will call rsh when started, provided you set
> "MPIEXEC_RSH=rsh; export MPIEXEC_RSH"
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an RSH daemon per "qrsh -inherit ..." on a random
> port
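
As an aside, the first step of this chain would look like the
following in a job script. Only the MPIEXEC_RSH line is quoted from
above; the PATH line is my assumption about the usual setup (the rsh
wrapper created in $TMPDIR has to be found before the system rsh):

  # make MPICH2's mpiexec start its remote processes via "rsh" ...
  MPIEXEC_RSH=rsh; export MPIEXEC_RSH
  # ... so that the call is caught by the rsh wrapper in $TMPDIR
  PATH=$TMPDIR:$PATH; export PATH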
>
> If you need to use SSH at all, you can have it this way:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - MPICH2 will call rsh
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random
> port
> (this means: you could even instruct MPICH2 to call "blabla" and
> create a wrapper for "blabla"; at this stage it's just a name which
> could be set to anything)
>
> Or:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - configure the start_proc_args of the PE to create an SSH wrapper
> instead of an RSH wrapper
> - MPICH2 will call ssh
> - ssh will be caught by SGE's SSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random
> port
>
> In all cases it will start on a random port. In none of these cases
> is there a need to have an rshd or sshd running all the time; SGE
> will start them for you.
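
To make the chain concrete: the rsh wrapper which startmpich2.sh
-catch_rsh puts into $TMPDIR essentially turns every "rsh <host>
<command>" into a "qrsh -inherit" call. A simplified sketch of such a
wrapper (not the actual script from the howto) could look like:

  #!/bin/sh
  # $TMPDIR/rsh: minimal stand-in for the SGE rsh wrapper
  # MPICH2 calls it as: rsh <hostname> <command ...>
  host=$1
  shift
  # hand the remote start over to SGE, so that the slave processes run
  # under the control of the local execd (tight integration, accounting)
  exec qrsh -inherit -nostdin "$host" "$@"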
>
> Question: is there any firewall blocking the traffic on certain
> ports (it shouldn't prevent a local call anyway), or some setting in
> /etc/hosts.allow or /etc/hosts.deny?
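
A quick way to check both on the node in question (generic commands,
nothing specific to this setup):

  # any DROP/REJECT rules that could affect the ports used by smpd/qrsh?
  /sbin/iptables -L -n
  # any TCP-wrapper restrictions for in.rshd or sshd?
  cat /etc/hosts.allow /etc/hosts.deny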
>
> -- Reuti
>
> Hi Reuti,
>
> Thanks for the reply.
>
> Now rsh is working, and the smpd daemonless tight integration is
> also done.
>
> But I'm facing a problem with the smpd daemon-based TI.
>
> Details:
>
> # qconf -sp mpich2_smpd_DB
> pe_name           mpich2_smpd_DB
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>                   $pe_hostfile /opt/mpich2/gnu/smpd
> stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>                   /opt/mpich2/gnu/smpd
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
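
(For anyone following along: such a PE definition is typically loaded
from a file and then attached to a queue's pe_list; the file and queue
names below are only examples:

  # register the PE from a saved definition file
  qconf -Ap mpich2_smpd_DB.txt
  # make the PE available in the queue that should run these jobs
  qconf -aattr queue pe_list mpich2_smpd_DB all.q

The "cannot run in queue test.q because PE ... is not in pe list"
message further down is what you get when the second step is missing
for a queue.)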
>
>
> $ cat mpi_s_db.sh
> #!/bin/sh
>
> #export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
>
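> # derive a per-job smpd port in the range 20000-24999 from the job id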
> port=$((JOB_ID % 5000 + 20000))
>
> #mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/reuti/mpihello
>
> /opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines \
>     -port $port /home/san/mpich2_smpd/hellompi
>
> exit 0
>
> $ qsub -pe mpich2_smpd_DB 8 mpi_s_db.sh
> Your job 181 ("mpi_s_db.sh") has been submitted
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     181 0.00000 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   4/4       0.00     lx26-amd64
>     181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   4/4       0.02     lx26-amd64
>     181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.01     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     181 0.60500 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>
>
> ERROR INFO
>
> $ qstat -j 181
> ..
>
> parallel environment:  mpich2_smpd_DB range: 8
> error reason    1:          09/30/2008 16:58:49 [400:868]: exit_status of pe_start = 1
> scheduling info:            queue instance "all.q@compute-0-0.local" dropped
>                             because it is temporarily not available
>                             cannot run in queue "test.q" because PE
>                             "mpich2_smpd_DB" is not in pe list
>                             cannot run in PE "mpich2_smpd_DB" because
>                             it only offers 4 slots
>
> $ cat mpi_s_db.sh.pe181
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> error: error reading returncode of remote command
> error: error reading returncode of remote command

I wonder about these error messages. Is the remote program to be
called (as configured in MPICH2) still set to a plain rsh? You can
check whether the rsh wrapper is in place by putting a:

sleep 300

or so into the startmpich2.sh script, before the daemons are created,
and then looking at the $TMPDIR of the job on the master node of the
parallel job.
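
Something along these lines (the path below assumes the default queue
tmpdir of /tmp and is only an example):

  # in startmpich2.sh, just before the smpd daemons are started:
  sleep 300

  # while the job is sleeping, log in to the master node of the
  # parallel job and look into the job's $TMPDIR, e.g.:
  ls -l /tmp/<job_id>.1.all.q/
  # the generated "rsh" wrapper and the "machines" file should show up
  # there, and "which rsh" from within the job's environment should
  # point to this wrapper and not to /usr/bin/rsh or ssh.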

-- Reuti
