[GE users] TI of MPICH2 + SGE

Reuti reuti at staff.uni-marburg.de
Fri Jul 11 11:20:52 BST 2008


Hiho,

On 11.07.2008 at 07:00, Sangamesh B wrote:

> I'm performing the tight integration of MPICH2 with Sun Grid
> Engine using the smpd process manager. I followed Reuti's document,
> available at:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> The cluster has two nodes (1 master + 1 compute, AMD64 dual core),
> but the compute node is not working.
> So I'm testing the TI only on the master node. Is this OK?

yes.

> Some environment setup is:
>
> SGE_ROOT=/opt/gridengine
>
> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>
> I've done all the steps mentioned in the document.
>
> But both daemonless and daemonbased tight integrations are not  
> working.
>
> With daemonbased method, the error is:

The thing to realize is how the calling chain of the tools should look:

- MPICH2 will call rsh at startup when you set "MPIEXEC_RSH=rsh;
export MPIEXEC_RSH"
- rsh will be caught by SGE's RSH wrapper
- SGE will start an RSH daemon per "qrsh -inherit ..." on a random port
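
To check the first two links of this chain from inside a job, you can ask the shell which executable the launcher name resolves to. This is only a sketch: resolve_launcher is a made-up helper name, and it assumes the PE start script has placed the wrapper in a directory that comes first in PATH (the job's $TMPDIR in the setups I know).

```shell
#!/bin/sh
# Sketch only: resolve_launcher is a hypothetical helper, not part of SGE
# or MPICH2.

# Print the full path the shell would execute for the given command name.
# Returns non-zero if the name cannot be found in PATH.
resolve_launcher() {
    command -v "$1"
}

# Usage inside the job script:
#   MPIEXEC_RSH=rsh; export MPIEXEC_RSH
#   resolve_launcher "$MPIEXEC_RSH"
# In a working tight integration this should print the path of the wrapper,
# not the system's own rsh binary.
```

If this prints the system rsh, the wrapper is not being caught and MPICH2 will bypass SGE altogether.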

If you need to use SSH at all, you can have it this way:

- configure SGE to use SSH instead of RSH
(http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
- MPICH2 will call rsh
- rsh will be caught by SGE's RSH wrapper
- SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
(This means you could even instruct MPICH2 to call "blabla" and create a
wrapper for "blabla"; at this stage it's just a name which could be
set to anything.)
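
To illustrate that the name is arbitrary, here is a rough sketch of what a start script could do to offer the wrapper under any name. link_wrapper is a made-up helper, and the paths in the usage comment are assumptions about a stock installation, so please adjust them to yours.

```shell
#!/bin/sh
# Sketch only: link_wrapper is a hypothetical helper.

# Make an existing wrapper script available under an arbitrary name.
#   $1 = path to the wrapper script
#   $2 = command name MPICH2 will be told to call
#   $3 = directory that is searched first in the job's PATH
link_wrapper() {
    ln -s "$1" "$3/$2"
}

# Usage in the PE start script (paths are assumptions):
#   link_wrapper "$SGE_ROOT/mpi/rsh" blabla "$TMPDIR"
# and in the job script:
#   MPIEXEC_RSH=blabla; export MPIEXEC_RSH
```

Whatever name you pick, the wrapper behind it is what hands the call over to "qrsh -inherit ...".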

Or:

- configure SGE to use SSH instead of RSH
(http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
- configure the start_proc_args of the PE to create an SSH wrapper
instead of an RSH wrapper
- MPICH2 will call ssh
- ssh will be caught by SGE's SSH wrapper
- SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
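
Whether SGE is really set up for SSH can be read from the cluster configuration. A small sketch, assuming the howto's approach of pointing rsh_command and rsh_daemon at ssh and sshd; both helper names are made up, and the exact paths on your system may differ.

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical helpers that read an SGE
# configuration dump (e.g. the output of "qconf -sconf") on stdin.

# Print the rsh_command and rsh_daemon settings as name=value lines.
show_rsh_settings() {
    awk '$1 == "rsh_command" || $1 == "rsh_daemon" { print $1 "=" $2 }'
}

# Succeed only if rsh_command ends in "ssh" and rsh_daemon ends in "sshd".
uses_ssh() {
    show_rsh_settings | awk -F= '
        $1 == "rsh_command" && $2 ~ /ssh$/  { cmd = 1 }
        $1 == "rsh_daemon"  && $2 ~ /sshd$/ { dmn = 1 }
        END { exit !(cmd && dmn) }'
}

# Usage:
#   qconf -sconf | show_rsh_settings
#   qconf -sconf | uses_ssh && echo "qrsh -inherit will use SSH"
```

If the two settings don't match (one on ssh, the other still on the RSH daemon), "qrsh -inherit ..." cannot connect.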

In all cases it will start on a random port. In none of the cases is
there any need to have an rshd or sshd running all the time; SGE
will start them for you.

Question: is there any firewall blocking the traffic on certain ports
(which shouldn't prevent a local call anyway), or some setting in
/etc/hosts.allow or /etc/hosts.deny?
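
For the second part of the question, a quick way to see what is actually active in those files is to strip comments and blank lines. Again just a sketch; active_rules is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: active_rules is a hypothetical helper.

# Print the active (non-comment, non-blank) lines of each file given,
# prefixed with the file name. Unreadable or missing files are skipped.
active_rules() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        awk -v f="$f" '!/^[ \t]*(#|$)/ { print f ": " $0 }' "$f"
    done
}

# Usage:
#   active_rules /etc/hosts.allow /etc/hosts.deny
```

No output means neither file restricts anything. For the firewall part, "iptables -L -n" run as root lists the packet filter rules on a Linux node.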

-- Reuti


> [sangamesh@test progs]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@test.local               BIP   0/4       0.07     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     224 0.00000 mpich2_d_s sangamesh    qw    07/11/2008 10:18:48     3
> [sangamesh@test progs]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@test.local               BIP   3/4       0.07     lx26-amd64
>     224 0.55500 mpich2_d_s sangamesh    dr    07/11/2008 10:18:54     3
> [sangamesh@test progs]
>
>
>
> [sangamesh@test progs]$ cat ERR.224
> poll: protocol failure in circuit setup
> [sangamesh@test progs]$ cat OUT.224
> -catch_rsh /opt/gridengine/default/spool/test/active_jobs/224.1/pe_hostfile /opt/libs/mpi/mpich2/1.0.7/smpd
> SANGAMESH MPICH2_ROOT /opt/libs/mpi/mpich2/1.0.7/smpd
> SANGAMESH: ARCHITECTURE: lx26-amd64
> test
> test
> test
> startmpich2.sh: check for smpd daemons (1 of 10)
>  SANGAMESH 0
> /opt/gridengine/bin/lx26-amd64/qrsh -inherit test /opt/libs/mpi/mpich2/1.0.7/smpd/bin/smpd -port 20224 -d 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> startmpich2.sh: check for smpd daemons (2 of 10)
>  SANGAMESH 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> startmpich2.sh: check for smpd daemons (3 of 10)
>  SANGAMESH 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> startmpich2.sh: check for smpd daemons (4 of 10)
>  SANGAMESH 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> startmpich2.sh: check for smpd daemons (5 of 10)
>  SANGAMESH 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> startmpich2.sh: check for smpd daemons (6 of 10)
>  SANGAMESH 0
> startmpich2.sh: missing smpd on test
>  SANGAMESH 0 second
> [sangamesh@test progs]$
>
> By tracing the $SGE_ROOT/mpich2_smpd/startmpich2.sh script, I found
> that it's not able to launch smpd.
>
> And,
>
> [sangamesh@test progs]$ cat /etc/hosts
> #
> # Do NOT Edit (generated by dbreport)
> #
> 127.0.0.1       localhost.localdomain   localhost
> 10.1.1.1        test.local test # originally frontend-0-0
> 10.1.1.254      compute-0-0.local compute-0-0 c0-0
> 10.129.150.45   test.locuzcluster.org
>
> If I ssh to the same node, it doesn't prompt for a password:
>
> [sangamesh@test progs]$ ssh 10.1.1.1
> Last login: Fri Jul 11 09:40:02 2008 from 10.129.150.63
> Rocks Frontend Node - Locuzcluster Cluster
> Rocks 4.3 (Mars Hill)
> Profile built 11:06 17-Apr-2008
>
> Kickstarted 17:09 17-Apr-2008
> [sangamesh@test ~]$
>
>
> If I use rsh:
>
> [sangamesh@test progs]$ rsh 10.1.1.1
> connect to address 10.1.1.1: Connection refused
> Trying krb4 rlogin...
> connect to address 10.1.1.1: Connection refused
> trying normal rlogin (/usr/bin/rlogin)
> Last login: Fri Jul 11 10:16:56 from test.local
> Rocks Frontend Node - Locuzcluster Cluster
> Rocks 4.3 (Mars Hill)
> Profile built 11:06 17-Apr-2008
>
> Kickstarted 17:09 17-Apr-2008
> [sangamesh@test ~]$
>
> I can't figure out the cause of the error.
>
> So can anyone on the list help me resolve the issue?
>
> Thank you,
> Sangamesh
>
>

