[GE users] TI of MPICH2 + SGE

Sangamesh B forum.san at gmail.com
Fri Jul 11 06:00:46 BST 2008



Hello all,

I'm setting up the tight integration of MPICH2 with Sun Grid Engine
using the smpd process manager. I followed Reuti's howto, available at:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

The cluster has two nodes (1 master + 1 compute, AMD64 dual core), but the
compute node is not working, so I'm testing the TI on the master node only.
Is this OK?

The relevant environment settings are:

SGE_ROOT=/opt/gridengine

MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd

I've done all the steps mentioned in the document.

But neither the daemonless nor the daemon-based tight integration works.

With the daemon-based method, the error is:

[sangamesh@test progs]$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@test.local               BIP   0/4       0.07     lx26-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    224 0.00000 mpich2_d_s sangamesh    qw    07/11/2008 10:18:48     3
[sangamesh@test progs]$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@test.local               BIP   3/4       0.07     lx26-amd64
    224 0.55500 mpich2_d_s sangamesh    dr    07/11/2008 10:18:54     3
[sangamesh@test progs]



[sangamesh@test progs]$ cat ERR.224
poll: protocol failure in circuit setup
[sangamesh@test progs]$ cat OUT.224
-catch_rsh /opt/gridengine/default/spool/test/active_jobs/224.1/pe_hostfile
/opt/libs/mpi/mpich2/1.0.7/smpd
SANGAMESH MPICH2_ROOT /opt/libs/mpi/mpich2/1.0.7/smpd
SANGAMESH: ARCHITECTURE: lx26-amd64
test
test
test
startmpich2.sh: check for smpd daemons (1 of 10)
 SANGAMESH 0
/opt/gridengine/bin/lx26-amd64/qrsh -inherit test
/opt/libs/mpi/mpich2/1.0.7/smpd/bin/smpd -port 20224 -d 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
startmpich2.sh: check for smpd daemons (2 of 10)
 SANGAMESH 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
startmpich2.sh: check for smpd daemons (3 of 10)
 SANGAMESH 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
startmpich2.sh: check for smpd daemons (4 of 10)
 SANGAMESH 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
startmpich2.sh: check for smpd daemons (5 of 10)
 SANGAMESH 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
startmpich2.sh: check for smpd daemons (6 of 10)
 SANGAMESH 0
startmpich2.sh: missing smpd on test
 SANGAMESH 0 second
[sangamesh@test progs]$

By tracing the $SGE_ROOT/mpich2_smpd/startmpich2.sh script, I found that it
is not able to launch smpd.
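
The port in the output above (20224) equals 20000 plus the job ID (224). A minimal sketch of that relationship follows. It assumes the start script derives the port as the job ID modulo 5000 plus 20000, which is consistent with this log but not confirmed by it. The function name smpd_port is hypothetical.

```shell
#!/bin/sh
# ASSUMPTION: the start script computes the smpd port as
# JOB_ID % 5000 + 20000. This matches job 224 -> port 20224 in the
# log above, but the formula itself is not shown in this post.
# smpd_port is a hypothetical helper name, not part of SGE or MPICH2.
smpd_port() {
    job_id=$1
    echo $(( job_id % 5000 + 20000 ))
}
```

Calling `smpd_port 224` gives the port to look for when checking whether an smpd process for job 224 is listening on the node.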

And here is the hosts file:

[sangamesh@test progs]$ cat /etc/hosts
#
# Do NOT Edit (generated by dbreport)
#
127.0.0.1       localhost.localdomain   localhost
10.1.1.1        test.local test # originally frontend-0-0
10.1.1.254      compute-0-0.local compute-0-0 c0-0
10.129.150.45   test.locuzcluster.org
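
To see at a glance which names the hosts file above assigns to the node's address, a small sketch like the following may help. The function name names_for_ip is hypothetical, and the sketch only reads a hosts-format file; it does not query DNS or any other name service.

```shell
#!/bin/sh
# List every name that a hosts-format file assigns to one IP address.
# Usage: names_for_ip IP [FILE]   (FILE defaults to /etc/hosts)
# names_for_ip is a hypothetical helper name used for illustration.
names_for_ip() {
    ip=$1
    file=${2:-/etc/hosts}
    awk -v ip="$ip" '
        { sub(/#.*/, "") }                          # strip comments
        $1 == ip { for (i = 2; i <= NF; i++) print $i }
    ' "$file"
}
```

Run against the file shown above, `names_for_ip 10.1.1.1` prints test.local and test, one per line, while test.locuzcluster.org belongs to the separate address 10.129.150.45.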

If I ssh to the same node, it doesn't prompt for a password:

[sangamesh@test progs]$ ssh 10.1.1.1
Last login: Fri Jul 11 09:40:02 2008 from 10.129.150.63
Rocks Frontend Node - Locuzcluster Cluster
Rocks 4.3 (Mars Hill)
Profile built 11:06 17-Apr-2008

Kickstarted 17:09 17-Apr-2008
[sangamesh@test ~]$


If I use rsh:

[sangamesh@test progs]$ rsh 10.1.1.1
connect to address 10.1.1.1: Connection refused
Trying krb4 rlogin...
connect to address 10.1.1.1: Connection refused
trying normal rlogin (/usr/bin/rlogin)
Last login: Fri Jul 11 10:16:56 from test.local
Rocks Frontend Node - Locuzcluster Cluster
Rocks 4.3 (Mars Hill)
Profile built 11:06 17-Apr-2008

Kickstarted 17:09 17-Apr-2008
[sangamesh@test ~]$

I can't work out the cause of the error.

Can anyone on the list help me resolve this issue?

Thank you,
Sangamesh


