[GE users] TI of MPICH2 + SGE

Sangamesh B forum.san at gmail.com
Wed Oct 1 06:43:11 BST 2008



On Fri, Jul 11, 2008 at 3:50 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Hiho,
>
> Am 11.07.2008 um 07:00 schrieb Sangamesh B:
>
>> I'm performing the tight integration of MPICH2 with Sun Grid Engine using
>> the smpd process manager, following Reuti's document available at:
>>
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>
>> The cluster has two nodes (1 master + 1 compute, AMD64 dual core), but the
>> compute node is not working.
>> So I'm testing the TI only on the master node. Is this ok?
>>
>
> yes.
>
>> Some environment setup is:
>>
>> SGE_ROOT=/opt/gridengine
>>
>> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>>
>> I've done all the steps mentioned in the document.
>>
>> But neither the daemonless nor the daemon-based tight integration is working.
>>
>> With the daemon-based method, the error is:
>>
>
> The thing to realize is the calling chain of the tools, i.e. how it should work:
>
> - MPICH2 will call rsh when started, provided you set "MPIEXEC_RSH=rsh;
> export MPIEXEC_RSH"
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an RSH daemon per "qrsh -inherit ..." on a random port
>
> If you need to use SSH at all, you can have it this way:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - MPICH2 will call rsh
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
> (meaning you could even instruct MPICH2 to call "blabla" and create a
> wrapper for "blabla" - at this stage it's just a name which could be set to
> anything)
>
> Or:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - configure the start_proc_args of the PE to create an SSH wrapper instead
> of an RSH wrapper
> - MPICH2 will call ssh
> - ssh will be caught by SGE's SSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>
> In all cases it will start on a random port. In none of the cases is there
> any need to have an rshd or sshd running all the time. SGE will start them
> for you.
>
> Question: is there any firewall blocking the traffic on certain ports
> (shouldn't prevent a local call anyways), or some setting in
> /etc/hosts.allow or /etc/hosts.deny?
>
> -- Reuti
>
Hi Reuti,

Thanks for the reply.

Now rsh is working, and smpd Daemonless Tight Integration is also done.
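
For the daemonless case the job script follows the howto, roughly like this
(just a sketch; the -rsh/-nopm mpiexec flags are what I understand the smpd
mpiexec uses to spawn the processes itself via rsh, please check against the
howto):

#!/bin/sh
# Daemonless smpd startup (sketch): with MPIEXEC_RSH=rsh, mpiexec starts the
# ranks itself via rsh instead of contacting smpd daemons, and every rsh call
# is caught by SGE's wrapper and turned into "qrsh -inherit ...".
MPIEXEC_RSH=rsh; export MPIEXEC_RSH
/opt/mpich2/gnu/smpd/bin/mpiexec -rsh -nopm -n $NSLOTS \
    -machinefile $TMPDIR/machines /home/san/mpich2_smpd/hellompi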

But I'm facing a problem with the smpd daemon-based TI.

Details:

# qconf -sp mpich2_smpd_DB
pe_name           mpich2_smpd_DB
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh  \
                  $pe_hostfile /opt/mpich2/gnu/smpd
stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
                  /opt/mpich2/gnu/smpd
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
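
As I understand it, the -catch_rsh option makes startmpich2.sh place an rsh
wrapper first in the job's $TMPDIR, so that MPICH2's rsh calls end up as
"qrsh -inherit". A simplified sketch of such a wrapper (not the actual one
from the howto):

#!/bin/sh
# rsh wrapper sketch: "rsh <host> <command...>" becomes
# "qrsh -inherit <host> <command...>", so SGE itself starts the remote smpd
# under its control on a random port.
host=$1
shift
exec qrsh -inherit -nostdin "$host" "$@"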


$ cat mpi_s_db.sh
#!/bin/sh

#export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH

port=$((JOB_ID % 5000 + 20000))

#mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/reuti/mpihello

/opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines \
    -port $port /home/san/mpich2_smpd/hellompi

exit 0
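
For job 181 the port formula gives:

$ echo $((181 % 5000 + 20000))
20181

which matches the "-port 20181" seen in the PE startup output below.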

$ qsub -pe mpich2_smpd_DB 8 mpi_s_db.sh
Your job 181 ("mpi_s_db.sh") has been submitted

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   0/4       0.00     lx26-amd64
----------------------------------------------------------------------------
all.q at locuzcluster.local       BIP   0/4       0.02     lx26-amd64
----------------------------------------------------------------------------
test.q at compute-0-0.local       BIP   0/4       0.00     lx26-amd64
----------------------------------------------------------------------------
test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    181 0.00000 mpi_s_db.s san          qw    09/30/2008 16:57:51     8


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   4/4       0.00     lx26-amd64
    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
----------------------------------------------------------------------------
all.q at locuzcluster.local       BIP   4/4       0.02     lx26-amd64
    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
----------------------------------------------------------------------------
test.q at compute-0-0.local       BIP   0/4       0.00     lx26-amd64
----------------------------------------------------------------------------
test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
----------------------------------------------------------------------------
all.q at locuzcluster.local       BIP   0/4       0.02     lx26-amd64
----------------------------------------------------------------------------
test.q at compute-0-0.local       BIP   0/4       0.01     lx26-amd64
----------------------------------------------------------------------------
test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    181 0.60500 mpi_s_db.s san          qw    09/30/2008 16:57:51     8


ERROR INFO

$ qstat -j 181
..

parallel environment:  mpich2_smpd_DB range: 8
error reason    1:          09/30/2008 16:58:49 [400:868]: exit_status of pe_start = 1
scheduling info:            queue instance "all.q at compute-0-0.local" dropped because it is temporarily not available
                            cannot run in queue "test.q" because PE "mpich2_smpd_DB" is not in pe list
                            cannot run in PE "mpich2_smpd_DB" because it only offers 4 slots

$ cat mpi_s_db.sh.pe181
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
error: error reading returncode of remote command
error: error reading returncode of remote command

$ cat mpi_s_db.sh.po181
-catch_rsh
/opt/gridengine/default/spool/compute-0-0/active_jobs/181.1/pe_hostfile
/opt/mpich2/gnu/smpd
compute-0-0
compute-0-0
compute-0-0
compute-0-0
locuzcluster
locuzcluster
locuzcluster
locuzcluster
startmpich2.sh: check for smpd daemons (1 of 10)
/opt/gridengine/bin/lx26-amd64/qrsh -inherit compute-0-0
/opt/mpich2/gnu/smpd/bin/smpd -port 20181 -d 0
/opt/gridengine/bin/lx26-amd64/qrsh -inherit locuzcluster
/opt/mpich2/gnu/smpd/bin/smpd -port 20181 -d 0
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (2 of 10)
startmpich2.sh: missing smpd on compute-0-0

..
..

startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: got only 0 of 2 nodes, aborting
-catch_rsh /opt/mpich2/gnu/smpd
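
The "Permission denied" messages in the .pe file look like ssh. To see which
remote startup command qrsh -inherit actually uses here (per the qrsh/ssh
howto Reuti linked), one can check the cluster configuration, e.g.:

$ qconf -sconf | egrep 'rsh_command|rsh_daemon|rlogin'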

What's going wrong here?

The version of MPICH2 is 1.0.7.



>
>
>
>  [sangamesh at test progs]$ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>>
>> ----------------------------------------------------------------------------
>> all.q at test.local               BIP   0/4       0.07     lx26-amd64
>>
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>
>> ############################################################################
>>    224 0.00000 mpich2_d_s sangamesh    qw    07/11/2008 10:18:48     3
>> [sangamesh at test progs]$ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>>
>> ----------------------------------------------------------------------------
>> all.q at test.local               BIP   3/4       0.07     lx26-amd64
>>    224 0.55500 mpich2_d_s sangamesh    dr    07/11/2008 10:18:54     3
>> [sangamesh at test progs]
>>
>>
>>
>> [sangamesh at test progs]$ cat ERR.224
>> poll: protocol failure in circuit setup
>> [sangamesh at test progs]$ cat OUT.224
>> -catch_rsh
>> /opt/gridengine/default/spool/test/active_jobs/224.1/pe_hostfile
>> /opt/libs/mpi/mpich2/1.0.7/smpd
>> SANGAMESH MPICH2_ROOT /opt/libs/mpi/mpich2/1.0.7/smpd
>> SANGAMESH: ARCHITECTURE: lx26-amd64
>> test
>> test
>> test
>> startmpich2.sh: check for smpd daemons (1 of 10)
>>  SANGAMESH 0
>> /opt/gridengine/bin/lx26-amd64/qrsh -inherit test
>> /opt/libs/mpi/mpich2/1.0.7/smpd/bin/smpd -port 20224 -d 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> startmpich2.sh: check for smpd daemons (2 of 10)
>>  SANGAMESH 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> startmpich2.sh: check for smpd daemons (3 of 10)
>>  SANGAMESH 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> startmpich2.sh: check for smpd daemons (4 of 10)
>>  SANGAMESH 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> startmpich2.sh: check for smpd daemons (5 of 10)
>>  SANGAMESH 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> startmpich2.sh: check for smpd daemons (6 of 10)
>>  SANGAMESH 0
>> startmpich2.sh: missing smpd on test
>>  SANGAMESH 0 second
>> [sangamesh at test progs]$
>>
>> By tracing the $SGE_ROOT/mpich2_smpd/startmpich2.sh script, I found that
>> it's not able to launch the smpd.
>>
>> And,
>>
>> [sangamesh at test progs]$ cat /etc/hosts
>> #
>> # Do NOT Edit (generated by dbreport)
>> #
>> 127.0.0.1       localhost.localdomain   localhost
>> 10.1.1.1        test.local test # originally frontend-0-0
>> 10.1.1.254      compute-0-0.local compute-0-0 c0-0
>> 10.129.150.45   test.locuzcluster.org
>>
>> If I ssh to the same node, it doesn't prompt for a password:
>>
>> [sangamesh at test progs]$ ssh 10.1.1.1
>> Last login: Fri Jul 11 09:40:02 2008 from 10.129.150.63
>> Rocks Frontend Node - Locuzcluster Cluster
>> Rocks 4.3 (Mars Hill)
>> Profile built 11:06 17-Apr-2008
>>
>> Kickstarted 17:09 17-Apr-2008
>> [sangamesh at test ~]$
>>
>>
>> If I use rsh:
>>
>> [sangamesh at test progs]$ rsh 10.1.1.1
>> connect to address 10.1.1.1: Connection refused
>> Trying krb4 rlogin...
>> connect to address 10.1.1.1: Connection refused
>> trying normal rlogin (/usr/bin/rlogin)
>> Last login: Fri Jul 11 10:16:56 from test.local
>> Rocks Frontend Node - Locuzcluster Cluster
>> Rocks 4.3 (Mars Hill)
>> Profile built 11:06 17-Apr-2008
>>
>> Kickstarted 17:09 17-Apr-2008
>> [sangamesh at test ~]$
>>
>> I can't figure out the cause of the error.
>>
>> So can anyone on the list help me resolve this issue?
>>
>> Thank you,
>> Sangamesh
>>
>>
>>
>


