[GE users] TI of MPICH2 + SGE

Sangamesh B forum.san at gmail.com
Sat Oct 4 12:40:15 BST 2008


The error and output files contain:

[san@locuzcluster mpich2_smpd]$ cat e_23
Permission denied, please try again.
Permission denied, please try again.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied (publickey,gssapi-with-mic,password).
error: error reading returncode of remote command
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
error: error reading returncode of remote command
error: error reading returncode of remote command
[san@locuzcluster mpich2_smpd]$ cat o_23
-catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/23.1/pe_hostfile /opt/mpich2/gnu/smpd
SANGAMESH lx26-amd64
SANGAMESH /opt/mpich2/gnu/smpd
compute-0-0
compute-0-0
locuzcluster
locuzcluster
startmpich2.sh: check for smpd daemons (1 of 10)
/opt/gridengine/bin/lx26-amd64/qrsh -inherit locuzcluster /opt/mpich2/gnu/smpd/bin/smpd -port 20023 -d 0
/opt/gridengine/bin/lx26-amd64/qrsh -inherit compute-0-0 /opt/mpich2/gnu/smpd/bin/smpd -port 20023 -d 0
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (2 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (3 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (4 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (5 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (6 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (7 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (8 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (9 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (10 of 10)
startmpich2.sh: missing smpd on compute-0-0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: got only 0 of 2 nodes, aborting
-catch_rsh /opt/mpich2/gnu/smpd
-catch_rsh /opt/gridengine/default/spool/locuzcluster/active_jobs/23.1/pe_hostfile /opt/mpich2/gnu/smpd
SANGAMESH ARCHITECTURE = lx26-amd64
SANGAMESH /opt/mpich2/gnu/smpd
locuzcluster
locuzcluster
locuzcluster
locuzcluster
startmpich2.sh: check for smpd daemons (1 of 10)
/opt/gridengine/bin/lx26-amd64/qrsh -inherit locuzcluster /opt/mpich2/gnu/smpd/bin/smpd -port 20023 -d 0
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (2 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (3 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (4 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (5 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (6 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (7 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (8 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (9 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: check for smpd daemons (10 of 10)
startmpich2.sh: missing smpd on locuzcluster
startmpich2.sh: got only 0 of 1 nodes, aborting
-catch_rsh /opt/mpich2/gnu/smpd
[san@locuzcluster mpich2_smpd]$

On Sat, Oct 4, 2008 at 5:07 PM, Sangamesh B <forum.san at gmail.com> wrote:

> Dear Reuti,
>
>     I'm not sure where to put the sleep 300.
>
> There are two C programs (mpihello.c & start_mpich2.c) and two shell
> scripts (startmpich2.sh and the SGE job script).
>
> Since you mentioned sleep 300 (not system("sleep 300")), I put it in the SGE
> job script.
>
> But the job could not run and the queue instances went into an error state.
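>
> For reference, the job script with the sleep added looked roughly like this
> (a sketch based on the script from the earlier mail, with just the sleep
> line added):
>
> #!/bin/sh
> port=$((JOB_ID % 5000 + 20000))
> # the sleep went here, i.e. into the job script itself
> sleep 300
> /opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/san/mpich2_smpd/hellompi
> exit 0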
>
> [san@locuzcluster mpich2_smpd]$ qsub -pe mpi_s_db 4 mpi_s_db.sh
> Your job 23 ("HELLO") has been submitted
>
> [san@locuzcluster mpich2_smpd]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   2/4       0.00     lx26-amd64
>      23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   2/4       0.01     lx26-amd64
>      23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
> ----------------------------------------------------------------------------
>
>
>
> [san@locuzcluster mpich2_smpd]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   0/4       0.03     lx26-amd64    E
> ----------------------------------------------------------------------------
>
>
> [san@locuzcluster mpich2_smpd]$ qstat -j 23
> ==============================================================
>
> ....
>
> parallel environment:  mpi_s_db range: 4
> error reason    1:          10/04/2008 16:44:56 [400:4819]: exit_status of pe_start = 1
>                 1:          10/04/2008 16:45:56 [400:6650]: exit_status of pe_start = 1
> scheduling info:            queue instance "all.q@compute-0-0.local" dropped because it is temporarily not available
>                             queue instance "all.q@locuzcluster.org" dropped because it is temporarily not available
>
> Thank you,
> Sangamesh
>
>
>
> On Sat, Oct 4, 2008 at 12:17 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> On 01.10.2008 at 07:43, Sangamesh B wrote:
>>
>>
>>> On Fri, Jul 11, 2008 at 3:50 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
>>> Hiho,
>>>
>>> On 11.07.2008 at 07:00, Sangamesh B wrote:
>>>
>>>
>>>     I'm performing the tight integration of MPICH2 with Sun Grid Engine
>>> using the smpd process manager, following Reuti's document available at:
>>>
>>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>>
>>> The cluster has two nodes (1 master + 1 compute, AMD64 dual core), but the
>>> compute node is not working.
>>> So I'm testing the TI only on the master node. Is this OK?
>>>
>>> yes.
>>>
>>>
>>> Some of the relevant environment settings:
>>>
>>> SGE_ROOT=/opt/gridengine
>>>
>>> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>>>
>>> I've done all the steps mentioned in the document.
>>>
>>> But neither the daemonless nor the daemon-based tight integration is working.
>>>
>>> With the daemon-based method, the error is:
>>>
>>> The thing to realize is the calling chain of the tools, i.e. how it should work:
>>>
>>> - MPICH2 will call rsh when started, once you set "MPIEXEC_RSH=rsh; export
>>> MPIEXEC_RSH"
>>> - rsh will be caught by SGE's RSH wrapper
>>> - SGE will start an RSH daemon per "qrsh -inherit ..." on a random port
>>>
>>> If you need to use SSH at all, you can have it this way:
>>>
>>> - configure SGE to use SSH instead of RSH
>>> (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
>>> - MPICH2 will call rsh
>>> - rsh will be caught by SGE's RSH wrapper
>>> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>>> (meaning: you could even instruct MPICH2 to call "blabla" and create a
>>> wrapper for "blabla" - at this stage it's just a name which could be set to
>>> anything)
>>>
>>> Or:
>>>
>>> - configure SGE to use SSH instead of RSH
>>> (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
>>> - configure the start_proc_args of the PE to create an SSH wrapper
>>> instead of an RSH wrapper
>>> - MPICH2 will call ssh
>>> - ssh will be caught by SGE's SSH wrapper
>>> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>>>
>>> In all cases it will start on a random port. In none of these cases is there
>>> any need to have an rshd or sshd running all the time; SGE will start
>>> them for you.
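>>>
>>> (To illustrate the mechanism with a stripped-down sketch - this is not the
>>> wrapper from the howto verbatim: whatever gets created in $TMPDIR under the
>>> name rsh, ssh or "blabla" is essentially a script like
>>>
>>> #!/bin/sh
>>> # minimal wrapper sketch: turn "rsh <host> <command>" into an SGE call
>>> host=$1
>>> shift
>>> exec $SGE_ROOT/bin/$($SGE_ROOT/util/arch)/qrsh -inherit "$host" "$@"
>>>
>>> so the remote start always goes through qrsh and stays under SGE's control.
>>> The wrapper shipped with SGE under $SGE_ROOT/mpi handles a few more rsh
>>> options, but the principle is the same.)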
>>>
>>> Question: is there any firewall blocking the traffic on certain ports
>>> (shouldn't prevent a local call anyways), or some setting in
>>> /etc/hosts.allow or /etc/hosts.deny?
>>>
>>> -- Reuti
>>>
>>> Hi Reuti,
>>>
>>> Thanks for the reply.
>>>
>>> Now rsh is working, and the smpd daemonless tight integration is also done.
>>>
>>> But I'm facing a problem with the smpd daemon-based TI.
>>>
>>> Details:
>>>
>>> # qconf -sp mpich2_smpd_DB
>>> pe_name           mpich2_smpd_DB
>>> slots             999
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>>>                   $pe_hostfile /opt/mpich2/gnu/smpd
>>> stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>>>                  /opt/mpich2/gnu/smpd
>>> allocation_rule   $round_robin
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>> urgency_slots     min
>>>
>>>
>>> $ cat mpi_s_db.sh
>>> #!/bin/sh
>>>
>>> #export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
>>>
>>> port=$((JOB_ID % 5000 + 20000))
>>>
>>> #mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/reuti/mpihello
>>>
>>> /opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/san/mpich2_smpd/hellompi
>>>
>>> exit 0
>>>
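>>> (As a worked example of the port line above: for job 181 this gives
>>> 181 % 5000 + 20000 = 20181, so each job ends up with its own smpd port.)
>>>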
>>> $ qsub -pe mpich2_smpd_DB 8 mpi_s_db.sh
>>> Your job 181 ("mpi_s_db.sh") has been submitted
>>>
>>>
>>> $ qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>    181 0.00000 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>>>
>>>
>>>
>>> $ qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q@compute-0-0.local        BIP   4/4       0.00     lx26-amd64
>>>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
>>> ----------------------------------------------------------------------------
>>> all.q@locuzcluster.local       BIP   4/4       0.02     lx26-amd64
>>>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
>>> ----------------------------------------------------------------------------
>>> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>>
>>>
>>> $ qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q@compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
>>> ----------------------------------------------------------------------------
>>> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> test.q@compute-0-0.local       BIP   0/4       0.01     lx26-amd64
>>> ----------------------------------------------------------------------------
>>> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>>
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>    181 0.60500 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>>>
>>>
>>> ERROR INFO
>>>
>>> $ qstat -j 181
>>> ..
>>>
>>> parallel environment:  mpich2_smpd_DB range: 8
>>> error reason    1:          09/30/2008 16:58:49 [400:868]: exit_status of pe_start = 1
>>> scheduling info:            queue instance "all.q@compute-0-0.local" dropped because it is temporarily not available
>>>                             cannot run in queue "test.q" because PE "mpich2_smpd_DB" is not in pe list
>>>                             cannot run in PE "mpich2_smpd_DB" because it only offers 4 slots
>>>
>>> $ cat mpi_s_db.sh.pe181
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Permission denied (publickey,gssapi-with-mic,password).
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Permission denied (publickey,gssapi-with-mic,password).
>>> error: error reading returncode of remote command
>>> error: error reading returncode of remote command
>>>
>>
>> I wonder about these error messages. Is the remote program to be called in
>> the C program still set to a plain rsh? You can check whether the
>> rsh wrapper is in place by putting a:
>>
>> sleep 300
>>
>> or so in the startmpich2.sh script before the creation of the
>> daemons, and then checking the job's $TMPDIR on the master node of the
>> parallel job.
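>>
>> As an illustration only (the exact spot depends on your copy of
>> startmpich2.sh, and the $TMPDIR path will differ per job and queue):
>>
>> # in startmpich2.sh, just before the smpd daemons are started:
>> sleep 300
>>
>> # then, while the job sleeps, on the master node of the parallel job,
>> # inspect the job's $TMPDIR (something like /tmp/23.1.all.q):
>> ls -l /tmp/23.1.all.q
>> cat /tmp/23.1.all.q/rsh   # should be the wrapper that calls "qrsh -inherit ..."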
>>
>> -- Reuti
>>
>>
>>
>


