[GE users] TI of MPICH2 + SGE

Sangamesh B forum.san at gmail.com
Sat Oct 4 12:37:03 BST 2008



Dear Reuti,

    I'm not sure where to put the sleep 300.

There are two C programs (mpihello.c & start_mpich2.c) and two shell
scripts (startmpich2.sh and the SGE job script).

As you mentioned sleep 300 (not system("sleep 300")), I put it in the SGE
job script.

But the job couldn't execute, and the queues went into an error state.
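
Did you mean a placement like this inside startmpich2.sh itself, rather than
in the job script? A rough sketch only, assuming the script layout from your
howto:

    # /opt/gridengine/mpich2_smpd/startmpich2.sh
    # ... the rsh wrapper and the machines file are set up in $TMPDIR ...
    sleep 300    # hold the pe_start phase here, so $TMPDIR can be inspected
    # ... the smpd daemons are started after this point ...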

[san at locuzcluster mpich2_smpd]$ qsub -pe mpi_s_db 4 mpi_s_db.sh
Your job 23 ("HELLO") has been submitted

[san at locuzcluster mpich2_smpd]$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   2/4       0.00     lx26-amd64
     23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
----------------------------------------------------------------------------
all.q at locuzcluster.org         BIP   2/4       0.01     lx26-amd64
     23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
----------------------------------------------------------------------------



[san at locuzcluster mpich2_smpd]$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
----------------------------------------------------------------------------
all.q at locuzcluster.org         BIP   0/4       0.03     lx26-amd64    E
----------------------------------------------------------------------------


[san at locuzcluster mpich2_smpd]$ qstat -j 23
==============================================================


....

parallel environment:  mpi_s_db range: 4
error reason    1:          10/04/2008 16:44:56 [400:4819]: exit_status of pe_start = 1
                1:          10/04/2008 16:45:56 [400:6650]: exit_status of pe_start = 1
scheduling info:            queue instance "all.q at compute-0-0.local" dropped because it is temporarily not available
                            queue instance "all.q at locuzcluster.org" dropped because it is temporarily not available
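
If I follow the naming seen with mpi_s_db.sh.pe181 further below, the output
of the failing pe_start should also be in the PE output/error files of this
job in my home directory, e.g.:

    $ cat HELLO.po23    # stdout of the PE start/stop procedures
    $ cat HELLO.pe23    # stderr of the PE start/stop procedures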

Thank you,
Sangamesh


On Sat, Oct 4, 2008 at 12:17 AM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Hi,
>
> On 01.10.2008 at 07:43, Sangamesh B wrote:
>
>
>  On Fri, Jul 11, 2008 at 3:50 PM, Reuti <reuti at staff.uni-marburg.de>
>> wrote:
>> Hiho,
>>
>> On 11.07.2008 at 07:00, Sangamesh B wrote:
>>
>>
>>     I'm performing the tight integration of MPICH2 with Sun Grid Engine
>> using the smpd process manager, following Reuti's document available at:
>>
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>
>> The cluster has two nodes (1 master + 1 compute, AMD64 dual core). But the
>> compute node is not working, so I'm testing the TI only on the master node.
>> Is this ok?
>>
>> yes.
>>
>>
>> Some environment setup is:
>>
>> SGE_ROOT=/opt/gridengine
>>
>> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>>
>> I've done all the steps mentioned in the document.
>>
>> But neither the daemonless nor the daemon-based tight integration is working.
>>
>> With the daemon-based method, the error is:
>>
>> The thing to realize is the calling chain of the tools, as it should be:
>>
>> - MPICH2 will call rsh when started, provided you set "MPIEXEC_RSH=rsh;
>> export MPIEXEC_RSH"
>> - rsh will be caught by SGE's RSH wrapper
>> - SGE will start an RSH daemon per "qrsh -inherit ..." on a random port
>>
>> If you need to use SSH at all, you can have it this way:
>>
>> - configure SGE to use SSH instead of RSH
>> (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
>> - MPICH2 will call rsh
>> - rsh will be caught by SGE's RSH wrapper
>> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>> (meaning you could even instruct MPICH2 to call "blabla" and create a
>> wrapper for "blabla" - at this stage it's just a name which could be set
>> to anything)
>>
>> Or:
>>
>> - configure SGE to use SSH instead of RSH
>> (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
>> - configure the start_proc_args of the PE to create an SSH wrapper instead
>> of an RSH wrapper
>> - MPICH2 will call ssh
>> - ssh will be caught by SGE's SSH wrapper
>> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>>
>> In all cases it will start on a random port. In none of the cases is there
>> any need to have an rshd or sshd running all the time. SGE will start
>> them for you.
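>>
>> A minimal sketch of what such a wrapper in the job's $TMPDIR boils down to
>> (the real one created by the start_proc_args of the howto does more option
>> handling):
>>
>>   #!/bin/sh
>>   # catch the "rsh <host> <command>" call and hand it to SGE instead
>>   host=$1
>>   shift
>>   exec qrsh -inherit -nostdin "$host" "$@"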
>>
>> Question: is there any firewall blocking the traffic on certain ports
>> (shouldn't prevent a local call anyways), or some setting in
>> /etc/hosts.allow or /etc/hosts.deny?
>>
>> -- Reuti
>> Hi Reuti,
>>
>> Thanks for the reply.
>>
>> Now rsh is working, and the smpd daemonless tight integration is also done.
>>
>> But I'm still facing a problem with the smpd daemon-based TI.
>>
>> Details:
>>
>> # qconf -sp mpich2_smpd_DB
>> pe_name           mpich2_smpd_DB
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>>                   $pe_hostfile /opt/mpich2/gnu/smpd
>> stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>>                  /opt/mpich2/gnu/smpd
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>>
>> $ cat mpi_s_db.sh
>> #!/bin/sh
>>
>> #export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
>>
>> port=$((JOB_ID % 5000 + 20000))
>>
>> #mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port
>> /home/reuti/mpihello
>>
>> /opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines
>> -port $port /home/san/mpich2_smpd/hellompi
>>
>> exit 0
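>>
>> (For job 181 the port line works out to 181 % 5000 + 20000 = 20181, so the
>> smpd port stays in the 20000-24999 range and normally differs between
>> concurrently running jobs.)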
>>
>> $ qsub -pe mpich2_smpd_DB 8 mpi_s_db.sh
>> Your job 181 ("mpi_s_db.sh") has been submitted
>>
>>
>> $ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>>
>> ----------------------------------------------------------------------------
>> all.q at compute-0-0.local        BIP   0/4       0.00     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> all.q at locuzcluster.local       BIP   0/4       0.02     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> test.q at compute-0-0.local       BIP   0/4       0.00     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>
>> ############################################################################
>>    181 0.00000 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>>
>>
>>
>> $ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>>
>> ----------------------------------------------------------------------------
>> all.q at compute-0-0.local        BIP   4/4       0.00     lx26-amd64
>>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
>>
>> ----------------------------------------------------------------------------
>> all.q at locuzcluster.local       BIP   4/4       0.02     lx26-amd64
>>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
>>
>> ----------------------------------------------------------------------------
>> test.q at compute-0-0.local       BIP   0/4       0.00     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>
>>
>> $ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>>
>> ----------------------------------------------------------------------------
>> all.q at compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
>>
>> ----------------------------------------------------------------------------
>> all.q at locuzcluster.local       BIP   0/4       0.02     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> test.q at compute-0-0.local       BIP   0/4       0.01     lx26-amd64
>>
>> ----------------------------------------------------------------------------
>> test.q at locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>>
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>
>> ############################################################################
>>    181 0.60500 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>>
>>
>> ERROR INFO
>>
>> $qstat  -j  181
>> ..
>>
>> parallel environment:  mpich2_smpd_DB range: 8
>> error reason    1:          09/30/2008 16:58:49 [400:868]: exit_status of pe_start = 1
>> scheduling info:            queue instance "all.q at compute-0-0.local" dropped because it is temporarily not available
>>                            cannot run in queue "test.q" because PE "mpich2_smpd_DB" is not in pe list
>>                            cannot run in PE "mpich2_smpd_DB" because it only offers 4 slots
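>>
>> (The last two lines are only scheduling hints: test.q does not have this PE
>> in its pe_list - something like "qconf -aattr queue pe_list mpich2_smpd_DB
>> test.q" would add it there, if that were wanted. The actual failure is the
>> pe_start exit status above.)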
>>
>> $ cat mpi_s_db.sh.pe181
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Permission denied (publickey,gssapi-with-mic,password).
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Permission denied (publickey,gssapi-with-mic,password).
>> error: error reading returncode of remote command
>> error: error reading returncode of remote command
>>
>
> I wonder about these error messages. Is the remote program to be called in
> the C program still set to a plain rsh? You can check whether the rsh wrapper
> is in place by putting a:
>
> sleep 300
>
> or so in the startmpich2.sh script, before the creation of the daemons, and
> then checking the $TMPDIR of the job on the master node of the parallel job.
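>
> For example, while the job is held by the sleep (assuming the default
> tmpdir of /tmp, so $TMPDIR is /tmp/<job_id>.<task_id>.<queue name>):
>
>   ls  /tmp/<job_id>.1.all.q        # should contain the machines file and the rsh wrapper
>   cat /tmp/<job_id>.1.all.q/rsh    # should be the wrapper, not a plain rsh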
>
> -- Reuti
>
>
>


