[GE users] TI of MPICH2 + SGE

Reuti reuti at staff.uni-marburg.de
Sat Oct 4 12:51:44 BST 2008


On 04.10.2008 at 13:37, Sangamesh B wrote:

> Dear Reuti,
>
>     I'm not sure where to put the sleep 300.
>
> There are two C programs (mpihello.c & start_mpich2.c) and two shell
> scripts (startmpich2.sh and sge_job_script).
>
> As you mentioned sleep 300 (not system("sleep 300")), I put it in
> the SGE job script.

No, I meant to put it in startmpich2.sh before the startup of the
daemons, i.e. before the loop near the end of the script. While the PE
startup is sleeping, you can then check the $TMPDIR of the job.
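
Schematically (just a sketch of the howto's script layout; the exact
variable names and smpd options are the ones your startmpich2.sh
already contains):

   # ... $TMPDIR/machines and the rsh wrapper were created above ...

   sleep 300    # <- insert here, then inspect the job's $TMPDIR

   # the loop near the end which starts one smpd per host:
   while read node rest; do
      qrsh -inherit $node "$MPICH2_ROOT/bin/smpd ..." &
   done < $TMPDIR/machines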

-- Reuti

>
>
> But it didn't execute, and the queues went into an error state.
>
> [san@locuzcluster mpich2_smpd]$ qsub -pe mpi_s_db 4 mpi_s_db.sh
> Your job 23 ("HELLO") has been submitted
>
> [san@locuzcluster mpich2_smpd]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   2/4       0.00     lx26-amd64
>      23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   2/4       0.01     lx26-amd64
>      23 0.55500 HELLO      san          r     10/04/2008 16:44:06     2
> ----------------------------------------------------------------------------
>
>
>
> [san@locuzcluster mpich2_smpd]$ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.org         BIP   0/4       0.03     lx26-amd64    E
> ----------------------------------------------------------------------------
>
>
> [san@locuzcluster mpich2_smpd]$ qstat -j 23
> ==============================================================
>
> ....
>
> parallel environment:  mpi_s_db range: 4
> error reason    1:          10/04/2008 16:44:56 [400:4819]: exit_status of pe_start = 1
>                 1:          10/04/2008 16:45:56 [400:6650]: exit_status of pe_start = 1
> scheduling info:            queue instance "all.q@compute-0-0.local" dropped because it is temporarily not available
>                             queue instance "all.q@locuzcluster.org" dropped because it is temporarily not available
>
> Thank you,
> Sangamesh
>
>
> On Sat, Oct 4, 2008 at 12:17 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 01.10.2008 at 07:43, Sangamesh B wrote:
>
>
> On Fri, Jul 11, 2008 at 3:50 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hiho,
>
> On 11.07.2008 at 07:00, Sangamesh B wrote:
>
>
>     I'm performing the tight integration of MPICH2 with Sun Grid
> Engine using the smpd process manager, following Reuti's document
> available at:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> The cluster has two nodes (1 master + 1 compute, AMD64 dual core),
> but the compute node is not working.
> So I'm testing the TI only on the master node. Is this OK?
>
> yes.
>
>
> Some environment setup is:
>
> SGE_ROOT=/opt/gridengine
>
> MPICH2_ROOT=/opt/libs/mpi/mpich2/1.0.7/smpd
>
> I've done all the steps mentioned in the document.
>
> But neither the daemonless nor the daemon-based tight integration is
> working.
>
> With the daemon-based method, the error is:
>
> The thing to realize is how the calling chain of the tools should be:
>
> - MPICH2 will call rsh on startup when you set "MPIEXEC_RSH=rsh;
>   export MPIEXEC_RSH"
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an RSH daemon per "qrsh -inherit ..." on a random port
>
> If you need to use SSH at all, you can have it this way:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - MPICH2 will call rsh
> - rsh will be caught by SGE's RSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>
> (This means you could even instruct MPICH2 to call "blabla" and create
> a wrapper for "blabla" - at this stage it's just a name which could be
> set to anything.)
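>
> For illustration only, such a wrapper could be as minimal as this (a
> sketch; the real wrapper created by the howto's start_proc_args also
> strips rsh-specific options before handing over to SGE):
>
>    #!/bin/sh
>    # hypothetical $TMPDIR/blabla: let SGE perform the remote start;
>    # the first argument is the target host
>    host=$1; shift
>    exec qrsh -inherit $host "$@"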
>
> Or:
>
> - configure SGE to use SSH instead of RSH
>   (http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html)
> - configure the start_proc_args of the PE to create an SSH wrapper
>   instead of an RSH wrapper
> - MPICH2 will call ssh
> - ssh will be caught by SGE's SSH wrapper
> - SGE will start an SSH daemon per "qrsh -inherit ..." on a random port
>
> In all cases it will start on a random port. In none of the cases is
> there a need to have an rshd or sshd running all the time; SGE will
> start them for you.
>
> Question: is there any firewall blocking the traffic on certain ports
> (it shouldn't prevent a local call anyway), or some setting in
> /etc/hosts.allow or /etc/hosts.deny?
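>
> (Just as an illustration of the kind of entry to look for: a blanket
> rule like
>
>    ALL: ALL
>
> in /etc/hosts.deny would block the rshd/sshd that SGE starts, unless
> /etc/hosts.allow contains matching exceptions.)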
>
> -- Reuti
> Hi Reuti,
>
> Thanks for the reply.
>
> Now rsh is working, and the smpd daemonless tight integration is
> also done.
>
> But I'm facing a problem with the smpd daemon-based TI.
>
> Details:
>
> # qconf -sp mpich2_smpd_DB
> pe_name           mpich2_smpd_DB
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>                   $pe_hostfile /opt/mpich2/gnu/smpd
> stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>                   /opt/mpich2/gnu/smpd
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
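>
> (Side note: the PE must also appear in the pe_list of the queue to be
> usable there; one way to check, with hypothetical output:
>
>    $ qconf -sq all.q | grep pe_list
>    pe_list               make mpich2_smpd_DB
>
> This is relevant for the "not in pe list" message further below.)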
>
>
> $ cat mpi_s_db.sh
> #!/bin/sh
>
> #export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
>
> port=$((JOB_ID % 5000 + 20000))
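> # (maps JOB_ID into the range 20000-24999, so concurrent jobs use
> #  distinct smpd ports)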
>
> #mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/reuti/mpihello
>
> /opt/mpich2/gnu/smpd/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/san/mpich2_smpd/hellompi
>
> exit 0
>
> $ qsub -pe mpich2_smpd_DB 8 mpi_s_db.sh
> Your job 181 ("mpi_s_db.sh") has been submitted
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>    181 0.00000 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   4/4       0.00     lx26-amd64
>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   4/4       0.02     lx26-amd64
>    181 0.60500 mpi_s_db.s san          r     09/30/2008 16:57:59     4
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.00     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
>
> $ qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4       0.01     lx26-amd64    E
> ----------------------------------------------------------------------------
> all.q@locuzcluster.local       BIP   0/4       0.02     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@compute-0-0.local       BIP   0/4       0.01     lx26-amd64
> ----------------------------------------------------------------------------
> test.q@locuzcluster.org        BIP   0/4       0.02     lx26-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>    181 0.60500 mpi_s_db.s san          qw    09/30/2008 16:57:51     8
>
>
> ERROR INFO
>
> $ qstat -j 181
> ..
>
> parallel environment:  mpich2_smpd_DB range: 8
> error reason    1:          09/30/2008 16:58:49 [400:868]: exit_status of pe_start = 1
> scheduling info:            queue instance "all.q@compute-0-0.local" dropped because it is temporarily not available
>                             cannot run in queue "test.q" because PE "mpich2_smpd_DB" is not in pe list
>                             cannot run in PE "mpich2_smpd_DB" because it only offers 4 slots
>
> $ cat mpi_s_db.sh.pe181
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> error: error reading returncode of remote command
> error: error reading returncode of remote command
>
> I wonder about these error messages. Is the remote program to be
> called in the C program still set to a plain rsh? You can check
> whether the rsh wrapper is in place by putting a:
>
> sleep 300
>
> or so into the startmpich2.sh script before the creation of the
> daemons, and then checking the $TMPDIR of the job on the master node
> of the parallel job.
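>
> On the master node of the job you should then find something like
> this in $TMPDIR (hypothetical listing; SGE names the directory
> <job_id>.<task_id>.<queue>, and the link target depends on your
> installation):
>
>    $ ls -l /tmp/181.1.all.q
>    lrwxrwxrwx ... rsh -> /opt/gridengine/mpich2_smpd/rsh
>    -rw-r--r-- ... machines
>
> If the rsh wrapper is missing there, the remote starts fall through
> to the system's plain rsh/ssh, which would fit the "Permission
> denied" messages above.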
>
> -- Reuti


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



