[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

k_clevenger kclevenger at coh.org
Wed Dec 16 21:31:23 GMT 2009


Thanks for responding; this problem is somewhat perplexing.

> On 16.12.2009 at 20:16, k_clevenger wrote:
> 
> > When a job is submitted, all the tasks execute on only one node. If
> > I submit the same job via mpiexec on the command line, the tasks
> > are dispersed correctly.
> >
> > I have reviewed "OpenMPI job stays on one node", "Using ssh with
> > qrsh and qlogin", the SGE sections on the OpenMPI site, etc., with
> > no solution.
> >
> > Nodes: 16 core x86_64 blades
> > OS (all): CentOS 5.4 x86_64
> > SGE Version: 6_2u4
> > OpenMPI Version: 1.3.3 compiled with --with-sge
> > ompi_info: MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
> > IPTables off
> >
> > PE:
> > pe_name            openmpi
> > slots              32
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /opt/sge-6_2u4/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > stop_proc_args     /opt/sge-6_2u4/mpi/stopmpi.sh
> 
> Both entries can be /bin/true. The defined procedures don't hurt, but  
> aren't necessary for a tight Open MPI integration.

OK
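
For reference, a minimal tightly-integrated PE along those lines might
look like this (a sketch of the configuration above with the start/stop
procedures replaced by /bin/true):

pe_name            openmpi
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

The gridengine support in the Open MPI build itself can also be
double-checked with `ompi_info | grep gridengine`.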

> 
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > SGE script:
> > #!/bin/sh
> > #$ -pe openmpi 22
> > #$ -N Para1
> > #$ -cwd
> > #$ -j y
> > #$ -V
> > #
> > mpiexec -np $NSLOTS -machinefile $TMPDIR/machines ./hello_c
> 
> You can leave "-machinefile $TMPDIR/machines" out.
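
That is, with a tight integration the job script would reduce to
something like this sketch (Open MPI should pick up the granted hosts
from SGE itself):

#!/bin/sh
#$ -pe openmpi 22
#$ -N Para1
#$ -cwd
#$ -j y
#$ -V
#
# No -machinefile needed: Open MPI's gridengine support reads the
# granted slots from the SGE environment ($PE_HOSTFILE).
mpiexec -np $NSLOTS ./hello_c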

When I do leave it out, I get:

error: commlib error: got read error (closing "sunnode00.coh.org/execd/1")
error: executing task of job 262 failed: failed sending task to execd@sunnode00.coh.org: can't find connection
--------------------------------------------------------------------------
A daemon (pid 3692) died unexpectedly with status 1 while attempting to launch so we are aborting.
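
(To see which launcher Open MPI actually selects under SGE, its plm
framework can be made verbose -- a hypothetical debug invocation for
the job script, not something from the thread:)

# With tight integration the "rsh" launcher should be chosen, and it
# should start the remote daemons via "qrsh -inherit".
mpiexec --mca plm_base_verbose 10 -np $NSLOTS ./hello_c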

> 
> When you put a "sleep 30" in the jobscript and check the allocation
> during execution with `qstat -g t`: were slots granted on both machines?
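
The test job was presumably along these lines (a sketch, not the
actual script from the thread):

#!/bin/sh
#$ -pe openmpi 22
#$ -cwd
# Hold the granted slots for a while so the allocation can be
# inspected with qstat -g t.
sleep 30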

job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
---------------------------------------------------------------------------------------
    262 0.60500 Job        kclevenger   r     12/16/2009 13:24:51 all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
                                                                  all.q@sunnode00.coh.org        SLAVE
    262 0.60500 Job        kclevenger   r     12/16/2009 13:24:51 all.q@sunnode01.coh.org        MASTER
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE
                                                                  all.q@sunnode01.coh.org        SLAVE


> Is the PE attached to the queue as a default, or listed only in both
> machines' host-specific settings?

I'm not certain what you mean here.
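
(Presumably this asks where the PE is referenced: in the queue-wide
pe_list, or only in per-host overrides. A sketch of how one might
check, assuming the queue is all.q:)

# Queue-wide: the PE is listed once, valid on every host of the queue.
qconf -sq all.q | grep pe_list
pe_list               openmpi

# Host-specific: the PE appears only inside bracketed overrides, e.g.
# pe_list               NONE,[sunnode00=openmpi],[sunnode01=openmpi]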

> 
> > Run via SGE
> > Hello, world, I am 0 of 22 running on sunnode00.coh.org
> > Hello, world, I am 1 of 22 running on sunnode00.coh.org
> > ...
> > Hello, world, I am 20 of 22 running on sunnode00.coh.org
> > Hello, world, I am 21 of 22 running on sunnode00.coh.org
> >
> > All 22 tasks run on sunnode00
> >
> > Run via cmdline 'mpiexec -np 22 -machinefile $HOME/machines ./hello_c'
> > Hello, world, I am 0 of 22 running on sunnode00.coh.org
> > Hello, world, I am 1 of 22 running on sunnode01.coh.org
> > ...
> > Hello, world, I am 20 of 22 running on sunnode00.coh.org
> > Hello, world, I am 21 of 22 running on sunnode01.coh.org
> >
> > 11 tasks run on sunnode00 and 11 tasks run on sunnode01
> >
> > I also get all 22 tasks running on one node if I run something like
> > 'qrsh -V -verbose -pe openmpi 22 mpirun -np 22 -machinefile $HOME/machines $HOME/test/hello'
> >
> > qconf -sconf output is attached
> >
