[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

reuti reuti at staff.uni-marburg.de
Wed Dec 16 20:51:35 GMT 2009


Am 16.12.2009 um 20:16 schrieb k_clevenger:

> When a job is submitted, all the tasks execute on only one node.  
> If I submit the same job via mpiexec on the command line, the  
> tasks are dispersed correctly.
>
> I have reviewed "OpenMPI job on stay on one node", "Using ssh with  
> qrsh and qlogin", the SGE sections on the OpenMPI site, etc. with  
> no solution.
>
> Nodes: 16 core x86_64 blades
> OS (all): CentOS 5.4 x86_64
> SGE Version: 6_2u4
> OpenMPI Version: 1.3.3 compiled with --with-sge
> ompi_info: MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
> IPTables off
>
> PE:
> pe_name            openmpi
> slots              32
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/sge-6_2u4/mpi/startmpi.sh -catch_rsh  
> $pe_hostfile
> stop_proc_args     /opt/sge-6_2u4/mpi/stopmpi.sh

Both entries can be /bin/true. The defined procedures don't hurt, but  
aren't necessary for a tight Open MPI integration.
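Trimmed down, the PE could then look like this (a sketch based on the settings quoted above; edit it with `qconf -mp openmpi`):

```
pe_name            openmpi
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```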


> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> SGE script:
> #!/bin/sh
> #$ -pe openmpi 22
> #$ -N Para1
> #$ -cwd
> #$ -j y
> #$ -V
> #
> mpiexec -np $NSLOTS -machinefile $TMPDIR/machines ./hello_c

You can leave "-machinefile $TMPDIR/machines" out.
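With a tight integration, Open MPI queries SGE for the granted node list itself, so the job script can be reduced to something like this (a sketch; hello_c stands in for your own binary):

```shell
#!/bin/sh
#$ -pe openmpi 22
#$ -N Para1
#$ -cwd
#$ -j y
#$ -V
# No -machinefile needed: mpiexec reads the SGE allocation directly.
mpiexec -np $NSLOTS ./hello_c
```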

If you put a "sleep 30" in the job script and check the allocation  
with `qstat -g t` during execution: were slots granted on both  
machines? And is the PE attached as a default to the queue, or  
listed in both machines' specific settings?
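For reference, the $PE_HOSTFILE that SGE hands to the job (and that startmpi.sh turns into $TMPDIR/machines) has one line per host: hostname, granted slots, queue instance, processor range. A quick sanity check of the allocation can be sketched like this (the sample file contents below are hypothetical):

```shell
#!/bin/sh
# Hypothetical sample of a $PE_HOSTFILE as SGE writes it,
# one line per host: hostname slots queue-instance processor-range
cat > pe_hostfile.sample <<'EOF'
sunnode00.coh.org 11 all.q@sunnode00.coh.org <NULL>
sunnode01.coh.org 11 all.q@sunnode01.coh.org <NULL>
EOF

# Sum the granted slots and count the hosts; with a working
# $round_robin allocation more than one host should show up.
awk '{ slots += $2; hosts++ }
     END { printf "hosts=%d slots=%d\n", hosts, slots }' pe_hostfile.sample
```

In a real job you would run the awk line against "$PE_HOSTFILE" instead of the sample file.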


> Run via SGE
> Hello, world, I am 0 of 22 running on sunnode00.coh.org
> Hello, world, I am 1 of 22 running on sunnode00.coh.org
> ...
> Hello, world, I am 20 of 22 running on sunnode00.coh.org
> Hello, world, I am 21 of 22 running on sunnode00.coh.org
>
> All 22 tasks run on sunnode00
>
> Run via cmdline 'mpiexec -np 22 -machinefile $HOME/machines ./hello_c'
> Hello, world, I am 0 of 22 running on sunnode00.coh.org
> Hello, world, I am 1 of 22 running on sunnode01.coh.org
> ....
> Hello, world, I am 20 of 22 running on sunnode00.coh.org
> Hello, world, I am 21 of 22 running on sunnode01.coh.org
>
> 11 tasks run on sunnode00 and 11 tasks run on sunnode01
>
> I also get all 22 tasks running on one node if I run something like  
> 'qrsh -V -verbose -pe openmpi 22 mpirun -np 22 -machinefile $HOME/ 
> machines $HOME/test/hello'
>
> qconf -sconf output is attached
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233785
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].<qconf.txt>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233800


