[GE users] All OpenMPI processes run on same node

reuti reuti at staff.uni-marburg.de
Mon Oct 25 17:42:20 BST 2010



Hi,

On 25.10.2010 at 18:29, bwillems wrote:

> I'm having trouble with SGE/Open MPI on a Rocks cluster, as all
> processes of a parallel job tend to run on the same node. I searched
> the forums, but past posts on this do not solve my problem. I
> compiled Open MPI with
> 
> # ./configure --prefix=/share/apps/openmpi/gcc --enable-static
> --with-libnuma --with-sge --with-openib=/opt/ofed CC=gcc CXX=g++
> F77=gfortran FC=gfortran
> 
> The PE I'm using is
> 
> # qconf -sp mpi
> pe_name mpi
> slots 9999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
> stop_proc_args /opt/gridengine/mpi/stopmpi.sh
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary TRUE
> 
> 
> My test program is a simple mpihello compiled as
> 
> # /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c
> 
> and submitted with
> 
> #!/bin/bash
> # run job from current working directory
> #$ -cwd
> # combine stdout and stderr of job
> #$ -j y
> # use this shell as the default shell
> #$ -S /bin/bash
> 
> # "-l" specifies resource requirements of job. In this case we are
> # asking for 30 mins of computational time, as a hard requirement.
> #$ -l h_cpu=00:30:00
> # parallel environment and number of cores to use
> #$ -pe mpi 16
> # computational command to run
> /share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np
> $NSLOTS ./mpihello
> exit 0
> 
> This leads to 16 processes running on a single node with only 12 cores
> available. If I omit the machinefile option to mpirun, I get the

But this is the way to go: a plain mpirun (or mpiexec) without a machinefile; with a tight integration I think even the "-np $NSLOTS" can be left out.
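For what it's worth: with a tight SGE integration, Open MPI takes the host and slot allocation from the pe_hostfile that SGE writes (pointed to by $PE_HOSTFILE), so neither -machinefile nor -np is needed. A minimal sketch of the layout of that file — the hostnames and slot counts below are fabricated for illustration, not from your cluster:

```shell
# Sketch of what Open MPI's SGE support reads when -machinefile is dropped.
# Host names and slot counts are made up for this example.
PE_HOSTFILE=$(mktemp)
cat > "$PE_HOSTFILE" <<'EOF'
compute-0-0 12 all.q@compute-0-0 UNDEFINED
compute-0-1 4 all.q@compute-0-1 UNDEFINED
EOF

# Column 2 is the number of slots granted on each host; their sum
# equals what SGE exports as $NSLOTS to the job.
NSLOTS=$(awk '{ s += $2 } END { print s }' "$PE_HOSTFILE")
echo "$NSLOTS"    # prints 16 for this example

rm -f "$PE_HOSTFILE"
```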


> following errors:
> 
> error: error: ending connection before all data received
> error:
> error reading job context from "qlogin_starter"
> --------------------------------------------------------------------------

So, what's the output of:

$ qconf -sconf

for the entries of "rsh_daemon" and "rsh_command" (and which version of SGE are you using)?
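For reference, on SGE 6.2 or later a tightly integrated setup commonly uses the builtin startup method, so no external rsh/ssh is involved in the `qrsh -inherit` calls Open MPI issues. Illustrative entries only, not taken from your cluster:

```
# Example `qconf -sconf` entries for the builtin method (SGE 6.2+);
# older installations point these at rsh/in.rshd or ssh/sshd instead.
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
```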

-- Reuti


> A daemon (pid 16022) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> 
> 
> Pointing LD_LIBRARY_PATH to the libraries in the submission script
> does not help either.
> 
> Any suggestions?
> 
> Thanks,
> Bart
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=289973
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=289977

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


