[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

reuti reuti at staff.uni-marburg.de
Thu Dec 17 00:31:19 GMT 2009


On 17.12.2009 at 00:30, k_clevenger wrote:

>>> # which mpiexec
>>> /opt/openmpi-1.3.3/bin/mpiexec
>>>
>>> # ls -l /opt/openmpi-1.3.3/bin/mpiexec
>>> lrwxrwxrwx 1 root root 7 Nov  6 13:57 /opt/openmpi-1.3.3/bin/mpiexec -> orterun
>>>
>>> # ldd /opt/openmpi-1.3.3/bin/orterun
>>>   libopen-rte.so.0 => /opt/openmpi-1.3.3/lib/libopen-rte.so.0 (0x00002aaaaaaad000)
>>>   libopen-pal.so.0 => /opt/openmpi-1.3.3/lib/libopen-pal.so.0 (0x00002aaaaacf4000)
>>>   libdl.so.2 => /lib64/libdl.so.2 (0x0000003d2ec00000)
>>>   libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003d31c00000)
>>>   libutil.so.1 => /lib64/libutil.so.1 (0x0000003d3b600000)
>>>   libm.so.6 => /lib64/libm.so.6 (0x0000003d2f000000)
>>>   libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d2f400000)
>>>   libc.so.6 => /lib64/libc.so.6 (0x0000003d2e800000)
>>>   /lib64/ld-linux-x86-64.so.2 (0x0000003d2e400000)
>>
>> What is the output when you run this inside a job script (and also
>> ldd hello_c)? Depending on the .bashrc, the paths could be different
>> inside a job script.
>
> ldd /opt/openmpi-1.3.3/bin/orterun from within a job
>   libopen-rte.so.0 => /opt/openmpi-1.3.3/lib/libopen-rte.so.0 (0x00002aaaaaaad000)
>   libopen-pal.so.0 => /opt/openmpi-1.3.3/lib/libopen-pal.so.0 (0x00002aaaaacf4000)
>   libdl.so.2 => /lib64/libdl.so.2 (0x0000003de3800000)
>   libnsl.so.1 => /lib64/libnsl.so.1 (0x000000330c000000)
>   libutil.so.1 => /lib64/libutil.so.1 (0x000000330a800000)
>   libm.so.6 => /lib64/libm.so.6 (0x0000003de4800000)
>   libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003de3c00000)
>   libc.so.6 => /lib64/libc.so.6 (0x0000003de3400000)
>   /lib64/ld-linux-x86-64.so.2 (0x0000003de3000000)
>
> ldd ./hello_c from within a job
>   libmpi.so.0 => /opt/openmpi-1.3.3/lib/libmpi.so.0 (0x00002aaaaaaad000)
>   libopen-rte.so.0 => /opt/openmpi-1.3.3/lib/libopen-rte.so.0 (0x00002aaaaad50000)
>   libopen-pal.so.0 => /opt/openmpi-1.3.3/lib/libopen-pal.so.0 (0x00002aaaaaf97000)
>   libdl.so.2 => /lib64/libdl.so.2 (0x0000003de3800000)
>   libnsl.so.1 => /lib64/libnsl.so.1 (0x000000330c000000)
>   libutil.so.1 => /lib64/libutil.so.1 (0x000000330a800000)
>   libm.so.6 => /lib64/libm.so.6 (0x0000003de4800000)
>   libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003de3c00000)
>   libc.so.6 => /lib64/libc.so.6 (0x0000003de3400000)
>   /lib64/ld-linux-x86-64.so.2 (0x0000003de3000000)
>
>> If you want to avoid dynamic binaries: I prefer to compile Open MPI
>> with --enable-static --disable-shared
>>
>
> The environment should be ok across the cluster. I have a
> standard .bashrc include that sets all the PATH/LD_LIBRARY_PATH/etc.
> variables for all the apps. The relevant parts of the user
> environment look like:
>
> LANG=en_US.UTF-8
> LD_LIBRARY_PATH=/opt/sge-6_2u4/lib/lx24-amd64:/opt/openmpi-1.3.3/lib:...
> MPI_HOME=/opt/openmpi-1.3.3
> OPENMPI_HOME=/opt/openmpi-1.3.3
> PATH=/opt/sge-6_2u4/bin/lx24-amd64:/opt/openmpi-1.3.3/bin:...

What happens if you remove all the SGE settings? They should be set
by SGE automatically. Are they only set and not exported?
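
One way to tell "set" from "exported" is sketched below. It assumes a POSIX shell and that printenv is available; the function name var_status is made up for this example. A variable that is only set is visible to the shell itself but not to child processes such as mpiexec.

```shell
#!/bin/sh
# var_status NAME: print "unset", "set-only" or "exported" for the variable.
# (var_status is a hypothetical helper name, not part of SGE or Open MPI.)
var_status() {
    # ${NAME+x} expands to "x" only if NAME is set in this shell
    if ! eval "[ -n \"\${$1+x}\" ]"; then
        echo "unset"
    # printenv is an external command, so it only sees exported variables
    elif printenv "$1" >/dev/null; then
        echo "exported"
    else
        echo "set-only"
    fi
}

# The SGE variables shown in the quoted environment
for v in SGE_ROOT SGE_CELL SGE_QMASTER_PORT SGE_EXECD_PORT; do
    printf '%s: %s\n' "$v" "$(var_status "$v")"
done
```

Running this once in an interactive login and once at the top of the job script should show whether the two environments differ.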

> SGE_CELL=default
> SGE_CLUSTER_NAME=suncluster
> SGE_EXECD_PORT=6445
> SGE_QMASTER_PORT=6444
> SGE_ROOT=/opt/sge-6_2u4
> SHELL=/bin/bash
> SHLVL=1

It's unusual to redefine TMPDIR - SGE already sets it to a directory
whose name includes the job id and queue name, to avoid conflicts
between programs running at the same time.

> TMPDIR=/tmp


I would assume that Open MPI isn't detecting that it's running under
SGE. Are ARC, JOB_ID and PE_HOSTFILE left untouched?
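
A rough sketch for checking this from inside the job script follows. It assumes a POSIX shell with printenv; check_sge_env is a made-up name, and including SGE_ROOT in the list is my assumption about what Open MPI looks at, so adjust it as needed.

```shell
#!/bin/sh
# check_sge_env: print the SGE variables a job should inherit and
# return non-zero if any of them is missing from the environment.
# (check_sge_env is a hypothetical helper; SGE_ROOT in the list is an assumption.)
check_sge_env() {
    missing=0
    for v in SGE_ROOT ARC JOB_ID PE_HOSTFILE; do
        if printenv "$v" >/dev/null; then
            printf '%s=%s\n' "$v" "$(printenv "$v")"
        else
            printf '%s is NOT in the environment\n' "$v"
            missing=1
        fi
    done
    return "$missing"
}

if check_sge_env; then
    # PE_HOSTFILE should list the hosts and slots granted to the job
    [ -r "$PE_HOSTFILE" ] && cat "$PE_HOSTFILE"
else
    echo "at least one variable is missing from the job environment"
fi
```

If any of these is missing or overwritten inside the job, mpiexec would have no way to learn about the granted hosts, which would fit all tasks ending up on a single node.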

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233832