[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

k_clevenger kclevenger at coh.org
Fri Dec 18 22:40:56 GMT 2009


> 
> What happens if you remove all the SGE settings, they should be set  
> by SGE automatically. Are they only set and not exported?
> 

They are. The only things I'm setting manually are PATH, LD_LIBRARY_PATH, and the compile flags; all of the SGE variables come from SGE itself.
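
For reference, a minimal job script matching that description might look like this (a sketch, not the poster's actual script; the `./a.out` binary is hypothetical, the paths are the ones from the env dump below):

```shell
#!/bin/bash
#$ -pe ompi 2
#$ -cwd
# Only PATH and LD_LIBRARY_PATH are set by hand; every SGE_* variable
# is expected to arrive from SGE itself.
export PATH=/opt/openmpi-1.4/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.4/lib:$LD_LIBRARY_PATH
# Under a working tight integration, mpirun needs no hostfile or -np:
mpirun ./a.out
```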

> 
> 
> I would assume, that Open MPI isn't detecting that it's running under  
> SGE - ARC, JOB_ID and PE_HOSTFILE are left untouched?

We're using the packaged ge62u4_lx24-amd64.tar.gz binaries and have tried Open MPI 1.3.3 and 1.4.
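
One quick sanity check (my sketch, not from the original post): the quoted question names ARC, JOB_ID and PE_HOSTFILE among the variables Open MPI's SGE detection looks for, so a small helper run inside the job script can confirm they are actually exported:

```shell
# Hypothetical helper: verify the variables named in the quoted question
# (plus SGE_ROOT) are present in the exported environment of the job.
check_sge_env() {
  for v in SGE_ROOT ARC JOB_ID PE_HOSTFILE; do
    printenv "$v" >/dev/null || { echo "missing: $v"; return 1; }
  done
  echo "SGE environment looks complete"
}

# Example using the values from the env dump below:
export SGE_ROOT=/opt/sge-6_2u4 ARC=lx24-amd64 JOB_ID=54
export PE_HOSTFILE=/opt/sge-6_2u4/default/spool/sgenode1/active_jobs/54.1/pe_hostfile
check_sge_env
```

It is also worth checking `ompi_info | grep gridengine` on the compute nodes to confirm the gridengine component was compiled into the Open MPI build at all.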

Here's the env dump from a simple job on the test cluster, which behaves exactly the same as the production cluster:

ARC=lx24-amd64
_=/bin/env
CONSOLE=/dev/console
CVS_RSH=ssh
ENVIRONMENT=BATCH
G_BROKEN_FILENAMES=1
HOSTNAME=sgenode1.coh.org
JAVA_HOME=/opt/jdk1.6.0_16
JOB_ID=54
JOB_NAME=Job
JOB_SCRIPT=/opt/sge-6_2u4/default/spool/sgenode1/job_scripts/54
LANG=en_US.UTF-8
LD_LIBRARY_PATH=:/opt/sge-6_2u4/lib/lx24-amd64:/opt/openmpi-1.4/lib
MPI_HOME=/opt/openmpi-1.4
NHOSTS=2
NQUEUES=2
NSLOTS=2
OPENMPI_HOME=/opt/openmpi-1.4
PATH=/tmp/54.1.all.q:/opt/sge-6_2u4/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/openmpi-1.4/bin:/opt/jdk1.6.0_16/bin:/home/kclevenger/bin
PE_HOSTFILE=/opt/sge-6_2u4/default/spool/sgenode1/active_jobs/54.1/pe_hostfile
PE=ompi
previous=N
PREVLEVEL=N
QUEUE=all.q
REQNAME=Job
REQUEST=Job
RESTARTED=0
runlevel=3
RUNLEVEL=3
SELINUX_INIT=YES
SGE_ACCOUNT=sge
SGE_ARCH=lx24-amd64
SGE_BINARY_PATH=/opt/sge-6_2u4/bin/lx24-amd64
SGE_CELL=default
SGE_CLUSTER_NAME=default
SGE_CWD_PATH=/home/kclevenger
SGE_JOB_SPOOL_DIR=/opt/sge-6_2u4/default/spool/sgenode1/active_jobs/54.1
SGE_O_HOME=/home/kclevenger
SGE_O_HOST=sgehead
SGE_O_LOGNAME=kclevenger
SGE_O_MAIL=/var/spool/mail/kclevenger
SGE_O_PATH=/opt/sge-6_2u4/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/openmpi-1.4/bin:/opt/jdk1.6.0_16/bin:/home/kclevenger/bin
SGE_O_SHELL=/bin/bash
SGE_O_WORKDIR=/home/kclevenger
SGE_ROOT=/opt/sge-6_2u4
SGE_STDERR_PATH=/home/kclevenger/Job.e54
SGE_STDIN_PATH=/dev/null
SGE_STDOUT_PATH=/home/kclevenger/Job.o54
SGE_TASK_FIRST=undefined
SGE_TASK_ID=undefined
SGE_TASK_LAST=undefined
SGE_TASK_STEPSIZE=undefined
SHELL=/bin/bash
SHLVL=2
TMPDIR=/tmp/54.1.all.q
TMP=/tmp/54.1.all.q
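
PE_HOSTFILE points at a file with one "host slots queue processors" line per node. As a workaround while autodetection is broken, that file can be converted into an explicit machinefile for mpirun (a sketch; the sample contents below are assumed from the two-node, one-slot-each queue definition):

```shell
# Sample pe_hostfile contents, assumed from the queue's slots line
# (one slot on each of sgenode0 and sgenode1):
cat > pe_hostfile.sample <<'EOF'
sgenode0.coh.org 1 all.q@sgenode0.coh.org UNDEFINED
sgenode1.coh.org 1 all.q@sgenode1.coh.org UNDEFINED
EOF

# Convert to Open MPI machinefile syntax: "host slots=N" per line.
awk '{print $1, "slots=" $2}' pe_hostfile.sample > machinefile
cat machinefile
# In a real job script: awk '{print $1, "slots=" $2}' "$PE_HOSTFILE" > machinefile
#                       mpirun -np $NSLOTS -machinefile machinefile ./a.out
```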

The PE definition:
pe_name            ompi
slots              2
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin # the default $pe_hostfile absolutely will not work
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
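
For comparison, the PE settings commonly recommended for tight Open MPI/SGE integration (e.g. in the Open MPI FAQ) use control_slaves TRUE and job_is_first_task FALSE, so that the remote orted daemons are started via qrsh under SGE's control — a reference fragment, not the configuration above:

```
pe_name            ompi
slots              2
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
```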

The all.q definition:
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make ompi
rerun                 FALSE
slots                 2,[sgenode1.coh.org=1],[sgenode0.coh.org=1]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior # I've tried both posix_behavior and unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234170
