[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

reuti reuti at staff.uni-marburg.de
Fri Dec 18 23:46:57 GMT 2009


On 18.12.2009 at 23:40, k_clevenger wrote:

>>
>> What happens if you remove all the SGE settings? They should be set
>> by SGE automatically. Are they only set and not exported?
>>
>
> They are. The only thing I'm manually setting is the PATH,  
> LD_LIBRARY_PATH, and compile flags. All other SGE variables come  
> from SGE.
>
>>
>>
>> I would assume that Open MPI isn't detecting that it's running under
>> SGE. Are ARC, JOB_ID and PE_HOSTFILE left untouched?
>
> We're using the packaged ge62u4_lx24-amd64.tar.gz binaries and have  
> tried openmpi 1.3.3 and 1.4.
>
> Here's the env dump from a simple job on the test cluster that  
> behaves exactly the same as the production cluster:
>
> ARC=lx24-amd64
> _=/bin/env
> CONSOLE=/dev/console
> CVS_RSH=ssh
> ENVIRONMENT=BATCH
> G_BROKEN_FILENAMES=1
> HOSTNAME=sgenode1.coh.org
> JAVA_HOME=/opt/jdk1.6.0_16
> JOB_ID=54
> JOB_NAME=Job
> JOB_SCRIPT=/opt/sge-6_2u4/default/spool/sgenode1/job_scripts/54
> LANG=en_US.UTF-8
> LD_LIBRARY_PATH=:/opt/sge-6_2u4/lib/lx24-amd64:/opt/openmpi-1.4/lib
> MPI_HOME=/opt/openmpi-1.4
> NHOSTS=2
> NQUEUES=2
> NSLOTS=2
> OPENMPI_HOME=/opt/openmpi-1.4
> PATH=/tmp/54.1.all.q:/opt/sge-6_2u4/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/openmpi-1.4/bin:/opt/jdk1.6.0_16/bin:/home/kclevenger/bin
> PE_HOSTFILE=/opt/sge-6_2u4/default/spool/sgenode1/active_jobs/54.1/pe_hostfile
> PE=ompi
> previous=N
> PREVLEVEL=N
> QUEUE=all.q
> REQNAME=Job
> REQUEST=Job
> RESTARTED=0
> runlevel=3
> RUNLEVEL=3
> SELINUX_INIT=YES

Are you running SELinux? Can you turn it off? There are reported
problems with it and SGE.


> SGE_ACCOUNT=sge
> SGE_ARCH=lx24-amd64
> SGE_BINARY_PATH=/opt/sge-6_2u4/bin/lx24-amd64
> SGE_CELL=default
> SGE_CLUSTER_NAME=default
> SGE_CWD_PATH=/home/kclevenger
> SGE_JOB_SPOOL_DIR=/opt/sge-6_2u4/default/spool/sgenode1/active_jobs/54.1
> SGE_O_HOME=/home/kclevenger
> SGE_O_HOST=sgehead
> SGE_O_LOGNAME=kclevenger
> SGE_O_MAIL=/var/spool/mail/kclevenger
> SGE_O_PATH=/opt/sge-6_2u4/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/openmpi-1.4/bin:/opt/jdk1.6.0_16/bin:/home/kclevenger/bin
> SGE_O_SHELL=/bin/bash
> SGE_O_WORKDIR=/home/kclevenger
> SGE_ROOT=/opt/sge-6_2u4
> SGE_STDERR_PATH=/home/kclevenger/Job.e54
> SGE_STDIN_PATH=/dev/null
> SGE_STDOUT_PATH=/home/kclevenger/Job.o54
> SGE_TASK_FIRST=undefined
> SGE_TASK_ID=undefined
> SGE_TASK_LAST=undefined
> SGE_TASK_STEPSIZE=undefined
> SHELL=/bin/bash
> SHLVL=2
> TMPDIR=/tmp/54.1.all.q
> TMP=/tmp/54.1.all.q
>
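As a quick cross-check of the dump above, here is a minimal Python
sketch that reports which of the variables Open MPI is expected to look
at are missing from a job's environment. ARC, JOB_ID and PE_HOSTFILE are
the ones named in this thread; adding SGE_ROOT to the list is an
assumption, and missing_sge_vars is a hypothetical helper name. In the
dump above all four are set.

```python
import os

# Variables Open MPI is assumed to check before enabling its SGE support.
# ARC, JOB_ID and PE_HOSTFILE are named in this thread; SGE_ROOT is an
# assumption added for illustration.
REQUIRED_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def missing_sge_vars(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_sge_vars()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all SGE variables are set")
```

Run from inside the job script, it checks the environment the job
actually sees rather than the one of the submitting shell.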
> The PE definition:
> pe_name            ompi
> slots              2

Is this now a test configuration? It was 32 in your last mail.


> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin # the default $pe_hostfile absolutely will not work

Well, with one slot per node it can't find both: the default is
$pe_slots, which requires all slots of the job to come from a single
machine.
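To illustrate the difference, here is a minimal Python sketch of the two
allocation rules under a deliberately simplified model. The allocate
helper is hypothetical and ignores everything the real scheduler also
considers (load, queue sequence numbers, resource requests); it only
shows why a 2-slot request fails under $pe_slots when each node offers a
single slot.

```python
def allocate(rule, requested, free_slots):
    """Simplified model of two SGE PE allocation rules (hypothetical helper).

    free_slots maps host name -> free slots on that host.
    Returns a dict host -> granted slots, or None if the request
    cannot be satisfied under the given rule.
    """
    if rule == "$pe_slots":
        # All requested slots must come from one single host.
        for host, free in free_slots.items():
            if free >= requested:
                return {host: requested}
        return None
    if rule == "$round_robin":
        # Hand out one slot per host in turn until the request is met.
        granted = {host: 0 for host in free_slots}
        remaining = requested
        while remaining > 0:
            progressed = False
            for host, free in free_slots.items():
                if remaining > 0 and granted[host] < free:
                    granted[host] += 1
                    remaining -= 1
                    progressed = True
            if not progressed:
                return None  # every host is full, request still unmet
        return {host: n for host, n in granted.items() if n > 0}
    raise ValueError("unsupported rule: " + rule)


# One free slot per node, as in the all.q definition in this thread.
hosts = {"sgenode0.coh.org": 1, "sgenode1.coh.org": 1}
print(allocate("$pe_slots", 2, hosts))
print(allocate("$round_robin", 2, hosts))
```

With one free slot on each of the two nodes, the $pe_slots call returns
None, while $round_robin grants one slot on each host.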


> control_slaves     FALSE
> job_is_first_task  TRUE

This was different last time. It should be:

control_slaves TRUE
job_is_first_task FALSE
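A small Python sketch for checking a PE definition against these two
settings, assuming the text comes from `qconf -sp ompi` and uses the
plain "key value" layout shown in this mail. The names parse_pe,
pe_problems and RECOMMENDED are hypothetical, chosen for illustration.

```python
def parse_pe(text):
    """Parse 'key value' lines, as printed by `qconf -sp <pe>`, into a dict."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


# The two values suggested in this thread for a tightly integrated PE.
RECOMMENDED = {"control_slaves": "TRUE", "job_is_first_task": "FALSE"}


def pe_problems(settings):
    """Return (key, found, expected) for each setting that differs."""
    return [
        (key, settings.get(key), expected)
        for key, expected in RECOMMENDED.items()
        if settings.get(key) != expected
    ]


# Excerpt of the PE definition quoted in this mail.
pe_text = """\
pe_name            ompi
slots              2
control_slaves     FALSE
job_is_first_task  TRUE
"""
for key, found, expected in pe_problems(parse_pe(pe_text)):
    print(f"{key}: found {found}, expected {expected}")
```

For the definition quoted here it flags both settings.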


> urgency_slots      min
> accounting_summary FALSE
>
> The all.q definition:
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make ompi
> rerun                 FALSE
> slots                 2,[sgenode1.coh.org=1],[sgenode0.coh.org=1]

Is this now a test configuration with fewer slots?


> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      unix_behavior # I've tried both posix_behavior and unix_behavior

Yes, unix_behavior is often better than the default.

-- Reuti


> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234175

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


