[GE users] All OpenMPI processes run on same node

bwillems bwi565 at gmail.com
Mon Oct 25 18:42:06 BST 2010



Hi Reuti, 

I am running SGE 6.2u4 and the output of "qconf -sconf" is


# qconf -sconf
#global:
execd_spool_dir              /opt/gridengine/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 H_MEMORYLOCKED=infinity
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
jsv_url                      none
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
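
Side note: the remote-startup entries appear twice in this output, first as
"builtin" and again as "/usr/bin/ssh" near the end. If those ssh overrides
turn out to matter, the global configuration can be reviewed and edited with
qconf; the command below is only an illustrative sketch and simply opens the
configuration in $EDITOR:

# qconf -mconf global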

Thanks,
Bart

> Hi,
> 
> Am 25.10.2010 um 18:29 schrieb bwillems:
> 
> > I'm having trouble with SGE/OpenMPI on a Rocks cluster: all
> > processes of a parallel job end up running on the same node. I searched
> > the forums, but none of the past posts on this solve my problem. I
> > compiled OpenMPI with
> > 
> > # ./configure --prefix=/share/apps/openmpi/gcc --enable-static
> > --with-libnuma --with-sge --with-openib=/opt/ofed CC=gcc CXX=g++
> > F77=gfortran FC=gfortran
> > 
> > The PE I'm using is
> > 
> > # qconf -sp mpi
> > pe_name            mpi
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /opt/gridengine/mpi/startmpi.sh $pe_hostfile
> > stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary TRUE
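
For context: allocation_rule $fill_up packs all of a job's slots onto one
host before spilling over to the next, while $round_robin spreads them
across hosts; which of the two is wanted is a site decision. With an OpenMPI
built --with-sge, the startmpi.sh/stopmpi.sh wrappers are generally not
needed either. A purely illustrative PE sketch under those assumptions,
using the hypothetical name "orte", might look like:

pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE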
> > 
> > 
> > My test program is a simple mpihello compiled as
> > 
> > # /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c
> > 
> > and submitted with
> > 
> > #!/bin/bash
> > # run job from current working directory
> > #$ -cwd
> > # combine stdout and stderr of job
> > #$ -j y
> > # use this shell as the default shell
> > #$ -S /bin/bash
> > 
> > # "-l" specifies resource requirements of job. In this case we are
> > # asking for 30 mins of computational time, as a hard requirement.
> > #$ -l h_cpu=00:30:00
> > # parallel environment and number of cores to use
> > #$ -pe mpi 16
> > # computational command to run
> > /share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np
> > $NSLOTS ./mpihello
> > exit 0
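
With SGE support compiled into OpenMPI, mpirun reads the slot allocation
directly from the SGE environment ($PE_HOSTFILE), so the -machinefile and
-np arguments can be dropped, as Reuti suggests below. A trimmed-down sketch
of the same script under that assumption:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -l h_cpu=00:30:00
#$ -pe mpi 16
# mpirun picks up the granted hosts and slot counts from SGE itself
/share/apps/openmpi/gcc/bin/mpirun ./mpihello
exit 0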
> > 
> > This leads to 16 processes running on a single node with only 12 cores
> > available. If I omit the machinefile option to mpirun, I get the
> 
> but this is the way to go: a plain mpiexec; I think even the "-np $NSLOTS" can be left out.
> 
> 
> > following errors:
> > 
> > error: error: ending connection before all data received
> > error:
> > error reading job context from "qlogin_starter"
> > --------------------------------------------------------------------------
> 
> So, what's the output of:
> 
> $ qconf -sconf
> 
> for the entries of "rsh_daemon" and "rsh_command" (and which version of SGE are you using)?
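
An illustrative way to pull out exactly those two entries, plus the GE
version, would be something along these lines (the first line of
"qconf -help" normally reports the Grid Engine version):

# qconf -sconf | egrep '(rsh|rlogin|qrsh)_(command|daemon)'
# qconf -help | head -1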
> 
> -- Reuti
> 
> 
> > A daemon (pid 16022) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> > 
> > There may be more information reported by the environment (see above).
> > 
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> > 
> > 
> > Pointing LD_LIBRARY_PATH to the libraries in the submission script
> > does not help either.
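
For completeness: besides exporting LD_LIBRARY_PATH inside the job script,
mpirun itself can forward environment variables with "-x" and can set up
PATH and LD_LIBRARY_PATH for the remote daemons with "--prefix". A hedged
sketch, assuming the installation really lives under /share/apps/openmpi/gcc:

# /share/apps/openmpi/gcc/bin/mpirun -x LD_LIBRARY_PATH ./mpihello
# /share/apps/openmpi/gcc/bin/mpirun --prefix /share/apps/openmpi/gcc ./mpihello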
> > 
> > Any suggestions?
> > 
> > Thanks,
> > Bart
> > 