[GE users] All OpenMPI processes run on same node

reuti reuti at staff.uni-marburg.de
Wed Oct 27 00:22:00 BST 2010



On 25.10.2010, at 19:42, bwillems <bwi565 at gmail.com> wrote:

> Hi Reuti,
>
> I am running SGE 6.2u4 and the output of "qconf -sconf" is
>
>
> # qconf -sconf
> #global:
> execd_spool_dir              /opt/gridengine/default/spool
> mailer                       /bin/mail
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           none
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 H_MEMORYLOCKED=infinity
> reporting_params             accounting=true reporting=true \
>                             flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> max_advance_reservations     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> jsv_url                      none
> qrsh_command                 /usr/bin/ssh
> rsh_command                  /usr/bin/ssh
> rlogin_command               /usr/bin/ssh

Please remove the above three lines. You already set these parameters a
few lines earlier to the value "builtin". Right now you have a mixture
of two communication methods - and this explains the error you got.
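
For reference, a minimal sketch of the cleaned-up section as it could
look after editing the global configuration with "qconf -mconf" (these
are just the "builtin" values you already have further up in the file):

  qlogin_command               builtin
  qlogin_daemon                builtin
  rlogin_command               builtin
  rlogin_daemon                builtin
  rsh_command                  builtin
  rsh_daemon                   builtin

Each of these parameters should appear only once, so the later
/usr/bin/ssh entries have to go.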

If you use plain SSH and thereby disable the tight SGE integration, it
would run as you observed, but with the drawbacks I mentioned.

I wonder how other parallel jobs are running with this setup in the
cluster.

And as I mentioned in my last reply: start_proc_args and stop_proc_args
can be set to /bin/true (you don't need to create a custom machinefile),
and a plain 'mpiexec myprogram' should do, as all the necessary
information is already provided by SGE.
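
A sketch of what this could look like (editable with "qconf -mp mpi";
the PE name, slot count and mpihello path are taken from your mails, the
rest is only an illustration, not a tested configuration):

  # qconf -sp mpi
  pe_name            mpi
  slots              9999
  start_proc_args    /bin/true
  stop_proc_args     /bin/true
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE
  (remaining entries unchanged)

and the job script shrinks to:

  #!/bin/bash
  #$ -cwd
  #$ -j y
  #$ -S /bin/bash
  #$ -l h_cpu=00:30:00
  #$ -pe mpi 16
  /share/apps/openmpi/gcc/bin/mpirun ./mpihello

With a working tight integration and an Open MPI built --with-sge,
mpirun picks up the granted hosts and slots from SGE itself, so neither
-machinefile nor "-np $NSLOTS" is needed.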

-- Reuti


> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> Thanks,
> Bart
>
>> Hi,
>>
>> On 25.10.2010, at 18:29, bwillems wrote:
>>
>>> I'm having trouble with SGE/OpenMPI on a Rocks cluster, as all
>>> processes of a parallel job tend to run on the same node. I searched
>>> the forums, but the past posts on this do not solve my problem. I
>>> compiled Open MPI with
>>>
>>> # ./configure --prefix=/share/apps/openmpi/gcc --enable-static
>>> --with-libnuma --with-sge --with-openib=/opt/ofed CC=gcc CXX=g++
>>> F77=gfortran FC=gfortran
>>>
>>> The PE I'm using is
>>>
>>> # qconf -sp mpi
>>> pe_name mpi
>>> slots 9999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
>>> stop_proc_args /opt/gridengine/mpi/stopmpi.sh
>>> allocation_rule $fill_up
>>> control_slaves TRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
>>> accounting_summary TRUE
>>>
>>>
>>> My test program is a simple mpihello compiled as
>>>
>>> # /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c
>>>
>>> and submitted with
>>>
>>> #!/bin/bash
>>> # run job from current working directory
>>> #$ -cwd
>>> # combine stdout and stderr of job
>>> #$ -j y
>>> # use this shell as the default shell
>>> #$ -S /bin/bash
>>>
>>> # "-l" specifies resource requirements of job. In this case we are
>>> # asking for 30 mins of computational time, as a hard requirement.
>>> #$ -l h_cpu=00:30:00
>>> # parallel environment and number of cores to use
>>> #$ -pe mpi 16
>>> # computational command to run
>>> /share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./mpihello
>>> exit 0
>>>
>>> This leads to 16 processes running on a single node with only 12 cores
>>> available. If I omit the machinefile option to mpirun, I get the
>>
>> But this is the way to go: a plain mpiexec. I think even the "-np
>> $NSLOTS" can be left out.
>>
>>
>>> following errors:
>>>
>>> error: error: ending connection before all data received
>>> error:
>>> error reading job context from "qlogin_starter"
>>> --------------------------------------------------------------------------
>>
>> So, what's the output of:
>>
>> $ qconf -sconf
>>
>> for the entries of "rsh_daemon" and "rsh_command" (and which  
>> version of SGE are you using)?
>>
>> -- Reuti
>>
>>
>>> A daemon (pid 16022) died unexpectedly with status 1 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>>
>>> Pointing LD_LIBRARY_PATH to the libraries in the submission script
>>> does not help either.
>>>
>>> Any suggestions?
>>>
>>> Thanks,
>>> Bart
>>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=290329

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


