[GE users] All OpenMPI processes run on same node

reuti reuti at staff.uni-marburg.de
Thu Oct 28 09:54:45 BST 2010



On 27.10.2010, at 23:22, bwillems wrote:

> Thanks Reuti! I found that the problem also doesn't occur with the Open MPI that comes with Rocks,

Maybe this is running outside of SGE (i.e. was compiled w/o --with-sge).
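A quick way to verify this, assuming the ompi_info binary from the installation in question is used, is to look for the gridengine component: a build configured with --with-sge lists it, while a build without SGE support prints nothing.

    # run ompi_info from the Open MPI installation you want to test
    $ ompi_info | grep gridengine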

-- Reuti


> only with my own compiled version. I'll make the changes you suggest and investigate further what the differences are.
> 
> Thanks,
> Bart
> 
>> On 25.10.2010, at 19:42, bwillems <bwi565 at gmail.com> wrote:
>> 
>>> Hi Reuti,
>>> 
>>> I am running SGE 6.2u4 and the output of "qconf -sconf" is
>>> 
>>> 
>>> # qconf -sconf
>>> #global:
>>> execd_spool_dir              /opt/gridengine/default/spool
>>> mailer                       /bin/mail
>>> xterm                        /usr/bin/X11/xterm
>>> load_sensor                  none
>>> prolog                       none
>>> epilog                       none
>>> shell_start_mode             posix_compliant
>>> login_shells                 sh,ksh,csh,tcsh
>>> min_uid                      0
>>> min_gid                      0
>>> user_lists                   none
>>> xuser_lists                  none
>>> projects                     none
>>> xprojects                    none
>>> enforce_project              false
>>> enforce_user                 auto
>>> load_report_time             00:00:40
>>> max_unheard                  00:05:00
>>> reschedule_unknown           00:00:00
>>> loglevel                     log_warning
>>> administrator_mail           none
>>> set_token_cmd                none
>>> pag_cmd                      none
>>> token_extend_time            none
>>> shepherd_cmd                 none
>>> qmaster_params               none
>>> execd_params                 H_MEMORYLOCKED=infinity
>>> reporting_params             accounting=true reporting=true \
>>>                              flush_time=00:00:15 joblog=true \
>>>                              sharelog=00:00:00
>>> finished_jobs                100
>>> gid_range                    20000-20100
>>> qlogin_command               builtin
>>> qlogin_daemon                builtin
>>> rlogin_command               builtin
>>> rlogin_daemon                builtin
>>> rsh_command                  builtin
>>> rsh_daemon                   builtin
>>> max_aj_instances             2000
>>> max_aj_tasks                 75000
>>> max_u_jobs                   0
>>> max_jobs                     0
>>> max_advance_reservations     0
>>> auto_user_oticket            0
>>> auto_user_fshare             0
>>> auto_user_default_project    none
>>> auto_user_delete_time        86400
>>> delegated_file_staging       false
>>> reprioritize                 0
>>> jsv_url                      none
>>> qrsh_command                 /usr/bin/ssh
>>> rsh_command                  /usr/bin/ssh
>>> rlogin_command               /usr/bin/ssh
>> 
>> Please remove these three lines. You already set them to "builtin" a
>> few lines earlier, so right now you have a mixture of two communication
>> methods - and that explains the error you got.
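As a sketch of the fix, assuming the stock qconf editor workflow: the three ssh overrides would be deleted from the global configuration so that only the "builtin" entries shown further up remain.

    # edit the global configuration and delete the duplicate
    # qrsh_command / rsh_command / rlogin_command lines that point
    # to /usr/bin/ssh, keeping the earlier "builtin" entries
    $ qconf -mconf

    # afterwards only the builtin entries should be left:
    $ qconf -sconf | egrep 'qlogin_|rlogin_|rsh_'
    qlogin_command               builtin
    qlogin_daemon                builtin
    rlogin_command               builtin
    rlogin_daemon                builtin
    rsh_command                  builtin
    rsh_daemon                   builtin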
>> 
>> If you use plain SSH and disable the tight SGE integration, it will
>> run as you observed, but with the drawbacks I mentioned.
>> 
>> I wonder how other parallel jobs are running in the cluster with this
>> setup.
>> 
>> And as I mentioned in my last reply: start/stop_proc_args can be set
>> to /bin/true (you don't need to create a custom machinefile), and a
>> plain 'mpiexec myprogram' should do, as all the information is already
>> provided by SGE.
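A minimal sketch of both pieces, based on the PE and job script quoted further down in the thread (treat the exact paths as an illustration):

    # PE (qconf -mp mpi): no custom machinefile handling needed
    start_proc_args    /bin/true
    stop_proc_args     /bin/true

and the job script then shrinks to something like:

    #!/bin/bash
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #$ -l h_cpu=00:30:00
    #$ -pe mpi 16
    # no -machinefile and no -np: with a tightly integrated Open MPI,
    # the host list and slot count come from the SGE environment
    /share/apps/openmpi/gcc/bin/mpirun ./mpihello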
>> 
>> -- Reuti
>> 
>> 
>>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>> 
>>> Thanks,
>>> Bart
>>> 
>>>> Hi,
>>>> 
>>>> On 25.10.2010, at 18:29, bwillems wrote:
>>>> 
>>>>> I'm having trouble with SGE/Open MPI on a Rocks cluster, as all
>>>>> processes of a parallel job tend to run on the same node. I searched
>>>>> the forums, but none of the past posts on this solve my problem. I
>>>>> compiled Open MPI with
>>>>> 
>>>>> # ./configure --prefix=/share/apps/openmpi/gcc --enable-static
>>>>> --with-libnuma --with-sge --with-openib=/opt/ofed CC=gcc CXX=g++
>>>>> F77=gfortran FC=gfortran
>>>>> 
>>>>> The PE I'm using is
>>>>> 
>>>>> # qconf -sp mpi
>>>>> pe_name mpi
>>>>> slots 9999
>>>>> user_lists NONE
>>>>> xuser_lists NONE
>>>>> start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
>>>>> stop_proc_args /opt/gridengine/mpi/stopmpi.sh
>>>>> allocation_rule $fill_up
>>>>> control_slaves TRUE
>>>>> job_is_first_task FALSE
>>>>> urgency_slots min
>>>>> accounting_summary TRUE
>>>>> 
>>>>> 
>>>>> My test program is a simple mpihello compiled as
>>>>> 
>>>>> # /share/apps/openmpi/gcc/bin/mpicc -o mpihello mpihello.c
>>>>> 
>>>>> and submitted with
>>>>> 
>>>>> #!/bin/bash
>>>>> # run job from current working directory
>>>>> #$ -cwd
>>>>> # combine stdout and stderr of job
>>>>> #$ -j y
>>>>> # use this shell as the default shell
>>>>> #$ -S /bin/bash
>>>>> 
>>>>> # "-l" specifies resource requirements of job. In this case we are
>>>>> # asking for 30 mins of computational time, as a hard requirement.
>>>>> #$ -l h_cpu=00:30:00
>>>>> # parallel environment and number of cores to use
>>>>> #$ -pe mpi 16
>>>>> # computational command to run
>>>>> /share/apps/openmpi/gcc/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./mpihello
>>>>> exit 0
>>>>> 
>>>>> This leads to 16 processes running on a single node with only 12
>>>>> cores available. If I omit the machinefile option to mpirun, I get the
>>>> 
>>>> But this is the way to go: a plain mpiexec; I think even the
>>>> "-np $NSLOTS" can be left out.
>>>> 
>>>> 
>>>>> following errors:
>>>>> 
>>>>> error: error: ending connection before all data received
>>>>> error:
>>>>> error reading job context from "qlogin_starter"
>>>>> --------------------------------------------------------------------------
>>>> 
>>>> So, what's the output of:
>>>> 
>>>> $ qconf -sconf
>>>> 
>>>> for the entries of "rsh_daemon" and "rsh_command" (and which  
>>>> version of SGE are you using)?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> A daemon (pid 16022) died unexpectedly with status 1 while attempting
>>>>> to launch so we are aborting.
>>>>> 
>>>>> There may be more information reported by the environment (see above).
>>>>> 
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>> 
>>>>> 
>>>>> Pointing LD_LIBRARY_PATH to the libraries in the submission script
>>>>> does not help either.
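For reference, the usual ways to address the library warning quoted above are exporting the path in the job script or letting mpirun forward it to the remote nodes via --prefix; in this thread, though, the cause turned out to be the mixed builtin/ssh configuration rather than missing libraries. A sketch, using the install path from the configure line above:

    # export the Open MPI library path in the job script
    export LD_LIBRARY_PATH=/share/apps/openmpi/gcc/lib:$LD_LIBRARY_PATH

    # or let mpirun set PATH/LD_LIBRARY_PATH on the remote nodes itself
    /share/apps/openmpi/gcc/bin/mpirun --prefix /share/apps/openmpi/gcc ./mpihello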
>>>>> 
>>>>> Any suggestions?
>>>>> 
>>>>> Thanks,
>>>>> Bart
>>>>> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=290776

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list