[GE users] MPI woes

reuti reuti at staff.uni-marburg.de
Thu Nov 4 19:18:20 GMT 2010


Hi,

Am 04.11.2010 um 16:50 schrieb heine:

> Reuti,
> 
> Most of your assumptions are correct. See the configs below. I am (preferably) trying to get tight integration, but as I mentioned it does not work for openmpi or mpich2. I have tried 'start_proc_args /bin/true' and 'stop_proc_args /bin/true' too.

yes, this is sufficient. For MPICH2 you will need version 1.3 though; there is nothing special to pass to "./configure", unlike Open MPI (which needs --with-sge).
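
For example, a minimal sketch of both builds (the --prefix paths are placeholders):

  # Open MPI: SGE support must be requested at configure time
  ./configure --prefix=/opt/openmpi --with-sge
  make && make install

  # MPICH2 >= 1.3 (Hydra process manager): no special flag needed
  ./configure --prefix=/opt/mpich2
  make && make install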


> Snippet from ompi_info
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)

Fine.


> pe_name           openmpi_rr
> slots             168
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /sge/mpi/startmpi.sh -catch_hostname -catch_rsh -unique \
>                  $pe_hostfile
> stop_proc_args    /sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> 
> 
> global:
> execd_spool_dir              /var/spool/sge
> mailer                       /usr/bin/mailx
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning

loglevel log_info

often gives helpful information.
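
A sketch to enable it and watch the effect (the spool path is taken from your configuration below; <node> is a placeholder for a compute node's hostname):

  qconf -mconf    # edit the global configuration and set: loglevel log_info
  tail -f /var/spool/sge/<node>/messages    # follow the execd log on that node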

> administrator_mail           heine at sun.ac.za
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=true \
>                             flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-21000
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> max_advance_reservations     0
> qlogin_command               /usr/local/bin/qlogin_wrapper
> qlogin_daemon                /usr/sbin/sshd -i

what about the entries:

rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

They seem to be missing. Unless you need X11 forwarding, I would suggest staying with "builtin" for `qlogin` too.
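
Once set, you can exercise the methods directly, e.g.:

  qrsh hostname    # uses rsh_command/rsh_daemon
  qlogin           # uses qlogin_command/qlogin_daemon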

Are there local configurations defined for some nodes (`qconf -sconfl`)? In a cluster where all machines run the same OS these can be removed; the feature was introduced to allow different paths to applications on the various OSes in a heterogeneous cluster.
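
A quick check (comp009 is just an example hostname):

  qconf -sconfl           # list nodes having a local configuration
  qconf -sconf comp009    # show that node's local configuration
  qconf -dconf comp009    # remove it, so the global settings apply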

Some details about "*_command / *_daemon" settings:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=284728

To cite myself: in short, the selected methods on the issuing and the target machine must match, either by using the default or a node-specific configuration. I.e., the "rlogin_command" of issuing machine A (local configuration of A, or global) must match the "rlogin_daemon" of target machine B (local configuration of B, or global).
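
For example (hostnames hypothetical): if comp001 has a local configuration with "rsh_command /usr/bin/ssh" while comp002 uses the global "rsh_daemon builtin", a `qrsh -inherit` from comp001 to comp002 will fail, as the ssh client can't speak to the builtin daemon.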

-- Reuti


> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 false
> jsv_url                      none
> libjvm_path                  /usr/lib64/gcj-4.3-9/libjvm.so
> additional_jvm_args          -Xmx256m
> ________________________________________
> From: reuti [reuti at staff.uni-marburg.de]
> Sent: Thursday, November 04, 2010 4:39 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] MPI woes
> 
> Hi,
> 
> Am 04.11.2010 um 14:07 schrieb heine:
> 
>> Whenever I try to submit a job that would involve more than one node, using allocation_rule $round_robin in a PE for example, I get the following error.
>> 
>> error: commlib error: got read error (closing "comp009/execd/1")
>> error: commlib error: got read error (closing "comp010/execd/1")
>> error: executing task of job 62162 failed: failed sending task to execd at comp009: can't find connection
>> error: executing task of job 62162 failed: failed sending task to execd at comp010: can't find connection
>> 
>> I have tried with openmpi and mpich2, different versions even, and the results remain the same. If I use a submit script to just copy $TMPDIR/machines, submit the job and then use the machines file with the following example, it works 100%: 'mpiexec -np 3 --machinefile machines hostname'. But adding the exact same line back into the submit script generates the above error once more.
> 
> You mean that in your test you reuse the machinefile generated by the last run, but then use it on the headnode of the cluster? Then the communication will be different, as in your jobscript the `mpiexec ...` will be executed on one of your slave nodes instead.
> 
> Anyway: what is the complete definition of your PE (qconf -sp ...) and SGE's configuration (qconf -sconf)? Are you trying to achieve a tight integration, and did you compile Open MPI with --with-sge?
> 
> When you have a tight integration, the parallel library won't use a direct rsh/ssh, but will route the startup through `qrsh -inherit ...`, which seems to fail here.
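> 
> You can mimic this by hand from the master node of a running parallel job, a sketch (comp009 taken from your error messages):
> 
>   qrsh -inherit comp009 hostname
> 
> If this already fails in the same way, the problem is in the interactive command/daemon settings, not in the MPI library.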
> 
> 
> -- Reuti
> 
> 
>> Any help would be appreciated.
>> 
>> Heine
> 
