[GE users] MPI woes

heine heine at sun.ac.za
Thu Nov 4 15:50:05 GMT 2010


Most of your assumptions are correct. See the configs below. I am (preferably) trying to get tight integration, but as I mentioned it does not work for either openmpi or mpich2. I have also tried 'start_proc_args /bin/true' and 'stop_proc_args /bin/true'.

Snippet from ompi_info
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)
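
(For reference, the line above comes from a check along these lines on the headnode; I am assuming the build on the compute nodes is identical:)

    ompi_info | grep gridengine
    # expected for an Open MPI built with --with-sge:
    #   MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)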

pe_name           openmpi_rr
slots             168
user_lists        NONE
xuser_lists       NONE
start_proc_args   /sge/mpi/startmpi.sh -catch_hostname -catch_rsh -unique \
stop_proc_args    /sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

execd_spool_dir              /var/spool/sge
mailer                       /usr/bin/mailx
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           heine at sun.ac.za
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    20000-21000
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
qlogin_command               /usr/local/bin/qlogin_wrapper
qlogin_daemon                /usr/sbin/sshd -i
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      none
libjvm_path                  /usr/lib64/gcj-4.3-9/libjvm.so
additional_jvm_args          -Xmx256m
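
For completeness, the submit script boils down to something like the following (the job name and binary path are placeholders for the real ones; under tight integration no machinefile is passed to mpiexec):

    #!/bin/sh
    #$ -N mpi_test                # placeholder job name
    #$ -cwd
    #$ -pe openmpi_rr 8           # request 8 slots from the PE above
    #$ -V

    # with tight integration Open MPI takes the host list from SGE itself,
    # so only the slot count is given
    mpiexec -np $NSLOTS /path/to/mpi_program
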
From: reuti [reuti at staff.uni-marburg.de]
Sent: Thursday, November 04, 2010 4:39 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] MPI woes


On 04.11.2010, at 14:07, heine wrote:

> Whenever I try to submit a job that involves more than one node (for example with allocation_rule $round_robin in a PE), I get the following errors.
> error: commlib error: got read error (closing "comp009/execd/1")
> error: commlib error: got read error (closing "comp010/execd/1")
> error: executing task of job 62162 failed: failed sending task to execd at comp009: can't find connection
> error: executing task of job 62162 failed: failed sending task to execd at comp010: can't find connection
> I have tried with openmpi and mpich2, even different versions, and the results remain the same. If I use a submit script to just copy $TMPDIR/machines, submit the job, and then run 'mpiexec -np 3 --machinefile machines hostname' by hand with that machines file, it works 100%. But adding the exact same line back into the submit script generates the above error once more.

You mean that in your test you reuse the machinefile generated by the last run, but then run mpiexec on the headnode of the cluster? Then the communication will be different, as in your jobscript the `mpiexec ...` will itself be executed on one of your slave nodes.

Anyway: what is the complete definition of your PE (qconf -sp ...) and SGE's configuration (qconf -sconf)? Are you trying to achieve a tight integration, and did you compile Open MPI with --with-sge?

When you have a tight integration, the parallel library won't use a direct rsh/ssh, but will route the process startup through `qrsh -inherit ...`, which seems to be what fails here.
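
You can test this call on its own, independent of any MPI library: inside a small job that requests the PE, issue the `qrsh -inherit` by hand against each granted host. A sketch (the host names are read from $PE_HOSTFILE):

    #!/bin/sh
    #$ -pe openmpi_rr 4
    #$ -cwd

    # this is the call the MPI library issues internally under tight integration
    for HOST in `cut -d" " -f1 $PE_HOSTFILE`; do
        qrsh -inherit $HOST hostname
    done

If these qrsh calls already produce the commlib errors, the problem lies in the SGE setup itself rather than in the MPI libraries.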

-- Reuti

> Any help would be appreciated.
> Heine


To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


