[GE users] MPI problems persist

reuti reuti at staff.uni-marburg.de
Tue Nov 9 15:50:25 GMT 2010



Hi,

On 09.11.2010 at 11:04, heine <heine at sun.ac.za> wrote:

Good day all,

Reuti was correct to point me to the binaries. I had somehow extracted the wrong binaries before; I now have the 'real' 6.2u5 binaries in place, with the following configured:

qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

but I still receive the following error when using 'allocation_rule    $round_robin':

[comp020:20021] ras:gridengine: JOB_ID: 62034
[comp020:20021] ras:gridengine: PE_HOSTFILE: /var/spool/sge/comp020/active_jobs/62034.1/pe_hostfile
[comp020:20021] ras:gridengine: comp020: PE_HOSTFILE shows slots=1
[comp020:20021] ras:gridengine: comp019: PE_HOSTFILE shows slots=1
[comp020:20021] ras:gridengine: comp017: PE_HOSTFILE shows slots=1
error: commlib error: got read error (closing "comp019/execd/1")
error: executing task of job 62034 failed: failed sending task to execd@comp019: can't find connection
error: commlib error: got read error (closing "comp017/execd/1")
error: executing task of job 62034 failed: failed sending task to execd@comp017: can't find connection
--------------------------------------------------------------------------
A daemon (pid 20025) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the

As the message above says, the daemon can't find the shared libraries on the remote nodes. Often the library path is set in ~/.bashrc, like:

export LD_LIBRARY_PATH=/your/path/to/libs${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
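
The ${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} part of the line above appends the old value with a leading colon only when the variable is already set and non-empty, so no stray trailing colon ends up in the path. A minimal sketch of that behaviour follows; the helper name prepend_lib_path and the directory /opt/openmpi/lib are made up for illustration and are not from the original setup:

```shell
# Hypothetical library directory, for illustration only.
libdir=/opt/openmpi/lib

# Hypothetical helper: print the path that results from prepending a directory.
# $1: directory to prepend
# $2: current value of LD_LIBRARY_PATH (may be empty)
prepend_lib_path() {
    # ${2:+:$2} expands to ":<old value>" only if $2 is set and non-empty,
    # and to nothing otherwise.
    printf '%s\n' "$1${2:+:$2}"
}

prepend_lib_path "$libdir" ""            # prints /opt/openmpi/lib
prepend_lib_path "$libdir" "/usr/lib64"  # prints /opt/openmpi/lib:/usr/lib64
```

Avoiding the empty entry matters because the dynamic linker treats an empty element in LD_LIBRARY_PATH as the current directory.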

As an alternative, you can build a static version of Open MPI by passing these options to ./configure:

./configure --enable-static --disable-shared

-- Reuti

location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        comp019 - daemon did not report back when launched
        comp017 - daemon did not report back when launched

Thanks
Heine




