[GE users] LAM & SGE

Orion Poplawski orion at cora.nwra.com
Fri Aug 27 17:41:57 BST 2004

> The qrsh-lam script is meant for what is often called a tight
> integration where the batch system has full control over the processes
> and also has accounting and resource usage info on the job.
> When rsh or ssh is used for the job launch SGE is not fully aware of the
> processes since they will not be child processes of the shepherd procs.
> For the PE below you change control_slaves to FALSE.
> Attached is the latest sge-lam script which should work with SGE5.3p5 or
> later patch level and LAM 7.0.6 or later. I'll try to get a web page
> with it up onto the open source site. I have not had time to test it
> with SGE 6.0 but don't know of any reason it shouldn't work.
> Still need to get this working with ssh to get to the remote nodes which
> should fix the qrsh problem Bogdan is referencing. The problem is a
> limitation where this scheme fails for parallel jobs with only 1 process
> per host. If you have dual CPU boxes or fatter the problem should not
> occur. I'm having some trouble reproducing the problem on my small
> cluster so if you have a larger cluster of single procs and try this and
> it works let me know.
> If you try this with SGE 6.0 and have problems or it works let me know.
> --
>   ___________________________________________________________
> | Christopher Duncan    Sun HPC Support Engineer [PTS-AMER] |
> | Email: christopher.duncan at sun.com    Tel: +1 781-442-2309 |
> |                   -=Carpe Noctem=-                        |
> |___________________________________________________________|
> ---------------------------------------------------------------------

I've had no luck with this under SGE 6.0.  I end up with a runaway
qrsh process using 100% CPU; the job fails and puts the queue in an
error state.  Both SGE and LAM are configured to use ssh.

Some info:

qrsh strace:
select(1, [0], [0], NULL, {1, 0})       = 2 (in [0], out [0], left {1, 0})
gettimeofday({1093624246, 908925}, NULL) = 0
gettimeofday({1093624246, 908990}, NULL) = 0
gettimeofday({1093624246, 909054}, NULL) = 0
gettimeofday({1093624246, 909119}, NULL) = 0
select(1, [0], [0], NULL, {1, 0})       = 2 (in [0], out [0], left {1, 0})
gettimeofday({1093624246, 909516}, NULL) = 0
gettimeofday({1093624246, 909582}, NULL) = 0
gettimeofday({1093624246, 909646}, NULL) = 0
gettimeofday({1093624246, 909711}, NULL) = 0
select(1, [0], [0], NULL, {1, 0}{1093624246, 915170}, NULL) = 0
.... and so on.
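
One reading of the trace above: every select() call comes back at once with descriptor 0 reported ready and the full one-second timeout still left, so the loop never sleeps and burns CPU. A minimal Python sketch of that effect follows. It assumes the descriptor has reached end-of-file, which the trace does not actually show; the pipe is only a stand-in for qrsh's stdin, and `count_immediate_wakeups` is a hypothetical name.

```python
import os
import select
import time

def count_immediate_wakeups(fd, rounds=5, timeout=1.0):
    """Call select() `rounds` times on fd and count calls that return at once."""
    immediate = 0
    for _ in range(rounds):
        start = time.monotonic()
        readable, _, _ = select.select([fd], [], [], timeout)
        elapsed = time.monotonic() - start
        # A descriptor at EOF is always reported readable, so select() never waits.
        if readable and elapsed < timeout / 2:
            immediate += 1
    return immediate

# Stand-in for stdin at EOF (assumption): a pipe whose write end is closed.
read_fd, write_fd = os.pipe()
os.close(write_fd)
wakeups = count_immediate_wakeups(read_fd)
os.close(read_fd)
print(wakeups)
```

If this is what is happening, a loop that does not drop the descriptor from the select set after seeing EOF would spin exactly as the strace shows.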

SGE-LAM DEBUG: SGE_ROOT = /opt/local/sge-6.0
SGE-LAM DEBUG: qrsh = /opt/local/sge-6.0/bin/lx24-x86/qrsh
SGE-LAM DEBUG: func=start
SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent
qrsh-remote -c sge-lam-conf.lamd -v -d /tmp/38136.1.cynosure.q/lamhostfile
n-1<11478> ssi:boot: Opening
n-1<11478> ssi:boot: looking for module named rsh
n-1<11478> ssi:boot: opening module rsh
n-1<11478> ssi:boot: initializing module rsh
n-1<11478> ssi:boot:rsh: module initializing
n-1<11478> ssi:boot:rsh:agent: /usr/bin/sge-lam
n-1<11478> ssi:boot:rsh:username: <same>
n-1<11478> ssi:boot:rsh:verbose: 1000
n-1<11478> ssi:boot:rsh:algorithm: linear
n-1<11478> ssi:boot:rsh:priority: 10
n-1<11478> ssi:boot: Selected boot module rsh
n-1<11478> ssi:boot:base: looking for boot schema in following directories:
n-1<11478> ssi:boot:base:   <current directory>
n-1<11478> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<11478> ssi:boot:base:   $LAMHOME/etc
n-1<11478> ssi:boot:base:   /etc/lam
n-1<11478> ssi:boot:base: looking for boot schema file:
n-1<11478> ssi:boot:base:   /tmp/38136.1.cynosure.q/lamhostfile
n-1<11478> ssi:boot:base: found boot schema:
n-1<11478> ssi:boot:rsh: found the following hosts:
n-1<11478> ssi:boot:rsh:   n0 cynosure.cora.nwra.com (cpu=2)
n-1<11478> ssi:boot:rsh: resolved hosts:
n-1<11478> ssi:boot:rsh:   n0 cynosure.cora.nwra.com -->
n-1<11478> ssi:boot:rsh: starting RTE procs
n-1<11478> ssi:boot:base:linear: starting
n-1<11478> ssi:boot:base:server: opening server TCP socket
n-1<11478> ssi:boot:base:server: opened port 58387
n-1<11478> ssi:boot:base:linear: booting n0 (cynosure.cora.nwra.com)
n-1<11478> ssi:boot:rsh: starting lamd on (cynosure.cora.nwra.com)
n-1<11478> ssi:boot:rsh: starting on n0 (cynosure.cora.nwra.com): hboot -t
-c sge-lam-conf.lamd -d -v -sessionsuffix sge-38136-undefined -I -H -P 58387 -n 0 -o 0
n-1<11478> ssi:boot:rsh: launching locally
n-1<11478> ssi:boot:rsh: successfully launched on n0 (cynosure.cora.nwra.com)
n-1<11478> ssi:boot:base:server: expecting connection from finite list
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back.  Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line.  For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error.  If
   you get any other kind of error, it could indicate either of the
   two conditions above.  Consult with your system/network
n-1<11478> ssi:boot:base:server: failed to connect to remote lamd!
n-1<11478> ssi:boot:base:server: closing server socket
n-1<11478> ssi:boot:base:linear: aborted!
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
lamboot did NOT complete successfully
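
The telnet check suggested in the lamboot message above can also be scripted. This is a rough sketch, not anything shipped with LAM or SGE: `probe_port` is a hypothetical helper, and the host and port would come from the hboot command line in the log (for example cynosure.cora.nwra.com and 58387 here).

```python
import socket

def probe_port(host, port, timeout=3.0):
    """Try a TCP connect and classify the result, like the telnet check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"      # host reachable, nothing listening on the port
    except socket.timeout:
        return "timed out"    # no answer at all, possibly filtered
    except OSError as exc:
        return f"error: {exc}"
```

Going by the lamboot text, "refused" is the healthy result once lamboot has given up and closed its socket; "timed out" or a routing error would point at one of the two conditions it lists.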

SGE-LAM DEBUG: SGE_ROOT = /opt/local/sge-6.0
SGE-LAM DEBUG: qrsh = /opt/local/sge-6.0/bin/lx24-x86/qrsh
SGE-LAM DEBUG: ARGV = "qrsh-local" "/usr/bin/lamd" "-H" ""
"-P" "58387" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-38136-undefined"
SGE-LAM DEBUG: func=qrsh-local
SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V
cynosure.colorado-research.com /usr/bin/lamd -H -P 58387 -n
0 -o 0 -d -sessionsuffix sge-38136-undefined
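
Note that in the ARGV line above the value following "-H" is an empty string. If -H is the callback address that the lamboot help text refers to (an assumption; this post does not document the option), lamd would have nowhere to call back to. A small sketch that flags empty option values in such an argument list; `empty_option_values` is a hypothetical helper and not part of sge-lam.

```python
def empty_option_values(argv):
    """Return option flags that are immediately followed by an empty string."""
    flagged = []
    for flag, value in zip(argv, argv[1:]):
        if flag.startswith("-") and value == "":
            flagged.append(flag)
    return flagged

# Argument list copied from the SGE-LAM DEBUG output above.
argv = ["qrsh-local", "/usr/bin/lamd", "-H", "", "-P", "58387",
        "-n", "0", "-o", "0", "-d", "-sessionsuffix", "sge-38136-undefined"]
flags = empty_option_values(argv)
print(flags)
```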

SGE-LAM DEBUG: SGE_ROOT = /opt/local/sge-6.0
SGE-LAM DEBUG: qrsh = /opt/local/sge-6.0/bin/lx24-x86/qrsh
SGE-LAM DEBUG: func=stop
It seems that there is no lamd running on the host

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.

Starting SGE + LAM Integration
         using tight integration scheme
tkill: setting prefix to (null)
tkill: setting suffix to sge-38136-undefined
tkill: got killname back:
/tmp/38136.1.cynosure.q/lam-orion@cynosure.colorado-research.com-sge-38136-undefined/lam-killfile
tkill: removing socket file ...
tkill: socket file:
/tmp/38136.1.cynosure.q/lam-orion@cynosure.colorado-research.com-sge-38136-undefined/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/38136.1.cynosure.q/lam-orion@cynosure.colorado-research.com-sge-38136-undefined/lam-io-socket
tkill: f_kill =
"/tmp/38136.1.cynosure.q/lam-orion@cynosure.colorado-research.com-sge-38136-undefined/lam-killfile"
tkill: nothing to kill:
"/tmp/38136.1.cynosure.q/lam-orion@cynosure.colorado-research.com-sge-38136-undefined/lam-killfile"
hboot: performing tkill
hboot: tkill -sessionsuffix sge-38136-undefined -d
hboot: booting...
hboot: fork /opt/local/sge-6.0/mpi/sge-lam
[1]  11488 sge-lam qrsh-local /usr/bin/lamd -H -P 58387 -n 0
-o 0 -d -sessionsuffix sge-38136-undefined

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

lamboot: wipe -- nothing to do
Stoping SGE + LAM Integration

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
