[GE users] Loose Integration LAM using ssh Sun Grid engine

j_reichel at freenet.de
Tue Mar 28 15:33:55 BST 2006


Hello,

I am trying to integrate LAM into SGE 6.0, but it does not work correctly.
I have a startlam script and have added a new parallel environment (PE) to SGE.
After submitting the job, however, there is no result.
I think the problem is with the lamboot command.
I started it with the -d option to see what happens.
In the logfile I can see that the lamd daemon is started on all nodes of the cluster.

At the very end of the logfile, however, there is a message saying that no lamd is running on the head node.

Do you have any idea?

Here are the logfile, the startlam script, and the PE definition from SGE:

logfile:

n-1<10054> ssi:boot:open: opening
n-1<10054> ssi:boot:open: looking for boot module named rsh
n-1<10054> ssi:boot:open: opening boot module rsh
n-1<10054> ssi:boot:open: opened boot module rsh
n-1<10054> ssi:boot:select: initializing boot module rsh
n-1<10054> ssi:boot:rsh: module initializing
n-1<10054> ssi:boot:rsh:agent: ssh -x
n-1<10054> ssi:boot:rsh:username: <same>
n-1<10054> ssi:boot:rsh:verbose: 1000
n-1<10054> ssi:boot:rsh:algorithm: linear
n-1<10054> ssi:boot:rsh:no_n: 0
n-1<10054> ssi:boot:rsh:no_profile: 0
n-1<10054> ssi:boot:rsh:fast: 0
n-1<10054> ssi:boot:rsh:ignore_stderr: 0
n-1<10054> ssi:boot:rsh:priority: 10
n-1<10054> ssi:boot:select: boot module available: rsh, priority: 10
n-1<10054> ssi:boot:select: selected boot module rsh
n-1<10054> ssi:boot:base: looking for boot schema in following directories:
n-1<10054> ssi:boot:base:   <current directory>
n-1<10054> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<10054> ssi:boot:base:   $LAMHOME/etc
n-1<10054> ssi:boot:base:   /usr/lib/lam/etc
n-1<10054> ssi:boot:base: looking for boot schema file:
n-1<10054> ssi:boot:base:   /tmp/78.1.all.q/machines
n-1<10054> ssi:boot:base: found boot schema: /tmp/78.1.all.q/machines
n-1<10054> ssi:boot:rsh: found the following hosts:
n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n2 ppc203 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n3 ppc205 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n4 ppc228 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n5 ppc208 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n6 ppc206 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n7 ppc229 (cpu=1) 
n-1<10054> ssi:boot:rsh:   n8 ppc231 (cpu=1) 
n-1<10054> ssi:boot:rsh: resolved hosts:
n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
n-1<10054> ssi:boot:rsh:   n2 ppc203 --> 141.35.13.103
n-1<10054> ssi:boot:rsh:   n3 ppc205 --> 141.35.13.105
n-1<10054> ssi:boot:rsh:   n4 ppc228 --> 141.35.13.119
n-1<10054> ssi:boot:rsh:   n5 ppc208 --> 141.35.13.108
n-1<10054> ssi:boot:rsh:   n6 ppc206 --> 141.35.13.106
n-1<10054> ssi:boot:rsh:   n7 ppc229 --> 141.35.13.120
n-1<10054> ssi:boot:rsh:   n8 ppc231 --> 141.35.13.122
n-1<10054> ssi:boot:rsh: starting RTE procs
n-1<10054> ssi:boot:base:linear: starting
n-1<10054> ssi:boot:base:server: opening server TCP socket
n-1<10054> ssi:boot:base:server: opened port 32789
n-1<10054> ssi:boot:base:linear: booting n0 (ppc207)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc207)
n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
n-1<10054> ssi:boot:rsh: launching locally
n-1<10057> ssi:boot:open: opening
n-1<10057> ssi:boot:open: looking for boot module named rsh
n-1<10057> ssi:boot:open: opening boot module rsh
n-1<10057> ssi:boot:open: opened boot module rsh
n-1<10057> ssi:boot:select: initializing boot module rsh
n-1<10057> ssi:boot:rsh: module initializing
n-1<10057> ssi:boot:rsh:agent: ssh -x
n-1<10057> ssi:boot:rsh:username: <same>
n-1<10057> ssi:boot:rsh:verbose: 1000
n-1<10057> ssi:boot:rsh:algorithm: linear
n-1<10057> ssi:boot:rsh:no_n: 0
n-1<10057> ssi:boot:rsh:no_profile: 0
n-1<10057> ssi:boot:rsh:fast: 0
n-1<10057> ssi:boot:rsh:ignore_stderr: 0
n-1<10057> ssi:boot:rsh:priority: 10
n-1<10057> ssi:boot:select: boot module available: rsh, priority: 10
n-1<10057> ssi:boot:select: selected boot module rsh
n-1<10057> ssi:boot:send_lamd: getting node ID from command line
n-1<10057> ssi:boot:send_lamd: getting agent haddr from command line
n-1<10057> ssi:boot:send_lamd: getting agent port from command line
n-1<10057> ssi:boot:send_lamd: getting node ID from command line
n-1<10057> ssi:boot:send_lamd: connecting to 141.35.13.107:32789, node id 0
n-1<10057> ssi:boot:send_lamd: sending dli_port 32811
n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.107
n-1<10054> ssi:boot:base:server: this connection is expected (n0)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.107:32811
n-1<10054> ssi:boot:base:linear: booting n1 (ppc211)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc211)
n-1<10054> ssi:boot:rsh: starting on n1 (ppc211): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 1 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 1 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.111
n-1<10054> ssi:boot:base:server: this connection is expected (n1)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.111:32803
n-1<10054> ssi:boot:base:linear: booting n2 (ppc203)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc203)
n-1<10054> ssi:boot:rsh: starting on n2 (ppc203): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 2 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 2 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n2 (ppc203)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.103
n-1<10054> ssi:boot:base:server: this connection is expected (n2)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.103:32840
n-1<10054> ssi:boot:base:linear: booting n3 (ppc205)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc205)
n-1<10054> ssi:boot:rsh: starting on n3 (ppc205): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 3 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 3 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n3 (ppc205)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.105
n-1<10054> ssi:boot:base:server: this connection is expected (n3)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.105:32812
n-1<10054> ssi:boot:base:linear: booting n4 (ppc228)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc228)
n-1<10054> ssi:boot:rsh: starting on n4 (ppc228): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 4 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 4 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n4 (ppc228)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.119
n-1<10054> ssi:boot:base:server: this connection is expected (n4)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.119:32806
n-1<10054> ssi:boot:base:linear: booting n5 (ppc208)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc208)
n-1<10054> ssi:boot:rsh: starting on n5 (ppc208): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 5 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 5 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n5 (ppc208)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.108
n-1<10054> ssi:boot:base:server: this connection is expected (n5)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.108:32821
n-1<10054> ssi:boot:base:linear: booting n6 (ppc206)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc206)
n-1<10054> ssi:boot:rsh: starting on n6 (ppc206): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 6 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106
n-1<10054> ssi:boot:base:server: this connection is expected (n6)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807
n-1<10054> ssi:boot:base:linear: booting n7 (ppc229)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc229)
n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120
n-1<10054> ssi:boot:base:server: this connection is expected (n7)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798
n-1<10054> ssi:boot:base:linear: booting n8 (ppc231)
n-1<10054> ssi:boot:rsh: starting lamd on (ppc231)
n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0"
n-1<10054> ssi:boot:rsh: launching remotely
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL'
n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 8 -o 0"'
n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231)
n-1<10054> ssi:boot:base:server: expecting connection from finite list
n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122
n-1<10054> ssi:boot:base:server: this connection is expected (n8)
n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876
n-1<10054> ssi:boot:base:server: closing server socket
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795
n-1<10057> ssi:boot:rsh: finalizing
n-1<10057> ssi:boot: Closing
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488
n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713
n-1<10054> ssi:boot:base:server: connected
n-1<10054> ssi:boot:base:server: sending number of links (9)
n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
n-1<10054> ssi:boot:base:server: finished sending
n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713
n-1<10054> ssi:boot:base:linear: finished
n-1<10054> ssi:boot:rsh: all RTE procs started
n-1<10054> ssi:boot:rsh: finalizing
n-1<10054> ssi:boot: Closing
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host ppc207.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
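
To compare what lamboot reported with the closing message, the log can be summarized mechanically. The following is only a minimal sketch, assuming the output of "lamboot -d" was saved to a file. The function name summarize_lamboot_log and the sample excerpt are illustrative and not part of LAM or SGE. It counts the hosts listed in the boot schema, counts the "successfully launched" lines, and extracts the host named in the "no lamd running" message.

```shell
#!/bin/sh
# Sketch only: summarize a saved "lamboot -d" log.
# The function name and the sample file below are illustrative assumptions.

summarize_lamboot_log()
{
   log=$1
   # hosts listed in the boot schema, e.g. "n3 ppc205 (cpu=1)"
   expected=`grep -c 'ssi:boot:rsh:   n[0-9]* .*(cpu=' "$log"`
   # hosts for which lamboot reported a successful launch
   launched=`grep -c 'successfully launched on n' "$log"`
   echo "hosts in boot schema: $expected"
   echo "launched: $launched"
   # host named in the closing "no lamd running" message, if any
   missing=`sed -n 's/.*no lamd running on the host \([^ ]*\)\.$/\1/p' "$log"`
   if [ -n "$missing" ]; then
      echo "no lamd reported on: $missing"
   fi
}

# small excerpt in the format of the log above
sample=`mktemp`
cat > "$sample" <<'EOF'
n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1)
n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1)
n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
It seems that there is no lamd running on the host ppc207.
EOF
summarize_lamboot_log "$sample"
rm -f "$sample"
```

In the full log above, all nine hosts are reported as launched, and the closing message names the "lamhalt" command rather than lamboot.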


startlam script:

#!/bin/sh
#
#
# (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.  

#
# preparation of the mpi machine file
#
# usage: startlam.sh [options] <pe_hostfile>
#
#        options are: 
#                    -catch_hostname 
#                     force use of hostname wrapper in $TMPDIR when starting mpirun
#                    -catch_rsh
#                     force use of rsh wrapper in $TMPDIR when starting mpirun   
#                    -unique
#                     generate a machinefile where each hostname appears only once
#                     This is needed to setup a multithreaded mpi application
#

PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      # echo $line
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      i=1
      while [ $i -le $nslots ]; do
         # add here code to map regular hostnames into ATM hostnames
         echo $host
         i=`expr $i + 1`
      done
   done
}


#
# startup of LAM conforming with the Grid Engine
# Parallel Environment interface
#
# on success the job will find a machine-file in $TMPDIR/machines
# 

# useful to control parameters passed to us  
echo $*

# parse options
catch_rsh=0
catch_hostname=0
unique=0
while [ "$1" != "" ]; do
   case "$1" in
      -catch_rsh)
         catch_rsh=1
         ;;
      -catch_hostname)
         catch_hostname=1
         ;;
      -unique)
         unique=1
         ;;
      *)
         break;
         ;;
   esac
   shift
done

me=`basename $0`

# test number of args
if [ $# -ne 1 ]; then
   echo "$me: got wrong number of arguments" >&2
   exit 1
fi

# get arguments
pe_hostfile=$1

# ensure pe_hostfile is readable
if [ ! -r $pe_hostfile ]; then
   echo "$me: can't read $pe_hostfile" >&2
   exit 1
fi

# create machine-file
# remove column with number of slots per queue
# mpi does not support them in this form
machines="$TMPDIR/machines"

if [ $unique = 1 ]; then
   PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
else
   PeHostfile2MachineFile $pe_hostfile >> $machines
fi

# trace machines file
cat $machines

#
# Make script wrapper for 'rsh' available in jobs tmp dir
#
if [ $catch_rsh = 1 ]; then
   rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh
   if [ ! -x $rsh_wrapper ]; then
      echo "$me: can't execute $rsh_wrapper" >&2
      echo "     maybe it resides at a file system not available at this machine" >&2
      exit 1
   fi

   rshcmd=rsh
   case "$ARC" in
      hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
      *) ;;
   esac
   # note: This could also be done using rcp, ftp or s.th.
   #       else. We use a symbolic link since it is the
   #       cheapest in case of a shared filesystem
   #
   ln -s $rsh_wrapper $TMPDIR/$rshcmd
fi

#
# Make script wrapper for 'hostname' available in jobs tmp dir
#
if [ $catch_hostname = 1 ]; then
   hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname
   if [ ! -x $hostname_wrapper ]; then
      echo "$me: can't execute $hostname_wrapper" >&2
      echo "     maybe it resides at a file system not available at this machine" >&2
      exit 1
   fi

   # note: This could also be done using rcp, ftp or s.th.
   #       else. We use a symbolic link since it is the
   #       cheapest in case of a shared filesystem
   #
   ln -s $hostname_wrapper $TMPDIR/hostname
fi

#
# Extra LAM statement(s)
#
#if [ -z "`which lamboot 2>/dev/null`" ] ; then
#    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
#fi
#lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines
lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines
echo "lamboot finished"
# signal success to caller
exit 0
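
The script above exits with 0 regardless of what lamboot returned. A minimal sketch of passing a failure on to the caller follows, under the assumption (not verified for this LAM version) that lamboot exits non-zero when booting fails. The helper name boot_or_fail is hypothetical, and true/false stand in for the real lamboot call so the sketch runs without a LAM installation.

```shell
#!/bin/sh
# Sketch only: report failure of the boot step to the caller instead of
# always exiting 0. "boot_or_fail" is a hypothetical helper name.

boot_or_fail()
{
   # run the command given as arguments and keep its exit status
   "$@"
   status=$?
   if [ $status -ne 0 ]; then
      echo "boot command failed with status $status: $*" >&2
      return $status
   fi
   echo "boot command succeeded: $*"
   return 0
}

# In startlam.sh the call could look like this (left commented out here,
# since it needs a LAM installation and $machines):
# boot_or_fail lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" "$machines" || exit 1

# stand-in commands to show the behaviour
boot_or_fail true
boot_or_fail false || echo "caller sees failure"
```

With this pattern the start procedure would stop with a non-zero status when the boot step fails, instead of printing "lamboot finished" in every case.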




PE in SGE:

pe_name           lam7
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile
stop_proc_args    /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
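
Since start_proc_args passes -unique, each host should appear only once in $TMPDIR/machines, which matches the "(cpu=1)" entries in the log above. The following is a minimal sketch of that conversion, using the same logic as PeHostfile2MachineFile in the startlam script. The hostfile content, including the example.org domain and the slot counts, is made up for illustration.

```shell
#!/bin/sh
# Sketch only: what "startlam.sh -unique $pe_hostfile" writes to the
# machines file. The hostfile content below is made up for illustration.

# same conversion logic as in the startlam script
PeHostfile2MachineFile()
{
   while read line; do
      # first column is the host name; strip the domain part
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      # second column is the number of slots on that host
      nslots=`echo $line|cut -f2 -d" "`
      i=1
      while [ $i -le $nslots ]; do
         echo $host
         i=`expr $i + 1`
      done
   done < "$1"
}

pe_hostfile=`mktemp`
cat > "$pe_hostfile" <<'EOF'
ppc207.example.org 2 all.q@ppc207.example.org UNDEFINED
ppc211.example.org 1 all.q@ppc211.example.org UNDEFINED
EOF

echo "without -unique:"
PeHostfile2MachineFile "$pe_hostfile"
echo "with -unique:"
PeHostfile2MachineFile "$pe_hostfile" | uniq
rm -f "$pe_hostfile"
```

Without -unique a host with two slots is listed twice. With -unique the repeated entry is collapsed by uniq, which only removes adjacent duplicates.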

Regards

Joerg
