[GE users] Loose Integration LAM using ssh Sun Grid engine

Reuti reuti at staff.uni-marburg.de
Tue Mar 28 21:17:46 BST 2006


Hi Joerg,

there is no need to send this posting additionally by PM. The only
advantage in this case was the LAM/MPI output, which I got only in the
PM: signal 4 (SIGILL) was raised. It seems that your application
crashed and took the LAM universe down with it, so lamhalt cannot work.

So the question is: what causes this? Which version of LAM/MPI are
you using, on which type of machines, and with which Linux distribution
and version? Is it a self-compiled LAM/MPI or the one included in the
distribution? Was the application compiled with the mpicc/mpif77 that
belongs to the LAM/MPI version currently installed on the system?
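
A rough way to check the last point from inside a job script is to see
whether all the tools are picked up from the same directory. This is
only a sketch: the function name tools_share_dir is made up, and it
assumes that everything is found via PATH.

```shell
#!/bin/sh
# Hypothetical helper: print where each tool is found in PATH and
# return non-zero if they do not all live in the same directory.
tools_share_dir() {
    first_dir=""
    for tool in "$@"; do
        path=$(command -v "$tool") || {
            echo "$tool: not found in PATH" >&2
            return 2
        }
        dir=$(dirname "$path")
        echo "$tool -> $dir"
        if [ -z "$first_dir" ]; then
            first_dir=$dir
        elif [ "$dir" != "$first_dir" ]; then
            echo "warning: $tool comes from $dir, not $first_dir" >&2
            return 1
        fi
    done
    return 0
}
```

In a job script it could be used as
"tools_share_dir lamboot mpirun mpicc mpif77 || exit 1". A mismatch
would hint at a self-compiled LAM/MPI being mixed with the one from the
distribution.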

You could try the following: submit a job script containing just:

#!/bin/sh
export PATH=...(according to your LAM/MPI installation)
lamnodes
exit 0

Are the listed nodes correct? Then you could use the small mpihello.c  
to check whether it's working in principle, before you try your  
application.
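
The comparison of the listed nodes can also be scripted. The sketch
below assumes that lamnodes prints one line per node in the form
"n0 <host>:<cpus>:<flags>" and that the host names appear in the same
(short) form as in $TMPDIR/machines. The function names are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract host names from lamnodes-style output
# read from stdin (second field, up to the first colon).
lamnodes_hosts() {
    awk '{ split($2, f, ":"); print f[1] }'
}

# Hypothetical helper: compare the hosts LAM reports (stdin) with the
# machines file given as $1, both taken as sorted lists without
# duplicates.
check_universe() {
    machines_file=$1
    want=$(sort -u "$machines_file")
    got=$(lamnodes_hosts | sort -u)
    if [ "$want" = "$got" ]; then
        echo "universe matches $machines_file"
    else
        echo "universe differs from $machines_file" >&2
        return 1
    fi
}
```

In the test job this would be called as
'lamnodes | check_universe "$TMPDIR/machines"'.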

HTH - Reuti

PS: Don't use -unique to start the LAM universe. With -unique, LAM/MPI
has no way of knowing when SGE assigns two slots on one node to the
job. For a loose integration, control_slaves should also stay FALSE.
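
To illustrate the slot count: a LAM boot schema can carry the number of
CPUs per host as "host cpu=N". The sketch below builds such lines from
the SGE pe_hostfile (columns: host, slots, queue, processor range). The
function name pe_hostfile_to_bhost is made up, and this is just one
possible way to keep the slot information, not part of the stock
scripts.

```shell
#!/bin/sh
# Hypothetical helper: turn an SGE pe_hostfile into LAM boot schema
# lines of the form "host cpu=N", summing the slots per host and
# keeping the hosts in their original order.
pe_hostfile_to_bhost() {
    awk '{
        host = $1
        sub(/\..*/, "", host)     # drop the domain, as startlam.sh does
        if (!(host in slots)) order[++n] = host
        slots[host] += $2
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%s cpu=%d\n", order[i], slots[order[i]]
    }' "$1"
}
```

Inside a start script this could replace the machines file creation,
e.g. 'pe_hostfile_to_bhost "$pe_hostfile" > "$TMPDIR/machines"'.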


On 28.03.2006 at 16:33, j_reichel at freenet.de wrote:

> Hello,
>
> I am trying to integrate LAM into SGE 6.0, but it does not work
> properly.
> I have a startlam script and I added a new Parallel Environment to
> SGE.
> But after submitting the job there is no result.
> I think there is a problem with the lamboot command.
> I started it with the option -d to see what happens.
> When I look at the logfile I can see that the lamd daemon is
> started on all the nodes of the cluster.
>
> But at the very end of the logfile there is the message that
> there is no lamd on the head node.
>
> Do you have any idea?
>
> Here are the logfile, the startlam script and the PE of SGE:
>
> logfile:
>
> n-1<10054> ssi:boot:open: opening
> n-1<10054> ssi:boot:open: looking for boot module named rsh
> n-1<10054> ssi:boot:open: opening boot module rsh
> n-1<10054> ssi:boot:open: opened boot module rsh
> n-1<10054> ssi:boot:select: initializing boot module rsh
> n-1<10054> ssi:boot:rsh: module initializing
> n-1<10054> ssi:boot:rsh:agent: ssh -x
> n-1<10054> ssi:boot:rsh:username: <same>
> n-1<10054> ssi:boot:rsh:verbose: 1000
> n-1<10054> ssi:boot:rsh:algorithm: linear
> n-1<10054> ssi:boot:rsh:no_n: 0
> n-1<10054> ssi:boot:rsh:no_profile: 0
> n-1<10054> ssi:boot:rsh:fast: 0
> n-1<10054> ssi:boot:rsh:ignore_stderr: 0
> n-1<10054> ssi:boot:rsh:priority: 10
> n-1<10054> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10054> ssi:boot:select: selected boot module rsh
> n-1<10054> ssi:boot:base: looking for boot schema in following directories:
> n-1<10054> ssi:boot:base:   <current directory>
> n-1<10054> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<10054> ssi:boot:base:   $LAMHOME/etc
> n-1<10054> ssi:boot:base:   /usr/lib/lam/etc
> n-1<10054> ssi:boot:base: looking for boot schema file:
> n-1<10054> ssi:boot:base:   /tmp/78.1.all.q/machines
> n-1<10054> ssi:boot:base: found boot schema: /tmp/78.1.all.q/machines
> n-1<10054> ssi:boot:rsh: found the following hosts:
> n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n2 ppc203 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n3 ppc205 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n4 ppc228 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n5 ppc208 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n6 ppc206 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n7 ppc229 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n8 ppc231 (cpu=1)
> n-1<10054> ssi:boot:rsh: resolved hosts:
> n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
> n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
> n-1<10054> ssi:boot:rsh:   n2 ppc203 --> 141.35.13.103
> n-1<10054> ssi:boot:rsh:   n3 ppc205 --> 141.35.13.105
> n-1<10054> ssi:boot:rsh:   n4 ppc228 --> 141.35.13.119
> n-1<10054> ssi:boot:rsh:   n5 ppc208 --> 141.35.13.108
> n-1<10054> ssi:boot:rsh:   n6 ppc206 --> 141.35.13.106
> n-1<10054> ssi:boot:rsh:   n7 ppc229 --> 141.35.13.120
> n-1<10054> ssi:boot:rsh:   n8 ppc231 --> 141.35.13.122
> n-1<10054> ssi:boot:rsh: starting RTE procs
> n-1<10054> ssi:boot:base:linear: starting
> n-1<10054> ssi:boot:base:server: opening server TCP socket
> n-1<10054> ssi:boot:base:server: opened port 32789
> n-1<10054> ssi:boot:base:linear: booting n0 (ppc207)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc207)
> n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
> n-1<10054> ssi:boot:rsh: launching locally
> n-1<10057> ssi:boot:open: opening
> n-1<10057> ssi:boot:open: looking for boot module named rsh
> n-1<10057> ssi:boot:open: opening boot module rsh
> n-1<10057> ssi:boot:open: opened boot module rsh
> n-1<10057> ssi:boot:select: initializing boot module rsh
> n-1<10057> ssi:boot:rsh: module initializing
> n-1<10057> ssi:boot:rsh:agent: ssh -x
> n-1<10057> ssi:boot:rsh:username: <same>
> n-1<10057> ssi:boot:rsh:verbose: 1000
> n-1<10057> ssi:boot:rsh:algorithm: linear
> n-1<10057> ssi:boot:rsh:no_n: 0
> n-1<10057> ssi:boot:rsh:no_profile: 0
> n-1<10057> ssi:boot:rsh:fast: 0
> n-1<10057> ssi:boot:rsh:ignore_stderr: 0
> n-1<10057> ssi:boot:rsh:priority: 10
> n-1<10057> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10057> ssi:boot:select: selected boot module rsh
> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
> n-1<10057> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<10057> ssi:boot:send_lamd: getting agent port from command line
> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
> n-1<10057> ssi:boot:send_lamd: connecting to 141.35.13.107:32789, node id 0
> n-1<10057> ssi:boot:send_lamd: sending dli_port 32811
> n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.107
> n-1<10054> ssi:boot:base:server: this connection is expected (n0)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.107:32811
> n-1<10054> ssi:boot:base:linear: booting n1 (ppc211)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc211)
> n-1<10054> ssi:boot:rsh: starting on n1 (ppc211): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 1 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 1 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.111
> n-1<10054> ssi:boot:base:server: this connection is expected (n1)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.111:32803
> n-1<10054> ssi:boot:base:linear: booting n2 (ppc203)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc203)
> n-1<10054> ssi:boot:rsh: starting on n2 (ppc203): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 2 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 2 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n2 (ppc203)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.103
> n-1<10054> ssi:boot:base:server: this connection is expected (n2)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.103:32840
> n-1<10054> ssi:boot:base:linear: booting n3 (ppc205)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc205)
> n-1<10054> ssi:boot:rsh: starting on n3 (ppc205): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 3 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 3 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n3 (ppc205)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.105
> n-1<10054> ssi:boot:base:server: this connection is expected (n3)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.105:32812
> n-1<10054> ssi:boot:base:linear: booting n4 (ppc228)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc228)
> n-1<10054> ssi:boot:rsh: starting on n4 (ppc228): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 4 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 4 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n4 (ppc228)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.119
> n-1<10054> ssi:boot:base:server: this connection is expected (n4)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.119:32806
> n-1<10054> ssi:boot:base:linear: booting n5 (ppc208)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc208)
> n-1<10054> ssi:boot:rsh: starting on n5 (ppc208): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 5 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 5 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n5 (ppc208)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.108
> n-1<10054> ssi:boot:base:server: this connection is expected (n5)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.108:32821
> n-1<10054> ssi:boot:base:linear: booting n6 (ppc206)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc206)
> n-1<10054> ssi:boot:rsh: starting on n6 (ppc206): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 6 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106
> n-1<10054> ssi:boot:base:server: this connection is expected (n6)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807
> n-1<10054> ssi:boot:base:linear: booting n7 (ppc229)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc229)
> n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120
> n-1<10054> ssi:boot:base:server: this connection is expected (n7)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798
> n-1<10054> ssi:boot:base:linear: booting n8 (ppc231)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc231)
> n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 8 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122
> n-1<10054> ssi:boot:base:server: this connection is expected (n8)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876
> n-1<10054> ssi:boot:base:server: closing server socket
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795
> n-1<10057> ssi:boot:rsh: finalizing
> n-1<10057> ssi:boot: Closing
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713
> n-1<10054> ssi:boot:base:linear: finished
> n-1<10054> ssi:boot:rsh: all RTE procs started
> n-1<10054> ssi:boot:rsh: finalizing
> n-1<10054> ssi:boot: Closing
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host ppc207.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamhalt" command.
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
>
>
> startlam script:
>
> #!/bin/sh
> #
> #
> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>
> #
> # preparation of the mpi machine file
> #
> # usage: startmpi.sh [options] <pe_hostfile>
> #
> #        options are:
> #                    -catch_hostname
> #                     force use of hostname wrapper in $TMPDIR when starting mpirun
> #                    -catch_rsh
> #                     force use of rsh wrapper in $TMPDIR when starting mpirun
> #                    -unique
> #                     generate a machinefile where each hostname appears only once
> #                     This is needed to setup a multithreaded mpi application
> #
>
> PeHostfile2MachineFile()
> {
>    cat $1 | while read line; do
>       # echo $line
>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>       nslots=`echo $line|cut -f2 -d" "`
>       i=1
>       while [ $i -le $nslots ]; do
>          # add here code to map regular hostnames into ATM hostnames
>          echo $host
>          i=`expr $i + 1`
>       done
>    done
> }
>
>
> #
> # startup of LAM conforming with the Grid Engine
> # Parallel Environment interface
> #
> # on success the job will find a machine-file in $TMPDIR/machines
> #
>
> # useful to control parameters passed to us
> echo $*
>
> # parse options
> catch_rsh=0
> catch_hostname=0
> unique=0
> while [ "$1" != "" ]; do
>    case "$1" in
>       -catch_rsh)
>          catch_rsh=1
>          ;;
>       -catch_hostname)
>          catch_hostname=1
>          ;;
>       -unique)
>          unique=1
>          ;;
>       *)
>          break;
>          ;;
>    esac
>    shift
> done
>
> me=`basename $0`
>
> # test number of args
> if [ $# -ne 1 ]; then
>    echo "$me: got wrong number of arguments" >&2
>    exit 1
> fi
>
> # get arguments
> pe_hostfile=$1
>
> # ensure pe_hostfile is readable
> if [ ! -r $pe_hostfile ]; then
>    echo "$me: can't read $pe_hostfile" >&2
>    exit 1
> fi
>
> # create machine-file
> # remove column with number of slots per queue
> # mpi does not support them in this form
> machines="$TMPDIR/machines"
>
> if [ $unique = 1 ]; then
>    PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
> else
>    PeHostfile2MachineFile $pe_hostfile >> $machines
> fi
>
> # trace machines file
> cat $machines
>
> #
> # Make script wrapper for 'rsh' available in jobs tmp dir
> #
> if [ $catch_rsh = 1 ]; then
>    rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh
>    if [ ! -x $rsh_wrapper ]; then
>       echo "$me: can't execute $rsh_wrapper" >&2
>       echo "     maybe it resides at a file system not available at this machine" >&2
>       exit 1
>    fi
>
>    rshcmd=rsh
>    case "$ARC" in
>       hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
>       *) ;;
>    esac
>    # note: This could also be done using rcp, ftp or s.th.
>    #       else. We use a symbolic link since it is the
>    #       cheapest in case of a shared filesystem
>    #
>    ln -s $rsh_wrapper $TMPDIR/$rshcmd
> fi
>
> #
> # Make script wrapper for 'hostname' available in jobs tmp dir
> #
> if [ $catch_hostname = 1 ]; then
>    hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname
>    if [ ! -x $hostname_wrapper ]; then
>       echo "$me: can't execute $hostname_wrapper" >&2
>       echo "     maybe it resides at a file system not available at this machine" >&2
>       exit 1
>    fi
>
>    # note: This could also be done using rcp, ftp or s.th.
>    #       else. We use a symbolic link since it is the
>    #       cheapest in case of a shared filesystem
>    #
>    ln -s $hostname_wrapper $TMPDIR/hostname
> fi
>
> #
> # Extra LAM statement(s)
> #
> #if [ -z "`which lamboot 2>/dev/null`" ] ; then
> #    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
> #fi
> #lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines
> lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines
> echo "lamboot finished"
> # signal success to caller
> exit 0
>
>
>
>
> PE in SGE:
>
> pe_name           lam7
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile
> stop_proc_args    /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> Regards
>
> Joerg
