[GE users] Loose Integration LAM using ssh Sun Grid engine

j_reichel at freenet.de j_reichel at freenet.de
Thu Mar 30 12:51:10 BST 2006


Hi Reuti,

thanks for answering my post.

The version of LAM/MPI is 7.1.1, and we are using Debian GNU/Linux.

I tried running lamboot outside of SGE with a small hostfile. This works
fine on all nodes, and lamnodes shows every node of the environment.

When I start a job in SGE, lamboot works, as you could see in my last
posting. After that comes the failure saying that lamd is not running on
the head node, and lamnodes afterwards also only reports that there is no
lamd running on the machine.

My application works correctly on another SGE 6.0 cluster, so there is no
error in the application itself. It is only a simple program that counts
the nodes and returns the total number of nodes at the end.

The only output is in the job.e?? file:
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 15182 failed on node n1 (141.35.13.120) due to signal 4.
-----------------------------------------------------------------------------

So I still don't know what the problem with LAM is.
Do you have an idea why lamhalt won't work?

Regards, Joerg

From: Reuti <reuti at staff.uni-marburg.de>
Date: Tue, 28 Mar 2006 22:17:46 +0200
Subject: [GE users] Loose Integration LAM using ssh Sun Grid engine

Hi Joerg,

there is no need to send this posting additionally in PM. The only
advantage in this case was that the LAM/MPI output, which I got only in
the PM, showed that signal 4 was raised - SIGILL. It seems that your
application crashed and trashed the LAM universe, so lamhalt won't work.
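
For reference, the mapping from the signal number in the mpirun message
to a signal name can be checked in any POSIX shell. This is only a small
sketch; the function name signal_name is made up for illustration:

```shell
#!/bin/sh
# Map a signal number to its name using the shell's kill -l.
# signal_name is a hypothetical helper, not part of LAM/MPI or SGE.
signal_name() {
    kill -l "$1"
}

# 4 is the signal reported by mpirun in the job's error file.
signal_name 4    # on Linux this prints: ILL
```

SIGILL means the process tried to execute an illegal instruction, so the
crash happens inside the application binary (or a library it loads), not
in lamboot itself.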

So the question is: what causes this? Which version of LAM/MPI are you
using, on which type of machines, and with which Linux
distribution/version? Is it a self-compiled LAM/MPI or the one already
included in the distribution? Was the application compiled with the
mpicc/mpif77 that is currently installed on the system?
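
One way to collect part of this information is to print where the
LAM/MPI tools resolve on the PATH of the machine in question. The sketch
below assumes nothing about the installation; report_tool is a made-up
helper name, and the list of tools is only a suggestion:

```shell
#!/bin/sh
# Print where a command resolves on PATH, or note that it is missing.
# report_tool is a hypothetical helper used only for this illustration.
report_tool() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1: $path"
    else
        echo "$1: not found on PATH"
    fi
}

# Tools whose location is worth comparing between build host and nodes.
for tool in lamboot mpirun mpicc mpif77 laminfo; do
    report_tool "$tool"
done
```

Running the same script interactively and inside an SGE job shows whether
both environments pick up the same LAM/MPI installation.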

You could try the following: submit a job script containing just:

#!/bin/sh
export PATH=...(according to your LAM/MPI installation)
lamnodes
exit 0

Are the listed nodes correct? Then you could use the small mpihello.c  
to check whether it's working in principle, before you try your  
application.

HTH - Reuti

PS: Don't use -unique to start the LAM universe. If SGE assigns two
slots on one node to the job, LAM/MPI has no chance to know it. For a
loose integration, control_slaves should also stay FALSE.
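
To illustrate what gets lost with -unique: the slot count per host is in
the second column of the pe_hostfile. The following is only a sketch of
an alternative conversion that keeps the count, writing it as cpu=N in
the boot schema (the same notation lamboot prints as "(cpu=1)" in the
debug log). The function name and the sample hostfile contents are made
up; the unmodified startlam.sh without -unique achieves the same by
repeating the hostname once per slot.

```shell
#!/bin/sh
# Convert SGE pe_hostfile lines ("host slots queue ...") into LAM boot
# schema lines ("host cpu=N"), summing slots if a host appears twice.
# pe_hostfile_to_bootschema is a hypothetical name for this sketch.
pe_hostfile_to_bootschema() {
    awk '
        {
            split($1, parts, ".")      # strip the domain part, as startlam.sh does
            host = parts[1]
            if (!(host in slots)) order[++n] = host
            slots[host] += $2
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s cpu=%d\n", order[i], slots[order[i]]
        }
    ' "$1"
}

# Made-up sample: two slots on ppc207, one slot on ppc211.
sample=$(mktemp)
printf '%s\n' \
    'ppc207.example.org 2 all.q@ppc207 UNDEFINED' \
    'ppc211.example.org 1 all.q@ppc211 UNDEFINED' > "$sample"
pe_hostfile_to_bootschema "$sample"
rm -f "$sample"
```

With -unique the first host would be listed once with a single CPU, and
LAM/MPI would start one process fewer on it than SGE granted.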


On 28.03.2006 at 16:33, j_reichel at freenet.de wrote:

> Hello,
>
> I'm trying to integrate LAM into SGE 6.0, but it doesn't work properly.
> I have a startlam script and I added a new parallel environment to SGE.
> But after submitting the job there is no result.
> I think there is a problem with the lamboot command.
> I started it with the option -d to see what happens.
> In the logfile I can see that the lamd daemon is started on all the
> nodes of the cluster.
>
> But the last part of the logfile says that there is no lamd on the
> head node.
>
> Do you have any idea?
>
> Here are the logfile, the startlam script and the PE definition from SGE:
>
> logfile:
>
> n-1<10054> ssi:boot:open: opening
> n-1<10054> ssi:boot:open: looking for boot module named rsh
> n-1<10054> ssi:boot:open: opening boot module rsh
> n-1<10054> ssi:boot:open: opened boot module rsh
> n-1<10054> ssi:boot:select: initializing boot module rsh
> n-1<10054> ssi:boot:rsh: module initializing
> n-1<10054> ssi:boot:rsh:agent: ssh -x
> n-1<10054> ssi:boot:rsh:username: <same>
> n-1<10054> ssi:boot:rsh:verbose: 1000
> n-1<10054> ssi:boot:rsh:algorithm: linear
> n-1<10054> ssi:boot:rsh:no_n: 0
> n-1<10054> ssi:boot:rsh:no_profile: 0
> n-1<10054> ssi:boot:rsh:fast: 0
> n-1<10054> ssi:boot:rsh:ignore_stderr: 0
> n-1<10054> ssi:boot:rsh:priority: 10
> n-1<10054> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10054> ssi:boot:select: selected boot module rsh
> n-1<10054> ssi:boot:base: looking for boot schema in following directories:
> n-1<10054> ssi:boot:base:   <current directory>
> n-1<10054> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<10054> ssi:boot:base:   $LAMHOME/etc
> n-1<10054> ssi:boot:base:   /usr/lib/lam/etc
> n-1<10054> ssi:boot:base: looking for boot schema file:
> n-1<10054> ssi:boot:base:   /tmp/78.1.all.q/machines
> n-1<10054> ssi:boot:base: found boot schema: /tmp/78.1.all.q/machines
> n-1<10054> ssi:boot:rsh: found the following hosts:
> n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n2 ppc203 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n3 ppc205 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n4 ppc228 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n5 ppc208 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n6 ppc206 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n7 ppc229 (cpu=1)
> n-1<10054> ssi:boot:rsh:   n8 ppc231 (cpu=1)
> n-1<10054> ssi:boot:rsh: resolved hosts:
> n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
> n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
> n-1<10054> ssi:boot:rsh:   n2 ppc203 --> 141.35.13.103
> n-1<10054> ssi:boot:rsh:   n3 ppc205 --> 141.35.13.105
> n-1<10054> ssi:boot:rsh:   n4 ppc228 --> 141.35.13.119
> n-1<10054> ssi:boot:rsh:   n5 ppc208 --> 141.35.13.108
> n-1<10054> ssi:boot:rsh:   n6 ppc206 --> 141.35.13.106
> n-1<10054> ssi:boot:rsh:   n7 ppc229 --> 141.35.13.120
> n-1<10054> ssi:boot:rsh:   n8 ppc231 --> 141.35.13.122
> n-1<10054> ssi:boot:rsh: starting RTE procs
> n-1<10054> ssi:boot:base:linear: starting
> n-1<10054> ssi:boot:base:server: opening server TCP socket
> n-1<10054> ssi:boot:base:server: opened port 32789
> n-1<10054> ssi:boot:base:linear: booting n0 (ppc207)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc207)
> n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
> n-1<10054> ssi:boot:rsh: launching locally
> n-1<10057> ssi:boot:open: opening
> n-1<10057> ssi:boot:open: looking for boot module named rsh
> n-1<10057> ssi:boot:open: opening boot module rsh
> n-1<10057> ssi:boot:open: opened boot module rsh
> n-1<10057> ssi:boot:select: initializing boot module rsh
> n-1<10057> ssi:boot:rsh: module initializing
> n-1<10057> ssi:boot:rsh:agent: ssh -x
> n-1<10057> ssi:boot:rsh:username: <same>
> n-1<10057> ssi:boot:rsh:verbose: 1000
> n-1<10057> ssi:boot:rsh:algorithm: linear
> n-1<10057> ssi:boot:rsh:no_n: 0
> n-1<10057> ssi:boot:rsh:no_profile: 0
> n-1<10057> ssi:boot:rsh:fast: 0
> n-1<10057> ssi:boot:rsh:ignore_stderr: 0
> n-1<10057> ssi:boot:rsh:priority: 10
> n-1<10057> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10057> ssi:boot:select: selected boot module rsh
> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
> n-1<10057> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<10057> ssi:boot:send_lamd: getting agent port from command line
> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
> n-1<10057> ssi:boot:send_lamd: connecting to 141.35.13.107:32789, node id 0
> n-1<10057> ssi:boot:send_lamd: sending dli_port 32811
> n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.107
> n-1<10054> ssi:boot:base:server: this connection is expected (n0)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.107:32811
> n-1<10054> ssi:boot:base:linear: booting n1 (ppc211)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc211)
> n-1<10054> ssi:boot:rsh: starting on n1 (ppc211): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 1 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 1 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.111
> n-1<10054> ssi:boot:base:server: this connection is expected (n1)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.111:32803
> n-1<10054> ssi:boot:base:linear: booting n2 (ppc203)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc203)
> n-1<10054> ssi:boot:rsh: starting on n2 (ppc203): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 2 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 2 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n2 (ppc203)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.103
> n-1<10054> ssi:boot:base:server: this connection is expected (n2)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.103:32840
> n-1<10054> ssi:boot:base:linear: booting n3 (ppc205)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc205)
> n-1<10054> ssi:boot:rsh: starting on n3 (ppc205): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 3 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 3 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n3 (ppc205)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.105
> n-1<10054> ssi:boot:base:server: this connection is expected (n3)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.105:32812
> n-1<10054> ssi:boot:base:linear: booting n4 (ppc228)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc228)
> n-1<10054> ssi:boot:rsh: starting on n4 (ppc228): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 4 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 4 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n4 (ppc228)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.119
> n-1<10054> ssi:boot:base:server: this connection is expected (n4)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.119:32806
> n-1<10054> ssi:boot:base:linear: booting n5 (ppc208)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc208)
> n-1<10054> ssi:boot:rsh: starting on n5 (ppc208): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 5 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 5 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n5 (ppc208)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.108
> n-1<10054> ssi:boot:base:server: this connection is expected (n5)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.108:32821
> n-1<10054> ssi:boot:base:linear: booting n6 (ppc206)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc206)
> n-1<10054> ssi:boot:rsh: starting on n6 (ppc206): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 6 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106
> n-1<10054> ssi:boot:base:server: this connection is expected (n6)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807
> n-1<10054> ssi:boot:base:linear: booting n7 (ppc229)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc229)
> n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120
> n-1<10054> ssi:boot:base:server: this connection is expected (n7)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798
> n-1<10054> ssi:boot:base:linear: booting n8 (ppc231)
> n-1<10054> ssi:boot:rsh: starting lamd on (ppc231)
> n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0"
> n-1<10054> ssi:boot:rsh: launching remotely
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL'
> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 8 -o 0"'
> n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231)
> n-1<10054> ssi:boot:base:server: expecting connection from finite list
> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122
> n-1<10054> ssi:boot:base:server: this connection is expected (n8)
> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876
> n-1<10054> ssi:boot:base:server: closing server socket
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795
> n-1<10057> ssi:boot:rsh: finalizing
> n-1<10057> ssi:boot: Closing
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488
> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713
> n-1<10054> ssi:boot:base:server: connected
> n-1<10054> ssi:boot:base:server: sending number of links (9)
> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
> n-1<10054> ssi:boot:base:server: finished sending
> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713
> n-1<10054> ssi:boot:base:linear: finished
> n-1<10054> ssi:boot:rsh: all RTE procs started
> n-1<10054> ssi:boot:rsh: finalizing
> n-1<10054> ssi:boot: Closing
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host ppc207.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamhalt" command.
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
>
>
> startlam script:
>
> #!/bin/sh
> #
> #
> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>
> #
> # preparation of the mpi machine file
> #
> # usage: startmpi.sh [options] <pe_hostfile>
> #
> #        options are:
> #                    -catch_hostname
> #                     force use of hostname wrapper in $TMPDIR when starting mpirun
> #                    -catch_rsh
> #                     force use of rsh wrapper in $TMPDIR when starting mpirun
> #                    -unique
> #                     generate a machinefile where each hostname appears only once
> #                     This is needed to setup a multithreaded mpi application
> #
>
> PeHostfile2MachineFile()
> {
>    cat $1 | while read line; do
>       # echo $line
>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>       nslots=`echo $line|cut -f2 -d" "`
>       i=1
>       while [ $i -le $nslots ]; do
>          # add here code to map regular hostnames into ATM hostnames
>          echo $host
>          i=`expr $i + 1`
>       done
>    done
> }
>
>
> #
> # startup of LAM conforming with the Grid Engine
> # Parallel Environment interface
> #
> # on success the job will find a machine-file in $TMPDIR/machines
> #
>
> # useful to control parameters passed to us
> echo $*
>
> # parse options
> catch_rsh=0
> catch_hostname=0
> unique=0
> while [ "$1" != "" ]; do
>    case "$1" in
>       -catch_rsh)
>          catch_rsh=1
>          ;;
>       -catch_hostname)
>          catch_hostname=1
>          ;;
>       -unique)
>          unique=1
>          ;;
>       *)
>          break;
>          ;;
>    esac
>    shift
> done
>
> me=`basename $0`
>
> # test number of args
> if [ $# -ne 1 ]; then
>    echo "$me: got wrong number of arguments" >&2
>    exit 1
> fi
>
> # get arguments
> pe_hostfile=$1
>
> # ensure pe_hostfile is readable
> if [ ! -r $pe_hostfile ]; then
>    echo "$me: can't read $pe_hostfile" >&2
>    exit 1
> fi
>
> # create machine-file
> # remove column with number of slots per queue
> # mpi does not support them in this form
> machines="$TMPDIR/machines"
>
> if [ $unique = 1 ]; then
>    PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
> else
>    PeHostfile2MachineFile $pe_hostfile >> $machines
> fi
>
> # trace machines file
> cat $machines
>
> #
> # Make script wrapper for 'rsh' available in jobs tmp dir
> #
> if [ $catch_rsh = 1 ]; then
>    rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh
>    if [ ! -x $rsh_wrapper ]; then
>       echo "$me: can't execute $rsh_wrapper" >&2
>       echo "     maybe it resides at a file system not available at this machine" >&2
>       exit 1
>    fi
>
>    rshcmd=rsh
>    case "$ARC" in
>       hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
>       *) ;;
>    esac
>    # note: This could also be done using rcp, ftp or s.th.
>    #       else. We use a symbolic link since it is the
>    #       cheapest in case of a shared filesystem
>    #
>    ln -s $rsh_wrapper $TMPDIR/$rshcmd
> fi
>
> #
> # Make script wrapper for 'hostname' available in jobs tmp dir
> #
> if [ $catch_hostname = 1 ]; then
>    hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname
>    if [ ! -x $hostname_wrapper ]; then
>       echo "$me: can't execute $hostname_wrapper" >&2
>       echo "     maybe it resides at a file system not available at this machine" >&2
>       exit 1
>    fi
>
>    # note: This could also be done using rcp, ftp or s.th.
>    #       else. We use a symbolic link since it is the
>    #       cheapest in case of a shared filesystem
>    #
>    ln -s $hostname_wrapper $TMPDIR/hostname
> fi
>
> #
> # Extra LAM statement(s)
> #
> #if [ -z "`which lamboot 2>/dev/null`" ] ; then
> #    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
> #fi
> #lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines
> # signal success to caller
> lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines
> echo "lamboot beendet"
> #signal success to caller
> exit 0
> case
>
>
>
>
> PE in SGE:
>
> pe_name           lam7
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile
> stop_proc_args    /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> Regards
>
> Joerg







