[GE users] Loose Integration LAM using ssh Sun Grid engine

Reuti reuti at staff.uni-marburg.de
Fri Mar 31 09:19:29 BST 2006


Hi,

On 30.03.2006 at 13:51, j_reichel at freenet.de wrote:

> Hi Reuti,
>
> thanks for answering my post.
>
> The version of LAM/MPI is 7.1.1, and we are using Debian GNU/Linux.
>
> I tried running lamboot without SGE, using a small hostfile. This
> works fine on all nodes, and lamnodes shows all nodes of the
> environment.
>
> When I start a job in SGE, lamboot works, as you could see in my
> last posting, but then it fails because lamd is not running on the
> head node. Running lamnodes afterwards also only reports that no
> lamd is running on the machine.
>
> My application works correctly on another SGE 6.0 cluster, so there
> is no error in the application itself. It is only a simple program
> that counts the nodes and returns the total number of nodes at the
> end.
>
> The only output is in the job.e?? file:
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 15182 failed on node n1 (141.35.13.120) due to signal 4.
> -----------------------------------------------------------------------------
>
> So I still don't know what the problem with LAM is.
> Do you have an idea why lamhalt won't work?
>

lamhalt isn't working because the local daemon is no longer running.
Since you said that lamnodes already fails in the job script, can you
put just a:

ps -e f -o pid,ppid,pgrp,command

in the job script. This should list the local processes. Did you
accidentally activate this option in qconf -mconf:

execd_params ENABLE_ADDGRP_KILL

which could break the forking of the processes into daemon-land?
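
A minimal sketch of that configuration check, assuming `qconf -sconf` prints the global configuration with `execd_params` on a single line. The function name `check_addgrp_kill` is made up for illustration, and it only reports whether the parameter name appears, not which value it has.

```shell
#!/bin/sh
# Sketch: report whether ENABLE_ADDGRP_KILL is mentioned in execd_params.
# Reads SGE configuration text on stdin (for example from `qconf -sconf`).
# check_addgrp_kill is an illustrative name, not part of SGE.
check_addgrp_kill() {
    if grep -i '^execd_params' | grep -qi 'ENABLE_ADDGRP_KILL'; then
        echo "ENABLE_ADDGRP_KILL found in execd_params"
    else
        echo "ENABLE_ADDGRP_KILL not found in execd_params"
    fi
}

# Only query the live configuration when qconf is available on this host.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | check_addgrp_kill
fi
```

If the parameter is found, removing it with qconf -mconf and resubmitting the job would show whether it is involved.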

Cheers - Reuti


> regards Joerg
>
> From: Reuti <reuti at staff.uni-marburg.de>
> Date: Tue, 28 Mar 2006 22:17:46 +0200
> Subject: [GE users] Loose Integration LAM using ssh Sun Grid engine
>
> Hi Joerg,
>
> there is no need to send this posting additionally by PM. The only
> advantage in this case was that the LAM/MPI output, which I got only
> by PM, showed that signal 4 (SIGILL) was raised. It seems that your
> application crashed and trashed the LAM universe, so lamhalt won't
> work.
>
> So the question is: what causes this? Which version of LAM/MPI are
> you using, on which type of machines, and with which Linux
> distribution and version? Is it a self-compiled LAM/MPI or the one
> included in the distribution? Was the application compiled with the
> mpicc/mpif77 currently installed on the system?
>
> You could try the following: use a script containing only:
>
> #!/bin/sh
> export PATH=...(according to your LAM/MPI installation)
> lamnodes
> exit 0
>
> Are the listed nodes correct? Then you could use the small mpihello.c
> to check whether it's working in principle, before you try your
> application.
>
> HTH - Reuti
>
> PS: Don't use -unique to start the LAM universe. If SGE assigns two
> slots on one node to the job, LAM/MPI has no way of knowing it. For
> a loose integration, control_slaves should also stay FALSE.
>
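
To illustrate the point about -unique: a small sketch that expands slot counts in the same way as PeHostfile2MachineFile in the quoted startlam script, then shows what piping the result through uniq loses. The pe_hostfile lines in `sample` are hypothetical, and `expand_slots` is an illustrative name.

```shell
#!/bin/sh
# Expand "host slots ..." lines from stdin into one short hostname per slot.
expand_slots() {
    while read -r host nslots _rest; do
        i=1
        while [ "$i" -le "$nslots" ]; do
            # Strip the domain part, as the startlam script does with cut.
            echo "${host%%.*}"
            i=$((i + 1))
        done
    done
}

# Hypothetical pe_hostfile content: two slots on ppc207, one on ppc211.
sample='ppc207 2 all.q@ppc207 UNDEFINED
ppc211 1 all.q@ppc211 UNDEFINED'

echo "machine file without -unique:"
printf '%s\n' "$sample" | expand_slots

echo "machine file with -unique:"
printf '%s\n' "$sample" | expand_slots | uniq
```

Without -unique, ppc207 is listed twice, so lamboot can count two CPUs on that node. With -unique the second entry disappears and the slot count is lost.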
>
> On 28.03.2006 at 16:33, j_reichel at freenet.de wrote:
>
>> Hello,
>>
>> I am trying to integrate LAM into SGE 6.0, but it doesn't work
>> properly.
>> I have a startlam script and I added a new parallel environment to
>> SGE.
>> But after submitting the job there is no result.
>> I think there is a problem with the lamboot command.
>> I started it with the option -d to see what happens.
>> In the logfile I can see that the lamd daemon is started on all
>> the nodes of the cluster.
>>
>> But the last part of the logfile says that there is no lamd on
>> the head node.
>>
>> Do you have any idea?
>>
>> Here are the logfile, the startlam script, and the SGE PE:
>>
>> logfile:
>>
>> n-1<10054> ssi:boot:open: opening
>> n-1<10054> ssi:boot:open: looking for boot module named rsh
>> n-1<10054> ssi:boot:open: opening boot module rsh
>> n-1<10054> ssi:boot:open: opened boot module rsh
>> n-1<10054> ssi:boot:select: initializing boot module rsh
>> n-1<10054> ssi:boot:rsh: module initializing
>> n-1<10054> ssi:boot:rsh:agent: ssh -x
>> n-1<10054> ssi:boot:rsh:username: <same>
>> n-1<10054> ssi:boot:rsh:verbose: 1000
>> n-1<10054> ssi:boot:rsh:algorithm: linear
>> n-1<10054> ssi:boot:rsh:no_n: 0
>> n-1<10054> ssi:boot:rsh:no_profile: 0
>> n-1<10054> ssi:boot:rsh:fast: 0
>> n-1<10054> ssi:boot:rsh:ignore_stderr: 0
>> n-1<10054> ssi:boot:rsh:priority: 10
>> n-1<10054> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<10054> ssi:boot:select: selected boot module rsh
>> n-1<10054> ssi:boot:base: looking for boot schema in following directories:
>> n-1<10054> ssi:boot:base:   <current directory>
>> n-1<10054> ssi:boot:base:   $TROLLIUSHOME/etc
>> n-1<10054> ssi:boot:base:   $LAMHOME/etc
>> n-1<10054> ssi:boot:base:   /usr/lib/lam/etc
>> n-1<10054> ssi:boot:base: looking for boot schema file:
>> n-1<10054> ssi:boot:base:   /tmp/78.1.all.q/machines
>> n-1<10054> ssi:boot:base: found boot schema: /tmp/78.1.all.q/machines
>> n-1<10054> ssi:boot:rsh: found the following hosts:
>> n-1<10054> ssi:boot:rsh:   n0 ppc207 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n1 ppc211 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n2 ppc203 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n3 ppc205 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n4 ppc228 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n5 ppc208 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n6 ppc206 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n7 ppc229 (cpu=1)
>> n-1<10054> ssi:boot:rsh:   n8 ppc231 (cpu=1)
>> n-1<10054> ssi:boot:rsh: resolved hosts:
>> n-1<10054> ssi:boot:rsh:   n0 ppc207 --> 141.35.13.107 (origin)
>> n-1<10054> ssi:boot:rsh:   n1 ppc211 --> 141.35.13.111
>> n-1<10054> ssi:boot:rsh:   n2 ppc203 --> 141.35.13.103
>> n-1<10054> ssi:boot:rsh:   n3 ppc205 --> 141.35.13.105
>> n-1<10054> ssi:boot:rsh:   n4 ppc228 --> 141.35.13.119
>> n-1<10054> ssi:boot:rsh:   n5 ppc208 --> 141.35.13.108
>> n-1<10054> ssi:boot:rsh:   n6 ppc206 --> 141.35.13.106
>> n-1<10054> ssi:boot:rsh:   n7 ppc229 --> 141.35.13.120
>> n-1<10054> ssi:boot:rsh:   n8 ppc231 --> 141.35.13.122
>> n-1<10054> ssi:boot:rsh: starting RTE procs
>> n-1<10054> ssi:boot:base:linear: starting
>> n-1<10054> ssi:boot:base:server: opening server TCP socket
>> n-1<10054> ssi:boot:base:server: opened port 32789
>> n-1<10054> ssi:boot:base:linear: booting n0 (ppc207)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc207)
>> n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
>> n-1<10054> ssi:boot:rsh: launching locally
>> n-1<10057> ssi:boot:open: opening
>> n-1<10057> ssi:boot:open: looking for boot module named rsh
>> n-1<10057> ssi:boot:open: opening boot module rsh
>> n-1<10057> ssi:boot:open: opened boot module rsh
>> n-1<10057> ssi:boot:select: initializing boot module rsh
>> n-1<10057> ssi:boot:rsh: module initializing
>> n-1<10057> ssi:boot:rsh:agent: ssh -x
>> n-1<10057> ssi:boot:rsh:username: <same>
>> n-1<10057> ssi:boot:rsh:verbose: 1000
>> n-1<10057> ssi:boot:rsh:algorithm: linear
>> n-1<10057> ssi:boot:rsh:no_n: 0
>> n-1<10057> ssi:boot:rsh:no_profile: 0
>> n-1<10057> ssi:boot:rsh:fast: 0
>> n-1<10057> ssi:boot:rsh:ignore_stderr: 0
>> n-1<10057> ssi:boot:rsh:priority: 10
>> n-1<10057> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<10057> ssi:boot:select: selected boot module rsh
>> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
>> n-1<10057> ssi:boot:send_lamd: getting agent haddr from command line
>> n-1<10057> ssi:boot:send_lamd: getting agent port from command line
>> n-1<10057> ssi:boot:send_lamd: getting node ID from command line
>> n-1<10057> ssi:boot:send_lamd: connecting to 141.35.13.107:32789, node id 0
>> n-1<10057> ssi:boot:send_lamd: sending dli_port 32811
>> n-1<10054> ssi:boot:rsh: successfully launched on n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.107
>> n-1<10054> ssi:boot:base:server: this connection is expected (n0)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.107:32811
>> n-1<10054> ssi:boot:base:linear: booting n1 (ppc211)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc211)
>> n-1<10054> ssi:boot:rsh: starting on n1 (ppc211): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 1 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc211 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 1 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.111
>> n-1<10054> ssi:boot:base:server: this connection is expected (n1)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.111:32803
>> n-1<10054> ssi:boot:base:linear: booting n2 (ppc203)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc203)
>> n-1<10054> ssi:boot:rsh: starting on n2 (ppc203): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 2 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc203 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 2 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.103
>> n-1<10054> ssi:boot:base:server: this connection is expected (n2)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.103:32840
>> n-1<10054> ssi:boot:base:linear: booting n3 (ppc205)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc205)
>> n-1<10054> ssi:boot:rsh: starting on n3 (ppc205): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 3 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc205 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 3 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.105
>> n-1<10054> ssi:boot:base:server: this connection is expected (n3)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.105:32812
>> n-1<10054> ssi:boot:base:linear: booting n4 (ppc228)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc228)
>> n-1<10054> ssi:boot:rsh: starting on n4 (ppc228): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 4 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc228 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 4 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.119
>> n-1<10054> ssi:boot:base:server: this connection is expected (n4)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.119:32806
>> n-1<10054> ssi:boot:base:linear: booting n5 (ppc208)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc208)
>> n-1<10054> ssi:boot:rsh: starting on n5 (ppc208): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 5 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc208 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 5 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.108
>> n-1<10054> ssi:boot:base:server: this connection is expected (n5)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.108:32821
>> n-1<10054> ssi:boot:base:linear: booting n6 (ppc206)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc206)
>> n-1<10054> ssi:boot:rsh: starting on n6 (ppc206): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 6 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc206 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 6 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.106
>> n-1<10054> ssi:boot:base:server: this connection is expected (n6)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.106:32807
>> n-1<10054> ssi:boot:base:linear: booting n7 (ppc229)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc229)
>> n-1<10054> ssi:boot:rsh: starting on n7 (ppc229): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 7 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc229 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 7 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.120
>> n-1<10054> ssi:boot:base:server: this connection is expected (n7)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.120:32798
>> n-1<10054> ssi:boot:base:linear: booting n8 (ppc231)
>> n-1<10054> ssi:boot:rsh: starting lamd on (ppc231)
>> n-1<10054> ssi:boot:rsh: starting on n8 (ppc231): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I "-H 141.35.13.107 -P 32789 -n 8 -o 0"
>> n-1<10054> ssi:boot:rsh: launching remotely
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n 'echo $SHELL'
>> n-1<10054> ssi:boot:rsh: remote shell /usr/local/bin/bash
>> n-1<10054> ssi:boot:rsh: attempting to execute: ssh -x ppc231 -n hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -s -I '"-H 141.35.13.107 -P 32789 -n 8 -o 0"'
>> n-1<10054> ssi:boot:rsh: successfully launched on n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: expecting connection from finite list
>> n-1<10054> ssi:boot:base:server: got connection from 141.35.13.122
>> n-1<10054> ssi:boot:base:server: this connection is expected (n8)
>> n-1<10054> ssi:boot:base:server: remote lamd is at 141.35.13.122:34876
>> n-1<10054> ssi:boot:base:server: closing server socket
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.107:32790
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.107:32790
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.111:32784
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.111:32784
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.103:32795
>> n-1<10057> ssi:boot:rsh: finalizing
>> n-1<10057> ssi:boot: Closing
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.103:32795
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.105:32792
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.105:32792
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.119:32792
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.119:32792
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.108:32793
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.108:32793
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.106:32788
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.106:32788
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.120:54488
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.120:54488
>> n-1<10054> ssi:boot:base:server: connecting to lamd at 141.35.13.122:56713
>> n-1<10054> ssi:boot:base:server: connected
>> n-1<10054> ssi:boot:base:server: sending number of links (9)
>> n-1<10054> ssi:boot:base:server: sending info: n0 (ppc207)
>> n-1<10054> ssi:boot:base:server: sending info: n1 (ppc211)
>> n-1<10054> ssi:boot:base:server: sending info: n2 (ppc203)
>> n-1<10054> ssi:boot:base:server: sending info: n3 (ppc205)
>> n-1<10054> ssi:boot:base:server: sending info: n4 (ppc228)
>> n-1<10054> ssi:boot:base:server: sending info: n5 (ppc208)
>> n-1<10054> ssi:boot:base:server: sending info: n6 (ppc206)
>> n-1<10054> ssi:boot:base:server: sending info: n7 (ppc229)
>> n-1<10054> ssi:boot:base:server: sending info: n8 (ppc231)
>> n-1<10054> ssi:boot:base:server: finished sending
>> n-1<10054> ssi:boot:base:server: disconnected from 141.35.13.122:56713
>> n-1<10054> ssi:boot:base:linear: finished
>> n-1<10054> ssi:boot:rsh: all RTE procs started
>> n-1<10054> ssi:boot:rsh: finalizing
>> n-1<10054> ssi:boot: Closing
>> -----------------------------------------------------------------------------
>> It seems that there is no lamd running on the host ppc207.
>>
>> This indicates that the LAM/MPI runtime environment is not operating.
>> The LAM/MPI runtime environment is necessary for the "lamhalt"
>> command.
>>
>> Please run the "lamboot" command the start the LAM/MPI runtime
>> environment.  See the LAM/MPI documentation for how to invoke
>> "lamboot" across multiple machines.
>> -----------------------------------------------------------------------------
>>
>>
>> startlam script:
>>
>> #!/bin/sh
>> #
>> #
>> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>>
>> #
>> # preparation of the mpi machine file
>> #
>> # usage: startmpi.sh [options] <pe_hostfile>
>> #
>> #        options are:
>> #                    -catch_hostname
>> #                     force use of hostname wrapper in $TMPDIR when starting mpirun
>> #                    -catch_rsh
>> #                     force use of rsh wrapper in $TMPDIR when starting mpirun
>> #                    -unique
>> #                     generate a machinefile where each hostname appears only once
>> #                     This is needed to setup a multithreaded mpi application
>> #
>>
>> PeHostfile2MachineFile()
>> {
>>    cat $1 | while read line; do
>>       # echo $line
>>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>>       nslots=`echo $line|cut -f2 -d" "`
>>       i=1
>>       while [ $i -le $nslots ]; do
>>          # add here code to map regular hostnames into ATM hostnames
>>          echo $host
>>          i=`expr $i + 1`
>>       done
>>    done
>> }
>>
>>
>> #
>> # startup of LAM conforming with the Grid Engine
>> # Parallel Environment interface
>> #
>> # on success the job will find a machine-file in $TMPDIR/machines
>> #
>>
>> # useful to control parameters passed to us
>> echo $*
>>
>> # parse options
>> catch_rsh=0
>> catch_hostname=0
>> unique=0
>> while [ "$1" != "" ]; do
>>    case "$1" in
>>       -catch_rsh)
>>          catch_rsh=1
>>          ;;
>>       -catch_hostname)
>>          catch_hostname=1
>>          ;;
>>       -unique)
>>          unique=1
>>          ;;
>>       *)
>>          break;
>>          ;;
>>    esac
>>    shift
>> done
>>
>> me=`basename $0`
>>
>> # test number of args
>> if [ $# -ne 1 ]; then
>>    echo "$me: got wrong number of arguments" >&2
>>    exit 1
>> fi
>>
>> # get arguments
>> pe_hostfile=$1
>>
>> # ensure pe_hostfile is readable
>> if [ ! -r $pe_hostfile ]; then
>>    echo "$me: can't read $pe_hostfile" >&2
>>    exit 1
>> fi
>>
>> # create machine-file
>> # remove column with number of slots per queue
>> # mpi does not support them in this form
>> machines="$TMPDIR/machines"
>>
>> if [ $unique = 1 ]; then
>>    PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
>> else
>>    PeHostfile2MachineFile $pe_hostfile >> $machines
>> fi
>>
>> # trace machines file
>> cat $machines
>>
>> #
>> # Make script wrapper for 'rsh' available in jobs tmp dir
>> #
>> if [ $catch_rsh = 1 ]; then
>>    rsh_wrapper=$SGE_ROOT/lam_loose_rsh/rsh
>>    if [ ! -x $rsh_wrapper ]; then
>>       echo "$me: can't execute $rsh_wrapper" >&2
>>       echo "     maybe it resides at a file system not available at this machine" >&2
>>       exit 1
>>    fi
>>
>>    rshcmd=rsh
>>    case "$ARC" in
>>       hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
>>       *) ;;
>>    esac
>>    # note: This could also be done using rcp, ftp or s.th.
>>    #       else. We use a symbolic link since it is the
>>    #       cheapest in case of a shared filesystem
>>    #
>>    ln -s $rsh_wrapper $TMPDIR/$rshcmd
>> fi
>>
>> #
>> # Make script wrapper for 'hostname' available in jobs tmp dir
>> #
>> if [ $catch_hostname = 1 ]; then
>>    hostname_wrapper=$SGE_ROOT/lam_loose_rsh/hostname
>>    if [ ! -x $hostname_wrapper ]; then
>>       echo "$me: can't execute $hostname_wrapper" >&2
>>       echo "     maybe it resides at a file system not available at this machine" >&2
>>       exit 1
>>    fi
>>
>>    # note: This could also be done using rcp, ftp or s.th.
>>    #       else. We use a symbolic link since it is the
>>    #       cheapest in case of a shared filesystem
>>    #
>>    ln -s $hostname_wrapper $TMPDIR/hostname
>> fi
>>
>> #
>> # Extra LAM statement(s)
>> #
>> #if [ -z "`which lamboot 2>/dev/null`" ] ; then
>> #    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
>> #fi
>> #lamboot -d -ssi boot rsh -ssi rsh_agent "ssh -x" $machines
>> # signal success to caller
>> lamboot -b -d -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $machines
>> echo "lamboot finished"
>> #signal success to caller
>> exit 0
>> case
>>
>>
>>
>>
>> PE in SGE:
>>
>> pe_name           lam7
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /usr/local/grid/sge6.0/mpi/lam_loose_ssh/startlam.sh -unique $pe_hostfile
>> stop_proc_args    /usr/local/grid/sge6.0/mpi/lam_loose_ssh/stoplam.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> Regards
>>
>> Joerg
>
>
>
>
>
