[GE users] Getting SGE working with Lam-7.1.1 on Mac OS 10.3

Jeff Deroshia jeff at hal.physast.uga.edu
Wed Mar 2 16:40:26 GMT 2005


I've been trying to get this going for a week; now it's time to ask the  
community.

I'm trying to set up LAM 7.1.1 with SGE 6.1 on an Xserve cluster using  
tight integration.  I'm using the all-in-one sge-lam Perl script with  
the built-in qrsh wrappers by Christopher Duncan, which I found here:  
http://gridengine.sunsource.net/servlets/ReadMsg?msgId=19278&listName=users
It serves as the startup and shutdown script in start_proc_args and  
stop_proc_args, respectively.

Here's my mpi parallel environment (PE) in SGE:
node9:~ jeff$ qconf -sp mpi
pe_name           mpi
slots             32
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/sge/mpi/sge-lam start
stop_proc_args    /home/sge/mpi/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
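For the scheduler to place jobs into this PE it also has to appear in the
pe_list of a cluster queue.  As a sanity check that doesn't need a live
qmaster, a saved `qconf -sq` dump can be searched for the PE name.  A
minimal sketch (check_pe_list is a hypothetical helper, and the queue name
all.q is only inferred from the /tmp/797.1.all.q spool path in the log
below):

```shell
#!/bin/sh
# Hypothetical helper: given a file holding the output of `qconf -sq <queue>`,
# succeed only if the named PE appears on that queue's pe_list line.
check_pe_list() {
    # $1 = file with `qconf -sq` output, $2 = PE name to look for
    awk -v pe="$2" '$1 == "pe_list" {
        for (i = 2; i <= NF; i++) if ($i == pe) found = 1
    } END { exit !found }' "$1"
}

# On the cluster this would be used roughly as:
#   qconf -sq all.q > qdump.txt && check_pe_list qdump.txt mpi
```

If the PE is missing from the queue, adding it with `qconf -mq` is the
usual fix.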

In LAMDIR/etc (here /usr/local/lam-711/etc) I have a file called sge-lam-conf.lamd:
node9:~ jeff$ cat sge-lam-conf.lamd
/home/sge/mpi/sge-lam qrsh-local /usr/local/lam-711/bin/lamd $inet_topo $debug $session_prefix $session_suffix

I'm trying to submit the following job:
node9:~ jeff$ cat lamtest-sge.sh
#$ -cwd
#$ -pe mpi 12-16
echo "-np will be set at $NSLOTS"
mpirun -np $NSLOTS mpitest

mpitest is a simple C program that prints "Hello world!" from each processor.

Here's the parallel error output I get after submitting the job:
node9:~ jeff$ cat lamtest-sge.sh.pe797
n-1<25159> ssi:boot:open: opening
n-1<25159> ssi:boot:open: looking for boot module named rsh
n-1<25159> ssi:boot:open: opening boot module rsh
n-1<25159> ssi:boot:open: opened boot module rsh
n-1<25159> ssi:boot:select: initializing boot module rsh
n-1<25159> ssi:boot:rsh: module initializing
n-1<25159> ssi:boot:rsh:agent: /home/sge/mpi/sge-lam qrsh-remote
n-1<25159> ssi:boot:rsh:username: <same>
n-1<25159> ssi:boot:rsh:verbose: 1000
n-1<25159> ssi:boot:rsh:algorithm: linear
n-1<25159> ssi:boot:rsh:no_n: 0
n-1<25159> ssi:boot:rsh:no_profile: 0
n-1<25159> ssi:boot:rsh:fast: 0
n-1<25159> ssi:boot:rsh:ignore_stderr: 0
n-1<25159> ssi:boot:rsh:priority: 10
n-1<25159> ssi:boot:select: boot module available: rsh, priority: 10
n-1<25159> ssi:boot:select: selected boot module rsh
n-1<25159> ssi:boot:base: looking for boot schema in following directories:
n-1<25159> ssi:boot:base:   <current directory>
n-1<25159> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<25159> ssi:boot:base:   $LAMHOME/etc
n-1<25159> ssi:boot:base:   /usr/local/lam-711/etc
n-1<25159> ssi:boot:base: looking for boot schema file:
n-1<25159> ssi:boot:base:   /tmp/797.1.all.q/lamhostfile
n-1<25159> ssi:boot:base: found boot schema: /tmp/797.1.all.q/lamhostfile
n-1<25159> ssi:boot:rsh: found the following hosts:
n-1<25159> ssi:boot:rsh:   n0 node12 (cpu=2)
n-1<25159> ssi:boot:rsh:   n1 node13 (cpu=2)
n-1<25159> ssi:boot:rsh:   n2 node10 (cpu=2)
n-1<25159> ssi:boot:rsh:   n3 node11 (cpu=2)
n-1<25159> ssi:boot:rsh:   n4 node8 (cpu=2)
n-1<25159> ssi:boot:rsh:   n5 node16 (cpu=2)
n-1<25159> ssi:boot:rsh:   n6 node6 (cpu=2)
n-1<25159> ssi:boot:rsh:   n7 node7 (cpu=2)
n-1<25159> ssi:boot:rsh: resolved hosts:
n-1<25159> ssi:boot:rsh:   n0 node12 --> 192.168.0.12 (origin)
n-1<25159> ssi:boot:rsh:   n1 node13 --> 192.168.0.13
n-1<25159> ssi:boot:rsh:   n2 node10 --> 192.168.0.10
n-1<25159> ssi:boot:rsh:   n3 node11 --> 192.168.0.11
n-1<25159> ssi:boot:rsh:   n4 node8 --> 192.168.0.8
n-1<25159> ssi:boot:rsh:   n5 node16 --> 192.168.0.16
n-1<25159> ssi:boot:rsh:   n6 node6 --> 192.168.0.6
n-1<25159> ssi:boot:rsh:   n7 node7 --> 192.168.0.7
n-1<25159> ssi:boot:rsh: starting RTE procs
n-1<25159> ssi:boot:base:linear: starting
n-1<25159> ssi:boot:base:server: opening server TCP socket
n-1<25159> ssi:boot:base:server: opened port 49345
n-1<25159> ssi:boot:base:linear: booting n0 (node12)
n-1<25159> ssi:boot:rsh: starting lamd on (node12)
n-1<25159> ssi:boot:rsh: starting on n0 (node12): hboot -t -c sge-lam-conf.lamd -d -v -sessionsuffix sge-797-undefined -I -H 192.168.0.12 -P 49345 -n 0 -o 0
n-1<25159> ssi:boot:rsh: launching locally
n-1<25159> ssi:boot:rsh: successfully launched on n0 (node12)
n-1<25159> ssi:boot:base:server: expecting connection from finite list
n-1<25176> ssi:boot:open: opening
n-1<25159> ssi:boot:base:server: got connection from 192.168.0.12
n-1<25159> ssi:boot:base:server: this connection is expected (n0)
n-1<25159> ssi:boot:base:server: remote lamd is at 192.168.0.12:57400
n-1<25159> ssi:boot:base:linear: booting n1 (node13)
n-1<25159> ssi:boot:rsh: starting lamd on (node13)
n-1<25159> ssi:boot:rsh: starting on n1 (node13): hboot -t -c sge-lam-conf.lamd -d -v -sessionsuffix sge-797-undefined -s -I "-H 192.168.0.12 -P 49345 -n 1 -o 0"
n-1<25159> ssi:boot:rsh: launching remotely
n-1<25159> ssi:boot:rsh: attempting to execute: /home/sge/mpi/sge-lam qrsh-remote node13 -n 'echo $SHELL'
n-1<25176> ssi:boot:open: looking for boot module named rsh
n-1<25176> ssi:boot:open: opening boot module rsh
n-1<25176> ssi:boot:open: opened boot module rsh
n-1<25176> ssi:boot:select: initializing boot module rsh
n-1<25176> ssi:boot:rsh: module initializing
n-1<25176> ssi:boot:rsh:agent: /home/sge/mpi/sge-lam qrsh-remote
n-1<25176> ssi:boot:rsh:username: <same>
n-1<25176> ssi:boot:rsh:verbose: 1000
n-1<25176> ssi:boot:rsh:algorithm: linear
n-1<25176> ssi:boot:rsh:no_n: 0
n-1<25176> ssi:boot:rsh:no_profile: 0
n-1<25176> ssi:boot:rsh:fast: 0
n-1<25176> ssi:boot:rsh:ignore_stderr: 0
n-1<25176> ssi:boot:rsh:priority: 10
n-1<25176> ssi:boot:select: boot module available: rsh, priority: 10
n-1<25176> ssi:boot:select: selected boot module rsh
n-1<25176> ssi:boot:send_lamd: getting node ID from command line
n-1<25176> ssi:boot:send_lamd: getting agent haddr from command line
n-1<25176> ssi:boot:send_lamd: getting agent port from command line
n-1<25176> ssi:boot:send_lamd: getting node ID from command line
n-1<25176> ssi:boot:send_lamd: connecting to 192.168.0.12:49345, node id 0
n-1<25176> ssi:boot:send_lamd: sending dli_port 57400
ERROR: LAM/MPI unexpectedly received the following on stderr:
qrsh_starter: executing child process -n failed: No such file or directory
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "node13".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.

LAM tried to use the remote agent command "/home/sge/mpi/sge-lam"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This usually indicates an authentication problem with the remote
agent, some other configuration type of error in your .cshrc or
.profile file, or you were unable to executable a command on the
remote node for some other reason.  The following is a list of items
that you should check on the remote node:

         - You have an account and can login to the remote machine
         - Incorrect permissions on your home directory (should
           probably be 0755)
         - Incorrect permissions on your $HOME/.rhosts file (if you are
           using rsh -- they should probably be 0644)
         - You have an entry in the remote $HOME/.rhosts file (if you
           are using rsh) for the machine and username that you are
           running from
         - Your .cshrc/.profile must not print anything out to the
           standard error
         - Your .cshrc/.profile should set a correct TERM type
         - Your .cshrc/.profile should set the SHELL environment
           variable to your default shell

Try invoking the following command at the unix command line:

         /home/sge/mpi/sge-lam qrsh-remote node13 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<25159> ssi:boot:base:linear: Failed to boot n1 (node13)
n-1<25159> ssi:boot:base:server: closing server socket
n-1<25159> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
Synopsis:       lamwipe [-d] [-h] [-H] [-v] [-V] [-nn] [-np]
                        [-prefix </lam/install/path/>] [-w <#>] [<bhost>]

Description:    This command has been obsoleted by the "lamhalt" command.
                You should be using that instead.  However, "lamwipe" can
                still be used to shut down a LAM universe.

Options:
         -b      Use the faster lamwipe algorithm; will only work if shell
                 on all remote nodes is same as shell on local node
         -d      Print debugging message (implies -v)
         -h      Print this message
         -H      Don't print the header
         -nn     Don't add "-n" to the remote agent command line
         -np     Do not force the execution of $HOME/.profile on remote
                 hosts
         -prefix Use the LAM installation in <lam/install/path/>
         -v      Be verbose
         -V      Print version and exit without shutting down LAM
         -w <#>  Lamwipe the first <#> nodes
         <bhost> Use <bhost> as the boot schema
-----------------------------------------------------------------------------
lamboot did NOT complete successfully





I was able to get LAM running fine interactively using ssh.  However,  
when running through SGE it looks like the ssi:boot process is having  
trouble launching the lamd on the remote nodes and calling back to the  
originating node.

I originally wanted to set this up using ssh, but I couldn't get as far  
as I have now using qrsh.  However, I'm not totally sure what qrsh is  
doing under the hood.  I'm stuck.
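One thing I did notice: the log shows boot:rsh:no_n = 0, so LAM appends
rsh's "-n" (no stdin) flag after the hostname, and qrsh treats everything
after the hostname as the command to run -- which would explain
qrsh_starter trying to exec "-n" itself.  If that reading is right, either
suppressing the flag (the lamwipe help above mentions a -nn behaviour) or
having the wrapper strip it might get past this.  A rough sketch of the
stripping idea (filter_n is a hypothetical stand-in for the qrsh-remote
handler, and the echo stands in for the real `qrsh -inherit` call):

```shell
#!/bin/sh
# Sketch of an argument filter for the qrsh-remote case.  LAM invokes the
# agent as: sge-lam qrsh-remote <host> -n '<command>', but qrsh has no
# rsh-style -n flag and treats it as the command to execute.  Dropping a
# leading -n before calling qrsh should avoid that.
filter_n() {
    host="$1"; shift
    if [ "${1:-}" = "-n" ]; then
        shift                   # swallow rsh's "no stdin" flag
    fi
    # the real wrapper would do: exec qrsh -inherit "$host" "$@"
    echo "qrsh -inherit $host $*"
}
```

For example, `filter_n node13 -n 'echo $SHELL'` would hand qrsh just the
`echo $SHELL` command instead of the stray flag.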

Any thoughts/suggestions?

Thanks,

Jeff Deroshia


------------------------------------
Jeff Deroshia
Network Services Specialist
Department of Physics and Astronomy
The University of Georgia
706-542-3622
------------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
