[GE users] Problems with LAM tight integration

slaton slaton at berkeley.edu
Fri Aug 4 20:36:26 BST 2006


> Okay, please add the -v -d to lamboot, maybe we see something there.

OK. From the (pe) file:

n-1<1731> ssi:boot:open: opening
n-1<1731> ssi:boot:open: opening boot module globus
n-1<1731> ssi:boot:open: opened boot module globus
n-1<1731> ssi:boot:open: opening boot module rsh
n-1<1731> ssi:boot:open: opened boot module rsh
n-1<1731> ssi:boot:open: opening boot module slurm
n-1<1731> ssi:boot:open: opened boot module slurm
n-1<1731> ssi:boot:select: initializing boot module slurm
n-1<1731> ssi:boot:slurm: not running under SLURM
n-1<1731> ssi:boot:select: boot module not available: slurm
n-1<1731> ssi:boot:select: initializing boot module rsh
n-1<1731> ssi:boot:rsh: module initializing
n-1<1731> ssi:boot:rsh:agent: rsh
n-1<1731> ssi:boot:rsh:username: <same>
n-1<1731> ssi:boot:rsh:verbose: 1000
n-1<1731> ssi:boot:rsh:algorithm: linear
n-1<1731> ssi:boot:rsh:no_n: 0
n-1<1731> ssi:boot:rsh:no_profile: 0
n-1<1731> ssi:boot:rsh:fast: 0
n-1<1731> ssi:boot:rsh:ignore_stderr: 0
n-1<1731> ssi:boot:rsh:priority: 10
n-1<1731> ssi:boot:select: boot module available: rsh, priority: 10
n-1<1731> ssi:boot:select: initializing boot module globus
n-1<1731> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<1731> ssi:boot:select: boot module not available: globus
n-1<1731> ssi:boot:select: finalizing boot module slurm
n-1<1731> ssi:boot:slurm: finalizing
n-1<1731> ssi:boot:select: closing boot module slurm
n-1<1731> ssi:boot:select: finalizing boot module globus
n-1<1731> ssi:boot:globus: finalizing
n-1<1731> ssi:boot:select: closing boot module globus
n-1<1731> ssi:boot:select: selected boot module rsh
n-1<1731> ssi:boot:base: looking for boot schema in following directories:
n-1<1731> ssi:boot:base:   <current directory>
n-1<1731> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<1731> ssi:boot:base:   $LAMHOME/etc
n-1<1731> ssi:boot:base:   /usr/local/lam/7.1.2/sge/pgi/etc
n-1<1731> ssi:boot:base: looking for boot schema file:
n-1<1731> ssi:boot:base:   /tmp/110.1.testing/machines
n-1<1731> ssi:boot:base: found boot schema: /tmp/110.1.testing/machines
n-1<1731> ssi:boot:rsh: found the following hosts:
n-1<1731> ssi:boot:rsh:   n0 qcn16 (cpu=1)
n-1<1731> ssi:boot:rsh:   n1 qcn17 (cpu=1)
n-1<1731> ssi:boot:rsh: resolved hosts:
n-1<1731> ssi:boot:rsh:   n0 qcn16 --> 10.0.1.16 (origin)
n-1<1731> ssi:boot:rsh:   n1 qcn17 --> 10.0.1.17
n-1<1731> ssi:boot:rsh: starting RTE procs
n-1<1731> ssi:boot:base:linear: starting
n-1<1731> ssi:boot:base:server: opening server TCP socket
n-1<1731> ssi:boot:base:server: opened port 32771
n-1<1731> ssi:boot:base:linear: booting n0 (qcn16)
n-1<1731> ssi:boot:rsh: starting lamd on (qcn16)
n-1<1731> ssi:boot:rsh: starting on n0 (qcn16): hboot -t -c lam-conf.lamd -d -v -sessionsuffix sge-110-undefined -I -H 10.0.1.16 -P 32771 -n 0 -o 0
n-1<1731> ssi:boot:rsh: launching locally
n-1<1731> ssi:boot:rsh: successfully launched on n0 (qcn16)
n-1<1731> ssi:boot:base:server: expecting connection from finite list
error: ERROR! invalid option argument "-n"
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
[snip]

Curious that it never attempts to start lamd on n1 (qcn17). Maybe that is
because it never gets a successful callback from qcn16.
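
For anyone who wants to check the same thing on a longer log, here is a rough sketch that compares the hosts lamboot says it found with the hosts it actually tried to boot. The regular expressions are only guesses based on the lines quoted above, and the function name unattempted_hosts is made up for illustration; none of this is part of LAM or SGE.

```python
import re

# Assumed patterns, modelled only on the "lamboot -d -v" lines quoted above.
HOST_RE = re.compile(r"ssi:boot:rsh:\s+(n\d+) (\S+) \(cpu=\d+\)")
BOOTING_RE = re.compile(r"ssi:boot:base:linear: booting (n\d+) \((\S+)\)")


def unattempted_hosts(log_text):
    """Return (node, host) pairs found in the boot schema but never booted."""
    listed = HOST_RE.findall(log_text)
    attempted = set(BOOTING_RE.findall(log_text))
    return [pair for pair in listed if pair not in attempted]


# A few lines copied from the log above, used as sample input.
sample = """\
n-1<1731> ssi:boot:rsh: found the following hosts:
n-1<1731> ssi:boot:rsh:   n0 qcn16 (cpu=1)
n-1<1731> ssi:boot:rsh:   n1 qcn17 (cpu=1)
n-1<1731> ssi:boot:rsh: resolved hosts:
n-1<1731> ssi:boot:rsh:   n0 qcn16 --> 10.0.1.16 (origin)
n-1<1731> ssi:boot:rsh:   n1 qcn17 --> 10.0.1.17
n-1<1731> ssi:boot:base:linear: booting n0 (qcn16)
"""

result = unattempted_hosts(sample)
print(result)
```

On the sample lines this reports only n1 (qcn17), which matches what the log shows: the boot stops after n0.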


