[GE users] Strange LAM integration problem and qrsh shell question

Tim Mueller tim_mueller at hotmail.com
Thu Sep 23 18:20:53 BST 2004

First, my simple problem:  When I use qrsh from my account, it tries to execute /root/.bashrc on the remote machine.  This causes file permission problems, as I do not have permission to execute /root/.bashrc.  How can I stop this from happening without changing the permissions on /root/.bashrc?

Now for the more complicated problem.  For this problem I've temporarily changed the file permissions on /root/.bashrc so that this is no longer an issue.  I'm using the LAM integration script posted to this list (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=19278&listName=users) to integrate the pre-compiled SGE 6.0u1 with LAM 7.0.6.  I'm using Perl 5.8.  This is on an Opteron machine running RedHat Scientific Linux SL Release 3.0.2 (kernel 2.4.21-15.0.2.ELsmp).

LAM works on this machine, as does Grid Engine.  However, when I use the integration script, I get the debug output shown at the end of this message.  Lamboot apparently never hears back from the remote lamd agent.  I end up with the following process consuming a CPU:

/usr/local/sge-6.0u1/bin/lx24-amd64/qrsh -inherit -nostdin -V local27.X.X.X /usr/local/lam-7.0.6-path/bin/lamd -H -P 43648 -n 0 -o 0 -sessionsuffix sge-43-undefined

I need to kill this process by hand.

As far as I can tell, the remote lamd never even starts.  I've tried various solutions posted to this list and the LAM list for similar problems:

-- Using ssh instead of qrsh_remote does not change things
-- Changing the number of slots available (using 2 instead of 1) makes no difference
-- The network connection between the two machines seems to be fine, and no firewalls are in place.

My PE is configured as follows:

pe_name           mpi
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/lam-7.0.6-path/bin/sge-lam start
stop_proc_args    /usr/local/lam-7.0.6-path/bin/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     avg

Now for the weird bit.  I can get the script to work fine as long as I have an open filehandle before the following call in qrsh_local:

exec($qrsh, @myargs);

For example, if I replace that call with 

open(DUMMY,"> /tmp/dummy");
exec($qrsh, @myargs);

everything works perfectly.  I could leave it like this, but I'd rather know what's going on.  Replacing exec() with system() gives similar results, except that, of course, when qrsh hangs, sge-lam hangs as well.
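
One possible explanation, offered as an unverified guess: if sge-lam is started with one of the standard descriptors (0, 1 or 2) closed, open(DUMMY, ...) takes the lowest free descriptor and so fills the gap before qrsh is exec'd.  Perl passes descriptors up to $^F (2 by default) on to exec'd programs and marks higher ones close-on-exec, so a DUMMY handle that landed on a higher number would not survive into qrsh at all.  The sketch below, which could run just before the exec() in qrsh_local, opens /dev/null on any closed standard descriptor and records which ones it filled.  The helper name plug_closed_std_fds and the log path are made up for illustration and are not part of the posted sge-lam script.

```perl
use strict;
use warnings;
use POSIX qw(O_RDWR);

# Hypothetical helper, not part of sge-lam: make sure descriptors 0, 1 and 2
# are open, and return the list of descriptors that had to be filled.
sub plug_closed_std_fds {
    my @plugged;
    while (1) {
        # open(2) hands out the lowest unused descriptor, so any result in
        # 0..2 means that standard descriptor was closed.
        my $fd = POSIX::open('/dev/null', O_RDWR);
        last unless defined $fd;
        if ($fd > 2) {
            # All of 0..2 are in use; discard the extra descriptor.
            POSIX::close($fd);
            last;
        }
        push @plugged, $fd + 0;    # leave it open on /dev/null
    }
    return @plugged;
}

my @plugged = plug_closed_std_fds();

# Record the result in a scratch file (the path is an assumption).  The log
# handle is closed again so that it does not itself act like DUMMY.
my $logfile = "/tmp/sge-lam-fds." . $$ . ".log";
if (open(my $log, '>>', $logfile)) {
    print $log "filled descriptors: ", (@plugged ? "@plugged" : "none"), "\n";
    close($log);
}
```

If the log shows a non-empty list and the hang goes away with the exec() call otherwise unchanged, that would point at a closed standard descriptor rather than at LAM or the network.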

I'm very confused by this.  Any suggestions would be helpful.


SGE-LAM Debug output (for lam start):

SGE-LAM DEBUG: LAMHOME = /usr/local/lam-7.0.6-path
SGE-LAM DEBUG: SGE_ROOT = /usr/local/sge-6.0u1
SGE-LAM DEBUG: PATH = /usr/local/sge-6.0u1/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/pathscale-1.3/bin/:/usr/local/lam-7.0.6-path/bin:/usr/X11R6/bin:/usr/local/OpenPBS-2.3
SGE-LAM DEBUG: qrsh = /usr/local/sge-6.0u1/bin/lx24-amd64/qrsh
SGE-LAM DEBUG: func=start
SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /usr/local/lam-7.0.6-path/bin/sge-lam qrsh-remote -c sge-lam-conf.lamd -v -d /tmp/43.1.opteron_test/lamhostfile /tmp/43.1.opteron_
n-1<30315> ssi:boot: Opening
n-1<30315> ssi:boot: looking for module named rsh
n-1<30315> ssi:boot: opening module rsh
n-1<30315> ssi:boot: initializing module rsh
n-1<30315> ssi:boot:rsh: module initializing
n-1<30315> ssi:boot:rsh:agent: /usr/local/lam-7.0.6-path/bin/sge-lam qrsh-remote
n-1<30315> ssi:boot:rsh:username: <same>
n-1<30315> ssi:boot:rsh:verbose: 1000
n-1<30315> ssi:boot:rsh:algorithm: linear
n-1<30315> ssi:boot:rsh:priority: 10
n-1<30315> ssi:boot: Selected boot module rsh
n-1<30315> ssi:boot:base: looking for boot schema in following directories:
n-1<30315> ssi:boot:base:   <current directory>
n-1<30315> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<30315> ssi:boot:base:   $LAMHOME/etc
n-1<30315> ssi:boot:base:   /usr/local/lam-7.0.6-path//etc
n-1<30315> ssi:boot:base: looking for boot schema file:
n-1<30315> ssi:boot:base:   /tmp/43.1.opteron_test/lamhostfile
n-1<30315> ssi:boot:base: found boot schema: /tmp/43.1.opteron_test/lamhostfile
n-1<30315> ssi:boot:rsh: found the following hosts:
n-1<30315> ssi:boot:rsh:   n0 local27.X.X.X (cpu=2)
n-1<30315> ssi:boot:rsh: resolved hosts:
n-1<30315> ssi:boot:rsh:   n0 local27.X.X.X--> (origin)
n-1<30315> ssi:boot:rsh: starting RTE procs
n-1<30315> ssi:boot:base:linear: starting
n-1<30315> ssi:boot:base:server: opening server TCP socket
n-1<30315> ssi:boot:base:server: opened port 41734
n-1<30315> ssi:boot:base:linear: booting n0 (local27.X.X.X)
n-1<30315> ssi:boot:rsh: starting lamd on (local27.X.X.X)
n-1<30315> ssi:boot:rsh: starting on n0 (local27.X.X.X): hboot -t -c sge-lam-conf.lamd -d -v -sessionsuffix sge-43-undefined -I -H -P 41734 -n 0 -o 0
n-1<30315> ssi:boot:rsh: launching locally
n-1<30315> ssi:boot:rsh: successfully launched on n0 (local27.X.X.X)
n-1<30315> ssi:boot:base:server: expecting connection from finite list
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back.  Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line.  For example, 
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error.  If
   you get any other kind of error, it could indicate either of the
   two conditions above.  Consult with your system/network
n-1<30315> ssi:boot:base:server: failed to connect to remote lamd!
n-1<30315> ssi:boot:base:server: closing server socket
n-1<30315> ssi:boot:base:linear: aborted!
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
lamboot did NOT complete successfully
