[GE users] Disappearing hosts/queues with PE's

jeroen.m.kleijer at philips.com
Thu Feb 24 10:57:15 GMT 2005


Hi all,

I'm trying to get a tight LAM integration going, following this post:
http://gridengine.sunsource.net/servlets/ReadMsg?msgId=21121&listName=users

I used the sge-lam Perl script provided in this post:
http://gridengine.sunsource.net/servlets/ReadMsg?msgId=19278&listName=users

and modified the qrsh-local sub to hold an open file descriptor before 
doing the exec($qrsh, @myargs), so my CPU doesn't go to 100% while doing 
nothing.
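
The modified sub itself is not shown in this post. As a rough illustration of the idea only, here is a minimal Python sketch (not the Perl from sge-lam) that starts a command with its stdin attached to a pipe whose write end stays open, so a read on stdin blocks instead of returning EOF immediately. The name run_with_open_stdin is hypothetical, the stand-in command replaces the real qrsh call, and it is an assumption that the busy loop comes from qrsh polling a closed stdin.

```python
import os
import subprocess
import sys


def run_with_open_stdin(argv):
    """Run argv with stdin attached to a pipe that never reaches EOF.

    Hypothetical helper: the parent keeps the write end open until the
    child exits, so a child that reads stdin blocks instead of spinning.
    """
    read_fd, write_fd = os.pipe()
    try:
        # The child gets the read end as fd 0. The write end is not
        # inherited (Python marks pipe fds non-inheritable by default).
        return subprocess.call(argv, stdin=read_fd)
    finally:
        os.close(read_fd)
        os.close(write_fd)


# Stand-in command; a real wrapper would pass the qrsh argument list here.
exit_code = run_with_open_stdin([sys.executable, "-c", "pass"])
```

One caveat of this approach: a child that waits for EOF on stdin would never finish, because the write end is only closed after the child has exited.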

This, however, doesn't seem to be enough.

I have the following PE:
pe_name           lammpi-32bits
slots             6
user_lists        NONE
xuser_lists       NONE
start_proc_args   /cadappl/lam/7.1.1-32/bin/sge-lam start
stop_proc_args    /cadappl/lam/7.1.1-32/bin/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
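
As a quick sanity check on a definition like this, the following Python sketch parses the text in the form printed by "qconf -sp lammpi-32bits" and flags the settings this setup depends on. The helper names parse_pe and check_pe are hypothetical, and the checks are assumptions about what a tight integration needs (control_slaves TRUE so slave tasks may be started under SGE control, plus start/stop procedures to boot and halt LAM), not an official checklist.

```python
PE_TEXT = """\
pe_name           lammpi-32bits
slots             6
user_lists        NONE
xuser_lists       NONE
start_proc_args   /cadappl/lam/7.1.1-32/bin/sge-lam start
stop_proc_args    /cadappl/lam/7.1.1-32/bin/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
"""


def parse_pe(text):
    """Turn 'key   value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def check_pe(pe, requested_slots):
    """Return a list of problems (empty if none were found)."""
    problems = []
    if pe.get("control_slaves", "").upper() != "TRUE":
        problems.append("control_slaves is not TRUE")
    for key in ("start_proc_args", "stop_proc_args"):
        if pe.get(key, "NONE").upper() == "NONE":
            problems.append(key + " is not set")
    if requested_slots > int(pe.get("slots", "0")):
        problems.append("job requests more slots than the PE offers")
    return problems


pe = parse_pe(PE_TEXT)
problems = check_pe(pe, requested_slots=5)
print(problems or "no obvious problems in the PE definition")
```

For the definition above and a 5-slot request, none of these checks fire, which suggests the problem lies elsewhere (for example in how the remote agent command is invoked).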

Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where 
hello is a very simple MPI program) gives me the following:
The PE lammpi-32bits is started, and on the first host I can see a lamd, 
qrsh, qrsh_starter and lamhalt running.
These processes keep running until I kill the job, and therefore the LAM 
universe is never properly started (see the output at the end).
What does cause me some concern is that in some cases (not reproducible) 
queues, or rather whole hosts, go missing.
When I open qmon and try to see the state of the different queues, a couple 
of batch.q queues on several hosts are missing.
Killing the job usually makes them show up again, though they return in an 
error state:
++++++++++++++++++++++++++++++++++++++++
Queue: batch.q@nlcftcs12
        queue batch.q@nlcftcs12 marked QERROR as a result of 2717's failure at host nlcftcs12.
++++++++++++++++++++++++++++++++++++++++

Though I'm not particularly fond of a queue in an error state, the queue 
completely disappearing and reappearing when the job is killed leaves me a 
bit puzzled.
I can't seem to find anything related in the local messages file of 
nlcftcs12.

Perhaps I'm doing something very wrong when starting up the LAM universe, 
so I'm eagerly awaiting Reuti's howtos/hints regarding tight 
integration of SGE+LAM.

Has anybody seen this before? (The major problem is that it isn't reproducible.)
And if anyone knows what to do about the LAM problem seen in the output, 
"ksh: ksh: - : unknown option", I'd be happy to hear about it.

Kind regards,

Jeroen Kleijer



cat mpirun.pe2717
n-1<13549> ssi:boot:base:linear: booting n0 (nlcftcs12)
n-1<13549> ssi:boot:base:linear: booting n1 (nlcftcs13)
ERROR: LAM/MPI unexpectedly received the following on stderr:
ksh: ksh: - : unknown option
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "nlcftcs13".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host. 

LAM tried to use the remote agent command "/cadappl/lam/7.1.1-32/bin/sge-lam"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This usually indicates an authentication problem with the remote
agent, some other configuration type of error in your .cshrc or
.profile file, or you were unable to executable a command on the
remote node for some other reason.  The following is a list of items
that you should check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755) 
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644) 
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the 
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /cadappl/lam/7.1.1-32/bin/sge-lam qrsh-remote nlcftcs13 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<13549> ssi:boot:base:linear: Failed to boot n1 (nlcftcs13)
n-1<13549> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
Synopsis:       lamwipe [-d] [-h] [-H] [-v] [-V] [-nn] [-np] 
                        [-prefix </lam/install/path/>] [-w <#>] [<bhost>]

Description:    This command has been obsoleted by the "lamhalt" command.
                You should be using that instead.  However, "lamwipe" can
                still be used to shut down a LAM universe.

Options:
        -b      Use the faster lamwipe algorithm; will only work if shell
                on all remote nodes is same as shell on local node
        -d      Print debugging message (implies -v)
        -h      Print this message
        -H      Don't print the header
        -nn     Don't add "-n" to the remote agent command line
        -np     Do not force the execution of $HOME/.profile on remote
                hosts
        -prefix Use the LAM installation in <lam/install/path/>
        -v      Be verbose
        -V      Print version and exit without shutting down LAM
        -w <#>  Lamwipe the first <#> nodes
        <bhost> Use <bhost> as the boot schema
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host nlcftcs12.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
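
When comparing several job output files like the one above, the node that failed to boot can be pulled out mechanically. This is a small Python sketch; the regular expression is only matched against the "Failed to boot" line format seen in this output, and the name failed_nodes is hypothetical.

```python
import re

# Matches lines such as "... Failed to boot n1 (nlcftcs13)".
BOOT_FAILURE = re.compile(r"Failed to boot (n\d+) \(([^)]+)\)")


def failed_nodes(log_text):
    """Return (lam_node, hostname) pairs for every failed boot line."""
    return BOOT_FAILURE.findall(log_text)


sample = """\
n-1<13549> ssi:boot:base:linear: booting n0 (nlcftcs12)
n-1<13549> ssi:boot:base:linear: booting n1 (nlcftcs13)
n-1<13549> ssi:boot:base:linear: Failed to boot n1 (nlcftcs13)
n-1<13549> ssi:boot:base:linear: aborted!
"""
print(failed_nodes(sample))  # prints [('n1', 'nlcftcs13')]
```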



Kind regards

Jeroen Kleijer
Unix System Administration
Philips Applied Technologies


