[GE users] Disappearing hosts/queues with PE's

Thu Feb 24 12:31:53 GMT 2005

Hi Reuti,

I've been able to reproduce the problem.
Using the sge-lam script from Chris (without the added
'open (DUMMY, ">/tmp/dummy");'), I try to start a LAM universe. It fails
and kills all processes on the starting node except qrsh.
This qrsh process then starts using 100% CPU while doing nothing.
If I then look in qmon to see what queues I have, every queue that is
defined on this starting node (in my case nlcftcs12) has disappeared!
According to qstat the job is still running, and qdel does kill the job,
but the qrsh keeps running on nlcftcs12, using CPU, and the queue info
still does not come back in qmon.
When I kill the offending qrsh by hand, the host shows up again in qmon
(after about half a minute).
Should this be submitted to the dev-list?

Quoting jeroen.m.kleijer at philips.com:

> Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where 

BTW: what is the maximum size that -V can handle? It seems that in 5.3p6
there is a 10 kB limit. As `set | wc -c` shows more, it will stop in the
middle and give an error, so we use it only inside the PE for qrsh and
never on the command line.
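
A rough way to check the environment size before relying on -V could look like the sketch below. The 10240-byte threshold is only the figure observed above for 5.3p6, not a documented constant, and `env_size_bytes` is a hypothetical helper name. It counts `env` output (exported variables only) rather than `set`, which also lists unexported shell variables.

```shell
#!/bin/sh
# Hypothetical helper: print the size in bytes of the exported environment.
env_size_bytes() {
    # tr strips the padding some wc implementations put around the count
    env | wc -c | tr -d ' '
}

# Assumed limit, taken from the observation above (about 10 kB in 5.3p6).
LIMIT=10240

size=$(env_size_bytes)
if [ "$size" -gt "$LIMIT" ]; then
    echo "environment is $size bytes: -V may fail, consider passing single variables with -v"
else
    echo "environment is $size bytes: below the assumed $LIMIT byte limit"
fi
```

If the size is over the limit, exporting only the variables the job needs with `qsub -v NAME` avoids sending the whole environment.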

> hello is a very simple mpi program) gives me the following:
> The PE lammpi-32bits is started and on the first host I can see a lamd, 
> qrsh, qrsh_starter and lamhalt running.
> These processes keep running until I kill the job and therefore the LAM 
> universe is never properly started. (see output at the end)
> What does cause me some concern is that in some cases (not replicable) I 
> see missing queues or better yet hosts.
> When I open up qmon and try to see the state of the different queues I 
> miss a couple of batch.q queues on several hosts.
> Killing the job usually makes them show up again though they return in 
> error state:
> Queue: batch.q at nlcftcs12
>         queue batch.q at nlcftcs12 marked QERROR as a result of 2717's 
> failure at host nlcftcs12.
> ++++++++++++++++++++++++++++++++++++++++
> Though I'm not particularly fond of a queue in an error state, the queue 
> completely disappearing and reappearing when the job is killed leaves me 
> a bit puzzled.
> I can't seem to find anything related in the local messages file of 
> nlcftcs12.

During my tests I only ever got the error state; a queue never vanished.

> Perhaps I'm doing something very wrong with starting up the LAM universe 
> so I'm eagerly awaiting Reuti's howto's / hints regarding tight 
> integration of SGE+LAM.

In short: it's a race condition, that qrsh on the slaves is killed, before 
is up. You'll see in the Howto how to get around it.
> Has anybody seen this before? (major problem is that it isn't 
> And if anyone knows what to do about the lam problem seen in the output 
> "ksh: ksh: -: unknown option" I'd be happy to hear about it.

LAM is sending the option -n to the script, and so the script sends it to 
which doesn't know it and forwards it to the shell:

$ ksh -c "-n ls"
ksh: ksh: - : unknown option

I found more than one version of Chris' original script, some with and 
some without the removal of this parameter. The rsh-wrapper e.g. in 
takes care of it.
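
Removing the parameter can be sketched roughly like this. It is only an illustration, not the code from any of the scripts mentioned; `strip_n` is a hypothetical function name, and the argument order in the example call is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the arguments with every standalone "-n" dropped.
# Simplification: arguments are joined with single spaces, so arguments that
# themselves contain whitespace are not preserved as separate words.
strip_n() {
    out=""
    for arg in "$@"; do
        if [ "$arg" = "-n" ]; then
            continue            # skip the option the shell would choke on
        fi
        out="${out:+$out }$arg" # append, with a space only after the first word
    done
    printf '%s\n' "$out"
}

strip_n nlcftcs12 -n ls
# prints: nlcftcs12 ls
```

A wrapper would run this filtering before handing the remaining words on, so that "-n" never reaches `ksh -c`.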

Cheers - Reuti

