[GE users] Disappearing hosts/queues with PE's
jeroen.m.kleijer at philips.com
Thu Feb 24 12:31:53 GMT 2005
I've been able to reproduce the problem.
Using the sge-lam script from Chris (without the added statement
'open (DUMMY, ">/tmp/dummy");'), I try to start a LAM universe; it fails and
kills all processes on the starting node except qrsh.
This qrsh process then starts claiming 100% CPU usage while doing nothing.
If I then look in qmon to see what queues I have, every queue that is
defined on this starting node (in my case nlcftcs12) has disappeared!
According to qstat the job is still running. qdel does kill the job, but
the qrsh process keeps running on nlcftcs12, claiming CPU, and the queue
info still does not come back in qmon.
When I kill the offending qrsh by hand, the host shows up again in qmon
(after about half a minute).
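
A minimal sketch of how the runaway qrsh could be spotted before killing it by hand. It assumes a ps that accepts "-eo pid,pcpu,comm" (as procps on Linux does). The function name busy_qrsh_pids and the 90% threshold are made up for illustration and are not part of SGE.

```shell
#!/bin/sh
# Hypothetical helper, not part of SGE.
# Reads "PID %CPU COMMAND" lines on stdin and prints the PIDs of processes
# whose command name is exactly "qrsh" and whose CPU usage is at or above
# the given threshold (default 90). The ps header line is skipped
# automatically because its third field is not "qrsh".
busy_qrsh_pids() {
    threshold="${1:-90}"
    awk -v t="$threshold" '$3 == "qrsh" && $2 + 0 >= t + 0 { print $1 }'
}

# Usage on the affected node (assumes ps supports these field names):
#   ps -eo pid,pcpu,comm | busy_qrsh_pids 90
# Inspect the printed PIDs first, then kill the offending one by hand.
```

Matching on the exact command name keeps qrsh_starter and lamd out of the result, so only the spinning qrsh itself is listed.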
Should this be submitted to the dev-list?
Met vriendelijke groeten / Kind regards
Philips Applied Technologies
Reuti <reuti at staff.uni-marburg.de>
2005-02-24 12:56 PM
Please respond to users
To: users at gridengine.sunsource.net
cc: (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
Subject: Re: [GE users] Disappearing hosts/queues with PE's
Quoting jeroen.m.kleijer at philips.com:
> Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where
BTW: what is the maximum size for -V to be handled? It seems that in 5.3p6 there
is a 10 kB limit. As `set | wc -c` shows more, it will stop in the middle and
give an error, so we use it only inside the PE for qrsh and never on
> hello is a very simple mpi program) gives me the following:
> The PE lammpi-32bits is started and on the first host I can see a lamd,
> qrsh, qrsh_starter and lamhalt running.
> These processes keep running until I kill the job and therefore the LAM
> universe is never properly started. (see output at the end)
> What does cause me some concern is that in some cases (not replicable) I
> see missing queues or better yet hosts.
> When I open up qmon and try to see the state of the different queues I miss
> a couple of batch.q queues on several hosts.
> Killing the job usually makes them show up again though they return in
> error state:
> Queue: batch.q at nlcftcs12
> queue batch.q at nlcftcs12 marked QERROR as a result of 2717's
> failure at host nlcftcs12.
> Though I'm not particularly fond of a queue in an error state, the queue
> completely disappearing and reappearing when the job is killed leaves me a
> bit puzzled.
> I can't seem to find anything related in the local messages file of
During my tests I only got the error state; a queue never vanished.
> Perhaps I'm doing something very wrong with starting up the LAM universe
> so I'm eagerly awaiting Reuti's howto's / hints regarding tight
> integration of SGE+LAM.
In short: it's a race condition, that qrsh on the slaves is killed, before
is up. You'll see in the Howto how to get around it.
> Has anybody seen this before? (major problem is that it isn't
> And if anyone knows what to do about the lam problem seen in the output
> "ksh: ksh: -: unknown option" I'd be happy to hear about it.
LAM is sending the option -n to the script, and so the script sends it to
which doesn't know it and forwards it to the shell:
$ ksh -c "-n ls"
ksh: ksh: - : unknown option
I found more than one version of Chris' original script, some with and some
without the removal of this parameter. The rsh-wrapper e.g. in
takes care of it.
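
A minimal sketch of the idea of dropping the parameter. This is not the actual rsh-wrapper or Chris' script: run_without_n is a hypothetical shell function that discards any leading -n and then runs whatever is left. It assumes the -n arrives in front of the command, as in the ksh example above.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Discards every leading "-n" argument, then runs the remaining
# arguments as a command, so the shell never sees "-n" as an option.
run_without_n() {
    while [ "${1:-}" = "-n" ]; do
        shift
    done
    "$@"
}

# Usage: the -n is dropped and only "ls" is executed.
#   run_without_n -n ls
```

Only leading occurrences are removed, so a -n that belongs to the command itself (after the command name) is passed through untouched.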
Cheers - Reuti