[GE users] Disappearing hosts/queues with PE's
reuti at staff.uni-marburg.de
Thu Feb 24 11:56:25 GMT 2005
Quoting jeroen.m.kleijer at philips.com:
> Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where
> hello is a very simple mpi program) gives me the following:
BTW: what is the maximum size of the environment that -V can handle? It seems
that in 5.3p6 there is a 10 kB limit. As `set | wc -c` shows more here, it stops
in the middle and gives an error, so we use it only inside the PE for qrsh and
never on the
> The PE lammpi-32bits is started and on the first host I can see a lamd,
> qrsh, qrsh_starter and lamhalt running.
> These processes keep running until I kill the job and therefore the LAM
> universe is never properly started. (see output at the end)
> What does cause me some concern is that in some cases (not replicable) I
> see missing queues or better yet hosts.
> When I open up qmon and try to see the state of the different queues I miss
> a couple of batch.q queues on several hosts.
> Killing the job usually makes them show up again though they return in an
> error state:
> Queue: batch.q@nlcftcs12
> queue batch.q@nlcftcs12 marked QERROR as a result of 2717's
> failure at host nlcftcs12.
> Though I'm not particularly fond of a queue in an error state, the queue
> completely disappearing and reappearing when the job is killed leaves me a
> bit puzzled.
> I can't seem to find anything related in the local messages file of
During my tests I only got the error state, but never a queue vanished.
> Perhaps I'm doing something very wrong with starting up the LAM universe
> so I'm eagerly awaiting Reuti's howto's / hints regarding tight
> integration of SGE+LAM.
In short: it's a race condition in which qrsh on the slaves is killed before
lamd is up. You'll see in the Howto how to get around it.
> Has anybody seen this before? (major problem is that it isn't replicable)
> And if anyone knows what to do about the lam problem seen in the output
> "ksh: ksh: -: unknown option" I'd be happy to hear about it.
LAM passes the option -n to the script, so the script passes it on to qrsh,
which doesn't know it and forwards it to the shell:
$ ksh -c "-n ls"
ksh: ksh: - : unknown option
I found more than one version of Chris' original script, some with and some
without the removal of this parameter. The rsh-wrapper e.g. in $SGE_ROOT/mpi
takes care of it.
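To illustrate the kind of filtering meant here, a minimal sketch follows. It is not the actual wrapper from $SGE_ROOT/mpi; the function name strip_n and the demo arguments are made up for illustration. It drops every standalone -n from the argument list and keeps everything else in order:

```shell
#!/bin/sh
# Hypothetical helper (not taken from $SGE_ROOT/mpi): remove each standalone
# "-n" from the argument list and print the remaining arguments, one per line.
strip_n() {
    # The list for "for" is expanded once at loop start, so the positional
    # parameters can be rebuilt safely inside the loop.
    for arg in "$@"; do
        shift                      # drop the argument currently inspected
        [ "$arg" = "-n" ] && continue
        set -- "$@" "$arg"         # re-append everything that is not -n
    done
    if [ "$#" -gt 0 ]; then
        printf '%s\n' "$@"
    fi
}

# Demo with made-up arguments, in the form an rsh-style call might arrive.
strip_n -n nlcftcs12 hostname
```

In a real wrapper the rebuilt "$@" would be handed on to qrsh (for example via qrsh -inherit) instead of being printed; that part is left out here because it depends on the local setup.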
Cheers - Reuti