[GE users] Disappearing hosts/queues with PE's

Reuti reuti at staff.uni-marburg.de
Thu Feb 24 11:56:25 GMT 2005



Quoting jeroen.m.kleijer at philips.com:

<snip>
> Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where 

BTW: what is the maximum size of the environment that -V can handle? In 5.3p6 
there seems to be a limit of about 10 kB. Since `set | wc -c` shows more than 
that here, it stops in the middle and gives an error, so we use -V only inside 
the PE for qrsh and never on the command line.
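
A rough way to check this before submitting is sketched below. The function names and the LIMIT value are only assumptions for illustration: the 10 kB figure is just what I observed in 5.3p6, and `env` counts only exported variables, so the number will differ somewhat from `set | wc -c`.

```shell
# Hypothetical helper: size of the exported environment in bytes.
# tr strips the padding that some wc implementations print.
env_size_bytes() {
    env | wc -c | tr -d ' '
}

# Assumed threshold, taken from the roughly 10 kB seen in 5.3p6.
LIMIT=10240

# Returns 0 if the environment is within LIMIT, 1 otherwise.
check_env_for_V() {
    size=$(env_size_bytes)
    if [ "$size" -gt "$LIMIT" ]; then
        echo "environment is $size bytes, above $LIMIT: better avoid -V"
        return 1
    fi
    echo "environment is $size bytes, within $LIMIT"
    return 0
}
```

If the check fails, passing only the variables the job needs with `qsub -v NAME` instead of the whole environment with -V may be an option.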

> hello is a very simple mpi program) gives me the following:
> The PE lammpi-32bits is started and on the first host I can see a lamd, 
> qrsh, qrsh_starter and lamhalt running.
> These processes keep running until I kill the job and therefore the LAM 
> universe is never properly started. (see output at the end)
> What does cause me some concern is that in some cases (not replicable) I 
> see missing queues or better yet hosts.
> When I open up qmon and try to see the state of the different queues I miss 
> a couple of batch.q queues on several hosts.
> Killing the job usually makes them show up again though they return in an 
> error state:
> ++++++++++++++++++++++++++++++++++++++++
> Queue: batch.q at nlcftcs12
>         queue batch.q at nlcftcs12 marked QERROR as a result of 2717's 
> failure at host nlcftcs12.
> ++++++++++++++++++++++++++++++++++++++++
> 
> Though I'm not particularly fond of a queue in an error state, the queue 
> completely disappearing and reappearing when the job is killed leaves me a 
> bit puzzled.
> I can't seem to find anything related in the local messages file of 
> nlcftcs12.

During my tests I only ever got the error state; a queue never vanished.
 
> Perhaps I'm doing something very wrong with starting up the LAM universe 
> so I'm eagerly awaiting Reuti's howto's / hints regarding tight 
> integration of SGE+LAM.

In short: it's a race condition. The qrsh on the slaves is killed before lamd 
is up. You'll see in the Howto how to get around it.
 
> Has anybody seen this before? (major problem is that it isn't replicable)
> And if anyone knows what to do about the lam problem seen in the output 
> "ksh: ksh: -: unknown option" I'd be happy to hear about it.

LAM passes the option -n to the script, and the script passes it on to qrsh, 
which doesn't know it and forwards it to the shell:

$ ksh -c "-n ls"
ksh: ksh: - : unknown option

I found more than one version of Chris' original script, some with and some 
without the removal of this parameter. The rsh-wrapper in $SGE_ROOT/mpi, for 
example, takes care of it.
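
The removal itself can be sketched in a few lines of shell. This is not the code from $SGE_ROOT/mpi or from Chris' script, only an illustration of the idea; the function name strip_n_option is made up.

```shell
# Hypothetical filter: drop every standalone -n from the argument list
# and print the remaining arguments, one per line.
strip_n_option() {
    for arg in "$@"; do
        shift                      # remove the current argument from the front
        if [ "$arg" != "-n" ]; then
            set -- "$@" "$arg"     # re-append everything except -n
        fi
    done
    printf '%s\n' "$@"
}

# Example: the -n that LAM adds is gone, the rest is untouched.
strip_n_option -n node01 ls
```

In a real wrapper the loop would sit at the top of the script, and the remaining "$@" would be handed on to qrsh instead of being printed. Rebuilding the positional parameters this way keeps arguments containing spaces intact.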

Cheers - Reuti
