[GE users] Disappearing hosts/queues with PE's

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Thu Feb 24 12:39:23 GMT 2005



Hi Jeroen,

That sounds more like you have found a bug. Which version are you using? u3?
There have been quite a few changes in the PE area of SGE. Could
you test this with the current maintrunk? It might already be fixed in the
current code, but I am not sure.

If it still happens, can you assemble all your data and submit an issue? To
me, at least, it sounds as if you found a bug.

Cheers,
Stephan



jeroen.m.kleijer at philips.com wrote:

>
> Hi Reuti,
>
> I've been able to reproduce the problem.
> Using the sge-lam script from Chris (without the added
> 'open (DUMMY, ">/tmp/dummy");'), I try to start a LAM universe; it fails
> and kills all processes on the starting node except qrsh.
> This qrsh process then starts claiming 100% CPU usage while doing nothing.
> If I then look in qmon to see what queues I have, every queue that is
> defined on this starting node (in my case nlcftcs12) has disappeared!
> The job is still running according to qstat, and qdel does kill the job,
> but the qrsh keeps running on nlcftcs12, claiming CPU usage and not
> returning the queue info in qmon.
> When I kill the offending qrsh by hand, the host shows up again in
> qmon (after about half a minute).
> Should this be submitted to the dev-list?
>
> Met vriendelijke groeten / Kind regards
>
> Jeroen Kleijer
> Unix Systeembeheer
> Philips Applied Technologies
>
> *Reuti <reuti at staff.uni-marburg.de>*
> 2005-02-24 12:56 PM
> Please respond to users
>
>         To:        users at gridengine.sunsource.net
>         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
>         Subject:   Re: [GE users] Disappearing hosts/queues with PE's
>
> Quoting jeroen.m.kleijer at philips.com:
>
> <snip>
> > Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where
>
> BTW: what is the maximum size that -V can handle? It seems that in 5.3p6
> there is a 10 kB limit. As `set | wc -c` shows more than that, it stops in
> the middle and gives an error, so we use -V only inside the PE for qrsh
> and never on the command line.
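
A rough way to check this before submitting is to count the bytes of the exported environment and compare them with the limit. This is only a sketch: the 10 kB default is the figure observed above for 5.3p6, not a documented constant, and env_size and env_fits are hypothetical helper names. Note that `env` lists only exported variables, so it can report less than `set | wc -c`.

```shell
#!/bin/sh
# Assumed limit: the 10 kB observed in this thread for 5.3p6 (not a documented value).
LIMIT_BYTES=${LIMIT_BYTES:-10240}

# env_size: print the byte count of the exported environment.
env_size() {
    env | wc -c | tr -d ' '
}

# env_fits LIMIT: succeed when the exported environment is at most LIMIT bytes.
env_fits() {
    [ "$(env_size)" -le "$1" ]
}

if env_fits "$LIMIT_BYTES"; then
    echo "environment is $(env_size) bytes: -V should fit"
else
    echo "environment is $(env_size) bytes: larger than the assumed limit"
fi
```

If the environment turns out to be too large, passing only the variables the job needs with `-v VAR1,VAR2` instead of `-V` is one way around it.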
>
> > hello is a very simple mpi program) gives me the following:
> > The PE lammpi-32bits is started and on the first host I can see a lamd,
> > qrsh, qrsh_starter and lamhalt running.
> > These processes keep running until I kill the job and therefore the LAM
> > universe is never properly started. (see output at the end)
> > What does cause me some concern is that in some cases (not
> > replicable) I see missing queues or, rather, hosts.
> > When I open up qmon and try to see the state of the different queues,
> > I miss a couple of batch.q queues on several hosts.
> > Killing the job usually makes them show up again, though they return
> > in an error state:
> > ++++++++++++++++++++++++++++++++++++++++
> > Queue: batch.q at nlcftcs12
> >         queue batch.q at nlcftcs12 marked QERROR as a result of 2717's
> > failure at host nlcftcs12.
> > ++++++++++++++++++++++++++++++++++++++++
> >
> > Though I'm not particularly fond of a queue in an error state, the
> > queue completely disappearing and reappearing when the job is killed
> > leaves me a bit puzzled.
> > I can't seem to find anything related in the local messages file of
> > nlcftcs12.
>
> During my tests I only got the error state; a queue never vanished.
>
> > Perhaps I'm doing something very wrong with starting up the LAM
> > universe, so I'm eagerly awaiting Reuti's howto's / hints regarding
> > tight integration of SGE+LAM.
>
> In short: it's a race condition in which qrsh on the slaves is killed
> before lamd is up. You'll see in the Howto how to get around it.
>
> > Has anybody seen this before? (The major problem is that it isn't
> > replicable.)
> > And if anyone knows what to do about the lam problem seen in the output
> > "ksh: ksh: -: unknown option" I'd be happy to hear about it.
>
> LAM is sending the option -n to the script, and so the script passes it
> on to qrsh, which doesn't know it and forwards it to the shell:
>
> $ ksh -c "-n ls"
> ksh: ksh: - : unknown option
>
> I found more than one version of Chris' original script, some with and
> some without the removal of this parameter. The rsh-wrapper e.g. in
> $SGE_ROOT/mpi takes care of it.
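
The removal of this parameter could look roughly like the sketch below. It is not the code from Chris' script or from the wrapper in $SGE_ROOT/mpi; run_without_n is a hypothetical helper, and printf stands in for the real remote-start command so the example runs anywhere.

```shell
#!/bin/sh
# run_without_n CMD ARGS...: run CMD with every argument that is exactly "-n" removed.
run_without_n() {
    cmd=$1
    shift
    # Rebuild the positional parameters in place, skipping "-n".
    for arg in "$@"; do
        shift
        if [ "$arg" = "-n" ]; then
            continue
        fi
        set -- "$@" "$arg"
    done
    "$cmd" "$@"
}

# Demonstration: printf stands in for the command that would start the remote process.
run_without_n printf '%s\n' nlcftcs12 -n ls
```

This simple version drops every "-n" it sees, including one meant for the remote command itself; a real wrapper would probably only strip it from the leading options.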
>
> Cheers - Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net