[GE users] Disappearing hosts/queues with PE's

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Thu Feb 24 13:33:04 GMT 2005



jeroen.m.kleijer at philips.com wrote:

>
> Hi Stephan,
>
> It is u3 which I'm running at the moment.
> By maintrunk, do you mean the latest and greatest cvs version?

Hm.. latest yes, but greatest... that's what we hope... :-)
Cheers,
Stephan

> Because if it is, I'll have to wait till tonight so I can download the
> -current (cvs) version. (The corporate firewall is in the way at the moment.)
>
> Met vriendelijke groeten / Kind regards
>
> Jeroen Kleijer
> Unix Systeembeheer
> Philips Applied Technologies
>
> *Stephan Grell - Sun Germany - SSG - Software Engineer
> <stephan.grell at sun.com>*
>
> 2005-02-24 01:39 PM
> Please respond to users
>
>         To:        users at gridengine.sunsource.net
>         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
>         Subject:        Re: [GE users] Disappearing hosts/queues with PE's
>
> Hi Jeroen,
>
> that sounds as if you have found a bug. Which version are you using? u3?
> There have been quite a few changes in the PE area of SGE. Could
> you test this with the current maintrunk? It might be fixed in the
> current code, but I am not sure.
>
> If it still happens, can you assemble all your data and submit an issue?
> To me, at least, it sounds as if you have found a bug.
>
> Cheers,
> Stephan
>
>
>
> jeroen.m.kleijer at philips.com wrote:
>
> >
> > Hi Reuti,
> >
> > I've been able to reproduce the problem.
> > Using the sge-lam script from Chris (without the added "open (DUMMY,
> > ">/tmp/dummy");"), I try to start a LAM universe. It fails and kills
> > all processes on the starting node except qrsh.
> > This qrsh process then starts claiming 100% CPU usage while doing
> > nothing.
> > If I then look in qmon to see what queues I have, every queue that is
> > defined on this starting node (in my case nlcftcs12) has disappeared!
> > The job is still running according to qstat, and qdel does kill the job,
> > but the qrsh keeps running on nlcftcs12, claiming CPU usage, and the
> > queue info does not return in qmon.
> > When I kill the offending qrsh by hand, the host shows up again in
> > qmon (after about half a minute).
> > Should this be submitted to the dev-list?
> >
> > Met vriendelijke groeten / Kind regards
> >
> > Jeroen Kleijer
> > Unix Systeembeheer
> > Philips Applied Technologies
> >
> > *Reuti <reuti at staff.uni-marburg.de>*
> >
> > 2005-02-24 12:56 PM
> > Please respond to users
> >
> >         To:        users at gridengine.sunsource.net
> >         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
> >         Subject:        Re: [GE users] Disappearing hosts/queues with PE's
> >
> > Quoting jeroen.m.kleijer at philips.com:
> >
> > <snip>
> > > Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello"
> > > (where
> >
> > BTW: what is the maximum size for -V to be handled? It seems that in
> > 5.3p6 there is a 10 kB limit. As `set | wc -c` shows more, it will stop
> > in the middle and give an error, so we use it only inside the PE for
> > qrsh and never on the command line.
> >
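
A rough way to check the environment size before relying on -V is
sketched below. It is only an illustration: the 10240-byte threshold is
an assumption taken from the "10 kB" figure mentioned above for 5.3p6,
and the function names are made up for this example. It counts `env`
output (exported variables only) rather than `set` output, on the
assumption that -V passes on exported variables, whereas `set` also
lists unexported shell variables and functions.

```shell
#!/bin/sh
# Sketch: estimate how many bytes the exported environment occupies.
# (env_bytes and env_within_limit are hypothetical helper names.)
env_bytes() {
    env | wc -c | tr -d ' '
}

# Return success if the environment fits within the given byte limit.
env_within_limit() {
    [ "$(env_bytes)" -le "$1" ]
}

# ASSUMPTION: 10240 bytes, from the "10 kB" figure quoted for 5.3p6.
LIMIT_BYTES=10240

if env_within_limit "$LIMIT_BYTES"; then
    echo "environment: $(env_bytes) bytes, within the assumed limit"
else
    echo "environment: $(env_bytes) bytes, above the assumed limit; -V may fail"
fi
```

If the check reports a size above the limit, passing only the needed
variables explicitly is one way to avoid exporting everything.
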
> > > hello is a very simple mpi program) gives me the following:
> > > The PE lammpi-32bits is started and on the first host I can see a
> > > lamd, qrsh, qrsh_starter and lamhalt running.
> > > These processes keep running until I kill the job, and therefore the
> > > LAM universe is never properly started. (see output at the end)
> > > What does cause me some concern is that in some cases (not
> > > replicable) I see missing queues or, rather, hosts.
> > > When I open up qmon and try to see the state of the different queues,
> > > I miss a couple of batch.q queues on several hosts.
> > > Killing the job usually makes them show up again, though they return
> > > in an error state:
> > > ++++++++++++++++++++++++++++++++++++++++
> > > Queue: batch.q at nlcftcs12
> > >         queue batch.q at nlcftcs12 marked QERROR as a result of 2717's
> > > failure at host nlcftcs12.
> > > ++++++++++++++++++++++++++++++++++++++++
> > >
> > > Though I'm not particularly fond of a queue in an error state, the
> > > queue completely disappearing and reappearing when the job is killed
> > > leaves me a bit puzzled.
> > > I can't seem to find anything related in the local messages file of
> > > nlcftcs12.
> >
> > During my tests I only got the error state; a queue never vanished.
> >
> > > Perhaps I'm doing something very wrong with starting up the LAM
> > > universe, so I'm eagerly awaiting Reuti's howto's / hints regarding
> > > tight integration of SGE+LAM.
> >
> > In short: it's a race condition in which qrsh on the slaves is killed
> > before lamd is up. You'll see in the Howto how to get around it.
> >
> > > Has anybody seen this before? (major problem is that it isn't
> > > replicable)
> > > And if anyone knows what to do about the lam problem seen in the
> > > output "ksh: ksh: -: unknown option" I'd be happy to hear about it.
> >
> > LAM is sending the option -n to the script, and so the script sends it
> > to qrsh, which doesn't know it and forwards it to the shell:
> >
> > $ ksh -c "-n ls"
> > ksh: ksh: - : unknown option
> >
> > I found more than one version of Chris' original script, some with and
> > some without the removal of this parameter. The rsh-wrapper, e.g. the
> > one in $SGE_ROOT/mpi, takes care of it.
> >
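
The removal of the -n parameter described above could look roughly like
the following sketch. It is not taken from Chris' script or from the
wrapper in $SGE_ROOT/mpi; the function name filter_args is hypothetical,
and dropping only the first "-n" is an assumption about where LAM places
the option. A real wrapper would hand the filtered arguments on to qrsh
instead of printing them.

```shell
#!/bin/sh
# Sketch only: drop the first "-n" from the argument list and print the
# remaining arguments on one line, separated by single spaces.
# NOTE: joining into one string loses quoting of arguments that contain
# spaces; that is acceptable for this illustration.
filter_args() {
    seen=0
    out=""
    for arg in "$@"; do
        if [ "$seen" -eq 0 ] && [ "$arg" = "-n" ]; then
            seen=1          # skip this one "-n", keep any later ones
            continue
        fi
        out="$out${out:+ }$arg"
    done
    printf '%s\n' "$out"
}

# Example call with made-up arguments (node01 is a hypothetical host):
filter_args node01 -n ls -l    # prints: node01 ls -l
```

With the -n gone, the shell on the remote side only sees the command
itself, so the "unknown option" complaint from ksh should not be
triggered by this parameter.
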
> > Cheers - Reuti
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
>
>

