[GE users] Disappearing hosts/queues with PE's

jeroen.m.kleijer at philips.com jeroen.m.kleijer at philips.com
Fri Feb 25 15:33:22 GMT 2005


Hi Stephan,

It took me a while to compile SGE cleanly (especially the bdb stuff, which 
wouldn't work because it was run from an NFS directory... stupid me, 
should've read the docs), but the maintrunk has this problem fixed as far 
as I can tell. Hosts no longer disappear.

I've used the same setup with the sge-lam Perl script (without the open 
file descriptor in qrsh_local) and even though I still get a race condition 
in qrsh, it no longer makes my execution host disappear. The qrsh process 
spawns lamboot (which fails) and dies after three minutes.

The job is then returned to a pending state, waiting to be rescheduled, and 
leaves the queue in which it was previously started in an error state.

This leaves me with another question:

If starting a parallel environment manages to leave a queue in an error 
state, the offending job gets placed back in a pending state. When 
resources become available it goes to another queue/host, but most likely 
this will fail as well, causing a loop that leaves every queue/host in an 
error state.

Is there any way to prevent this?
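(One partial mitigation, as a sketch rather than anything from this thread: clear the error states after a failed PE start and submit the job as non-rerunnable so it does not bounce from host to host. The exact qmod syntax varies by SGE version, and whether -r n covers this particular reschedule path is an assumption on my part.)

```shell
# Clear the error state on the affected queues after a failed PE start
# (older releases use "qmod -c", 6.x uses "qmod -cq").
qmod -c '*'

# Submit the job as not re-runnable, so a failed PE start does not
# bounce it from host to host (assumption: -r n applies to this path).
qsub -r n -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello
```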

Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies









Stephan Grell - Sun Germany - SSG - Software Engineer 
<stephan.grell at sun.com>
2005-02-24 02:33 PM
Please respond to users
 
        To:     users at gridengine.sunsource.net
        cc:     (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
        Subject:        Re: [GE users] Disappearing hosts/queues with PE's
        Classification: 




jeroen.m.kleijer at philips.com wrote:

>
> Hi Stephan,
>
> It is u3 which I'm running at the moment.
> By maintrunk, do you mean the latest and greatest cvs version?

Hm.. latest yes, but greatest... that's what we hope... :-)
Cheers,
Stephan

> Because if it is I'll have to wait till tonight so I can download the 
> -current (cvs) version. (corporate firewall is in the way at the moment)
>
> Met vriendelijke groeten / Kind regards
>
> Jeroen Kleijer
> Unix Systeembeheer
> Philips Applied Technologies
>
>
>
> 
>
>
>
>
> *Stephan Grell - Sun Germany - SSG - Software Engineer 
> <stephan.grell at sun.com>*
>
> 2005-02-24 01:39 PM
> Please respond to users
>
> 
>         To:        users at gridengine.sunsource.net
>         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
>         Subject:        Re: [GE users] Disappearing hosts/queues with 
> PE's
>
>         Classification: 
>
>
>
>
> Hi Jeroen,
>
> that sounds more as if you found a bug. Which version are you using? u3?
> There have been quite a few changes in the PE corner of SGE. Could
> you test this with the current maintrunk? It might be fixed in the
> current code, but I am not sure.
>
> If it still happens, can you assemble all your data and submit an issue? At
> least to me it sounds as if you found a bug.
>
> Cheers,
> Stephan
>
>
>
> jeroen.m.kleijer at philips.com wrote:
>
> >
> > Hi Reuti,
> >
> > I've been able to reproduce the problem.
> > Using the sge-lam script from Chris (without the added open(DUMMY,
> > ">/tmp/dummy"); line) I try to start a LAM universe; it fails and kills all
> > processes on the starting node except qrsh.
> > This qrsh process then starts claiming 100% CPU usage while doing
> > nothing.
> > If I then look in qmon to see what queues I have, every queue that is
> > > defined on this starting node (in my case nlcftcs12) has disappeared!
> > The job is still running according to qstat and qdel does kill the job
> > but the qrsh still keeps on running on the nlcftcs12 claiming CPU
> > usage and not returning the queue info in qmon.
> > When I kill the offending qrsh by hand the host shows up again in
> > qmon. (after about half a minute)
> > Should this be submitted to the dev-list?
> >
> > Met vriendelijke groeten / Kind regards
> >
> > Jeroen Kleijer
> > Unix Systeembeheer
> > Philips Applied Technologies
> >
> >
> >
> > 
> >
> >
> >
> >
> > *Reuti <reuti at staff.uni-marburg.de>*
> >
> > 2005-02-24 12:56 PM
> > Please respond to users
> >
> > 
> >         To:        users at gridengine.sunsource.net
> >         cc:        (bcc: Jeroen M. Kleijer/EHV/CFT/PHILIPS)
> >         Subject:        Re: [GE users] Disappearing hosts/queues with
> > PE's
> >
> >         Classification: 
> >
> >
> >
> >
> > Quoting jeroen.m.kleijer at philips.com:
> >
> > <snip>
> > > Doing a "qsub -pe lammpi-32bits 5 -V -cwd mpiexec C `pwd`/hello" (where
> >
> > BTW: what is the maximum size for -V to be handled? Seems that in
> > 5.3p6 there is a 10 kB limit. Since `set | wc -c` shows more than that,
> > it stops in the middle and gives an error, so we use -V only inside the
> > PE for qrsh and never on the command line.
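Reuti's 10 kB observation can be checked roughly from the shell. Note that `set` also counts shell functions and unexported locals, while `env` counts only exported variables, which is closer to what -V forwards. The 10240-byte threshold below is just his reported figure for 5.3p6, not a documented constant:

```shell
# Rough check of how much environment -V would have to forward.
# 10240 bytes is the ~10 kB limit reported for 5.3p6 (an observed
# figure, not a documented constant).
size=$(env | wc -c)
echo "environment size: $size bytes"
if [ "$size" -gt 10240 ]; then
  echo "over the reported -V limit; pass selected variables with -v instead"
else
  echo "under the reported -V limit"
fi
```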
> >
> > > hello is a very simple MPI program) gives me the following:
> > > The PE lammpi-32bits is started and on the first host I can see a
> > > lamd, qrsh, qrsh_starter and lamhalt running.
> > > These processes keep running until I kill the job, and therefore the
> > > LAM universe is never properly started. (see output at the end)
> > > What does cause me some concern is that in some cases (not
> > > reproducible) I see missing queues, or worse, entire hosts.
> > > When I open up qmon and try to see the state of the different queues
> > > I miss a couple of batch.q queues on several hosts.
> > > Killing the job usually makes them show up again, though they return
> > > in an error state:
> > > ++++++++++++++++++++++++++++++++++++++++
> > > Queue: batch.q at nlcftcs12
> > >         queue batch.q at nlcftcs12 marked QERROR as a result of 2717's
> > > failure at host nlcftcs12.
> > > ++++++++++++++++++++++++++++++++++++++++
> > >
> > > Though I'm not particularly fond of a queue in an error state, the
> > > queue completely disappearing and reappearing when the job is killed
> > > leaves me a bit puzzled.
> > > I can't seem to find anything related in the local messages file of
> > > nlcftcs12.
> >
> > During my tests I only got the error state; I never saw a queue vanish.
> >
> > > Perhaps I'm doing something very wrong with starting up the LAM
> > > universe, so I'm eagerly awaiting Reuti's howtos / hints regarding
> > > tight integration of SGE+LAM.
> >
> > In short: it's a race condition: qrsh on the slaves is killed before
> > lamd is up. You'll see in the Howto how to get around it.
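The workaround amounts to not letting the launching side exit until the daemon is demonstrably up. A minimal, generic sketch of that idea, with `sleep` standing in for lamd and a pidfile as the readiness signal (none of these names come from the sge-lam script):

```shell
#!/bin/sh
# Minimal sketch of the fix: do not let the launching shell (and with
# it qrsh) return before the daemon signals that it is up.
# "exec sleep 2" stands in for lamd; the pidfile is the readiness signal.
pidfile=/tmp/fake_lamd.$$.pid

( echo $$ > "$pidfile"; exec sleep 2 ) &    # "daemon" start

# Poll for the pidfile instead of exiting immediately.
tries=0
while [ ! -f "$pidfile" ] && [ "$tries" -lt 30 ]; do
  sleep 1
  tries=$((tries + 1))
done

if [ -f "$pidfile" ]; then
  echo "daemon up"
  rm -f "$pidfile"
else
  echo "daemon never started" >&2
  exit 1
fi
```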
> >
> > > Has anybody seen this before? (The major problem is that it isn't
> > > reproducible.)
> > > And if anyone knows what to do about the LAM problem seen in the
> > > output "ksh: ksh: -: unknown option" I'd be happy to hear about it.
> >
> > LAM is sending the option -n to the script, and so the script sends it
> > to qrsh, which doesn't know it and forwards it to the shell:
> >
> > $ ksh -c "-n ls"
> > ksh: ksh: - : unknown option
> >
> > I found more than one version of Chris' original script, some with and
> > some without the removal of this parameter. The rsh-wrapper, e.g. in
> > $SGE_ROOT/mpi, takes care of it.
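The parameter removal boils down to filtering -n out of the argument list before it reaches qrsh. A minimal sketch of such a wrapper (the final echo stands in for the real qrsh invocation, which is not reproduced here):

```shell
#!/bin/sh
# Sketch of an rsh-wrapper that drops LAM's -n option before the
# arguments reach qrsh; otherwise qrsh passes -n on to the shell,
# producing "ksh: ksh: - : unknown option".
args=""
for a in "$@"; do
  if [ "$a" = "-n" ]; then
    continue          # drop the option LAM inserts
  fi
  args="$args $a"
done
# A real wrapper would exec qrsh here; echo stands in for it.
echo "would run: qrsh$args"
```

Called as `./rsh-wrapper -n somehost ls` this prints `would run: qrsh somehost ls`. (The simple string accumulation loses argument quoting, which is fine for a sketch but not for arguments containing spaces.)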
> >
> > Cheers - Reuti
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
>
>
>
>






