[GE users] Workaround for bug 2890

rayson rayrayson at gmail.com
Fri Feb 20 19:10:00 GMT 2009


With a hack as messy as this, isn't SGE 6.1 a better solution?

Rayson



On 2/20/09, abercromby <basingwerk at talk21.com> wrote:
> It's a test set-up that I've just made. One qmaster etc.,
> with one execd node. I've installed the software as per the
> installation guide, yet the scheduler (thread?)
> periodically drops out. It doesn't even last ten minutes.
> Just a few minutes, really. The jobs back up in "qw".
> I am using this perl script to "fix" things right now:
>
> while(1) {
>  print("Monitoring ... \n");
>  open(QCONFSECL,"qconf -secl 2>&1 |") or die("failed to read qconf -secl");
>
>  my $droppedOut = 1;
>  while(<QCONFSECL>) {
>
>    if ($_ =~ /scheduler/) {
>      $droppedOut = 0;
>    }
>  }
>  close(QCONFSECL);
>
>  if ($droppedOut) {
>    print("Attempting a scheduler thread cycle\n");
>    # Get rid of the scheduler
>    system("qconf -kt scheduler");
>
>    my $success = 0;
>    while (! $success ) {
>      sleep 5;
>      open(QCONFAT ,"qconf -at scheduler 2>&1 |") or die("failed to read qconf -at");
>      while (<QCONFAT>) {
>        if ($_ =~ /scheduler has been started/) {
>          print("Done scheduler thread cycle\n");
>          $success = 1;
>        }
>      }
>      close(QCONFAT);
>      if ($success) {
>        last;  # Perl's loop exit is 'last', not 'break'
>      }
>    }
>  }
>  else {
>    sleep 5;
>  }
> }
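
The watchdog above can be split into two small pieces that are easier to check without a live qmaster. This is only a sketch under stated assumptions: the names `scheduler_registered`, `restart_scheduler` and `run_command` are hypothetical, the row format of `qconf -secl` is assumed from the BEFORE listing quoted later in this thread, and the success string "scheduler has been started" is taken from the script above rather than from any documentation. Unlike the original loop, the restart gives up after a fixed number of attempts.

```perl
use strict;
use warnings;

# Return true if the text printed by `qconf -secl` lists a scheduler
# event client. Matches the NAME column of a data row (assumed format:
# "<id> scheduler <host>") rather than any occurrence of the word.
sub scheduler_registered {
    my ($secl_output) = @_;
    for my $line (split /\n/, $secl_output) {
        return 1 if $line =~ /^\s*\d+\s+scheduler\s+\S+/;
    }
    return 0;
}

# Stop the scheduler thread, then try to start it again, giving up
# after $max_tries attempts. $run is a code ref that takes a command
# string and returns its output; it is injected so the logic can be
# exercised with a fake runner. Returns the attempt number that
# succeeded, or 0 if none did.
sub restart_scheduler {
    my ($run, $max_tries, $delay) = @_;
    $run->('qconf -kt scheduler 2>&1');
    for my $try (1 .. $max_tries) {
        sleep $delay if $delay;
        my $out = $run->('qconf -at scheduler 2>&1');
        return $try if $out =~ /scheduler has been started/;
    }
    return 0;
}

# Default runner: capture a command's combined output via backticks.
sub run_command {
    my ($cmd) = @_;
    my $out = `$cmd`;
    return defined $out ? $out : '';
}
```

A monitoring loop would then call `scheduler_registered(run_command('qconf -secl 2>&1'))` every few seconds and, when it returns false, call `restart_scheduler(\&run_command, 5, 5)` and log a warning if the result is 0.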
>
>
>
>
> --- On Fri, 20/2/09, reuti <reuti at staff.uni-marburg.de> wrote:
>
> > From: reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [GE users] Workaround for bug 2890
> > To: users at gridengine.sunsource.net
> > Date: Friday, 20 February, 2009, 5:55 PM
> > On 20.02.2009 at 17:47, abercromby wrote:
> >
> > > I've just installed 6.2u1, but it fails after 10 minutes or so. The
> > > mode of failure is that the qmaster unsubscribes the scheduler:
> >
> > Does this mean that one run of the scheduler takes more than 10
> > minutes? How many jobs do you have in the system?
> >
> > -- Reuti
> >
> >
> > > BEFORE:
> > > # qconf -secl; qstat
> > >       ID NAME            HOST
> > > --------------------------------------------------
> > >        1 scheduler       r178-n51.ph.liv.ac.uk
> > >
> > > AFTER:
> > > # qconf -secl; qstat
> > > no event clients registered
> > >
> > > After that, all jobs stay in qw until I restart everything.
> > > The issue is described in 2890, but no workaround is given.
> > > Does anyone know how to get around this? Right now, it's a
> > > showstopper.
> > >
> > > Steve
> > >
> > > ------------------------------------------------------
> > > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=110684
> > >
> > > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> > >

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=110767

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


