[GE users] Workaround for bug 2890

abercromby basingwerk at talk21.com
Fri Feb 20 18:04:51 GMT 2009


It's a test set-up that I've just made: one qmaster etc.,
with one execd node. I installed the software as per the
installation guide, yet the scheduler (thread?)
periodically drops out. It doesn't even last ten minutes,
just a few minutes really, and the jobs then back up in "qw".
I am using this Perl script to "fix" things right now:

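#!/usr/bin/perl
use strict;
use warnings;

# Poll the qmaster's event client list; if the scheduler is no longer
# registered, kill the scheduler thread and start it again.
while (1) {
  print("Monitoring ... \n");
  open(my $secl, "qconf -secl 2>&1 |") or die("failed to read qconf -secl");

  # Assume the scheduler has dropped out until it shows up in the list.
  my $droppedOut = 1;
  while (<$secl>) {
    $droppedOut = 0 if /scheduler/;
  }
  close($secl);

  if ($droppedOut) {
    print("Attempting a scheduler thread cycle\n");
    # Get rid of the scheduler
    system("qconf -kt scheduler");

    # Retry the restart every 5 seconds until qconf confirms it.
    # The loop condition ends the loop, so no explicit exit is needed
    # (Perl's loop exit is "last"; "break" is not a loop control keyword).
    my $success = 0;
    while (!$success) {
      sleep 5;
      open(my $at, "qconf -at scheduler 2>&1 |") or die("failed to read qconf -at");
      while (<$at>) {
        if (/scheduler has been started/) {
          print("Done scheduler thread cycle\n");
          $success = 1;
        }
      }
      close($at);
    }
  }
  else {
    sleep 5;
  }
}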
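
One weakness of the check in the script is that /scheduler/ matches any
line containing that word, including a host name or an error message.
A stricter test could look only at the NAME column. The following is a
minimal sketch, not a tested fix: the function name scheduler_registered
is made up for illustration, and it assumes that "qconf -secl" always
prints rows in the "ID NAME HOST" layout shown in the BEFORE listing
quoted below, which is the only sample I have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if any data row of `qconf -secl` output
# names the scheduler in its NAME column, else 0.
# Assumed row layout (taken from the listing in this thread):
#   <ID> <NAME> <HOST>
sub scheduler_registered {
    my (@lines) = @_;
    for my $line (@lines) {
        # Data rows start with a numeric ID. The header, the dashed rule
        # and "no event clients registered" do not match this pattern.
        if ($line =~ /^\s*\d+\s+(\S+)\s+\S+/) {
            return 1 if $1 eq 'scheduler';
        }
    }
    return 0;
}

# Sample outputs copied from the BEFORE/AFTER listings in this thread.
my @before = (
    "      ID NAME            HOST\n",
    "--------------------------------------------------\n",
    "       1 scheduler       r178-n51.ph.liv.ac.uk\n",
);
my @after = ("no event clients registered\n");

print "before: ", scheduler_registered(@before), "\n";   # prints "before: 1"
print "after: ",  scheduler_registered(@after),  "\n";   # prints "after: 0"
```

In the monitoring loop, the lines read from the "qconf -secl" pipe could
be collected into an array and passed to this function in place of the
plain regex match.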




--- On Fri, 20/2/09, reuti <reuti at staff.uni-marburg.de> wrote:

> From: reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [GE users] Workaround for bug 2890
> To: users at gridengine.sunsource.net
> Date: Friday, 20 February, 2009, 5:55 PM
> On 20.02.2009 at 17:47, abercromby wrote:
> 
> > I've just installed 6.2u1, but it fails after 10 minutes or so.
> > The mode of failure is that the qmaster unsubscribes the scheduler:
> 
> Does this issue mean that one run of the scheduler takes more than
> 10 minutes? How many jobs do you have in the system?
> 
> -- Reuti
> 
> 
> > BEFORE:
> > # qconf -secl; qstat
> >       ID NAME            HOST
> > --------------------------------------------------
> >        1 scheduler       r178-n51.ph.liv.ac.uk
> >
> > AFTER:
> > # qconf -secl; qstat
> > no event clients registered
> >
> > After that, all jobs stay in qw until I restart everything.
> > The issue is described in 2890, but no workaround is given.
> > Does anyone know how to get around this? Right now, it's a
> > showstopper.
> >
> > Steve
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=110684
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=110725
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=110729



