[GE users] Workaround for bug 2890

reuti reuti at staff.uni-marburg.de
Fri Feb 20 18:25:31 GMT 2009


On 20.02.2009 at 19:04, abercromby wrote:

> It's a test set-up that I've just made. One qmaster etc.,
> with one execd node. I've installed the software as per the
> installation guide, yet the scheduler (thread?)
> periodically drops out. It doesn't even last ten minutes.
> Just a few minutes, really. The jobs back up in "qw".
> I am using this perl script to "fix" things right now:

Are there any error messages in /tmp? Can you set the "loglevel" in
SGE's configuration to info? For me, 6.2u1 is running fine. What
is your scheduling interval set to?
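
A rough sketch for reading both settings from the command line. It
assumes the usual "name value" layout printed by `qconf -sconf` (global
configuration) and `qconf -ssconf` (scheduler configuration);
get_setting is only a hypothetical helper name, not part of SGE.

```shell
#!/bin/sh
# Sketch only: print the value of one setting from "name value" style
# configuration text read on stdin. get_setting is a hypothetical helper.
get_setting() {
    # $1 = setting name; print the second field of the first matching line
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# On a live cluster (needs qconf in PATH), something like:
#   qconf -sconf  | get_setting loglevel
#   qconf -ssconf | get_setting schedule_interval
```

If the first command reports something other than log_info, the value
can be changed in the global configuration (for example with
`qconf -mconf`).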

-- Reuti


> while(1) {
>   print("Monitoring ... \n");
>   open(QCONFSECL,"qconf -secl 2>&1 |") or die("failed to read qconf -secl");
>
>   my $droppedOut = 1;
>   while(<QCONFSECL>) {
>
>     if ($_ =~ /scheduler/) {
>       $droppedOut = 0;
>     }
>   }
>   close(QCONFSECL);
>
>   if ($droppedOut) {
>     print("Attempting a scheduler thread cycle\n");
>     # Get rid of the scheduler
>     system("qconf -kt scheduler");
>
>     my $success = 0;
>     while (! $success ) {
>       sleep 5;
>       open(QCONFAT ,"qconf -at scheduler 2>&1 |") or die("failed to read qconf -at");
>       while (<QCONFAT>) {
>         if ($_ =~ /scheduler has been started/) {
>           print("Done scheduler thread cycle\n");
>           $success = 1;
>         }
>       }
>       close(QCONFAT);
>       if ($success) {
>         last;  # Perl's loop exit is "last"; "break" is not a loop control statement
>       }
>     }
>   }
>   else {
>     sleep 5;
>   }
> }
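
The detection step of the script above could also be sketched in shell.
This assumes the `qconf -secl` output format shown in the BEFORE/AFTER
listings quoted further down; scheduler_registered is a hypothetical
helper name. Unlike the /scheduler/ pattern in the Perl script, it
matches only when the NAME column is exactly "scheduler".

```shell
#!/bin/sh
# Sketch only: read `qconf -secl` output on stdin and succeed (exit 0)
# when an event client named "scheduler" is listed, fail otherwise.
scheduler_registered() {
    awk '$2 == "scheduler" { found = 1 } END { exit !found }'
}

# On a live cluster (needs qconf in PATH), something like:
#   qconf -secl 2>&1 | scheduler_registered || echo "scheduler dropped out"
```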
>
>
>
>
> --- On Fri, 20/2/09, reuti <reuti at staff.uni-marburg.de> wrote:
>
>> From: reuti <reuti at staff.uni-marburg.de>
>> Subject: Re: [GE users] Workaround for bug 2890
>> To: users at gridengine.sunsource.net
>> Date: Friday, 20 February, 2009, 5:55 PM
>> On 20.02.2009 at 17:47, abercromby wrote:
>>
>>> I've just installed 6.2u1, but it fails after 10
>> minutes or so. The
>>> mode of failure is that the qmaster unsubscribes the
>> scheduler:
>>
>> Does this issue mean that one run of the scheduler takes more
>> than 10 minutes? How many jobs do you have in the system?
>>
>> -- Reuti
>>
>>
>>> BEFORE:
>>> # qconf -secl; qstat
>>>       ID NAME            HOST
>>> --------------------------------------------------
>>>        1 scheduler       r178-n51.ph.liv.ac.uk
>>>
>>> AFTER:
>>> # qconf -secl; qstat
>>> no event clients registered
>>>
>>> After that, all jobs stay in qw, until I restart
>> everything.
>>> The issue is described in 2890, but no workaround is
>> given.
>>> Does anyone know how to get around this? Right now,
>> it's a
>>> showstopper.
>>>
>>> Steve
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>> dsForumId=38&dsMessageId=110684
>>>
>>> To unsubscribe from this discussion, e-mail: [users-
>>> unsubscribe at gridengine.sunsource.net].
>>>
>>
>



