[GE users] Fixing a broken Berkeley database?

Orlando Richards orlando.richards at ed.ac.uk
Wed Oct 22 11:10:26 BST 2008


Hi all,

I'm happy to report that we have repaired the database, by using 
"db_dump" followed by "db_load".

Full command list:

service sgemaster stop # on failover server
service sgemaster stop # on master server

cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak

cd spooldb
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
$SGE_ROOT/utilbin/lx24-amd64/db_recover
$SGE_ROOT/utilbin/lx24-amd64/db_dump -f sge.out sge
mv sge sge.old
$SGE_ROOT/utilbin/lx24-amd64/db_load -f sge.out sge
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge


service sgemaster start # on master server
service sgemaster start # on failover server
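
The dump-and-reload part of the list above could be wrapped in a small
shell function along the following lines. This is a sketch only: the
function name rebuild_bdb, its two arguments and the ".new" suffix are
made up for illustration and are not part of SGE or Berkeley DB. It also
differs from the commands above in one respect: it loads into a separate
file and runs db_verify on that file before moving the original aside.
It assumes the qmaster is stopped on both servers and that spooldb has
already been backed up.

```shell
# Hypothetical helper: rebuild a Berkeley DB file via db_dump / db_load.
# Usage: rebuild_bdb <directory holding the db_* tools> <database file>
# Assumes sgemaster is stopped and a backup copy of spooldb exists.
rebuild_bdb() {
    bindir=$1
    db=$2

    # Dump the database to a flat file; give up if the dump fails.
    "$bindir/db_dump" -f "$db.out" "$db" || return 1

    # Load the dump into a new file and verify it before
    # touching the original database.
    rm -f "$db.new"
    "$bindir/db_load" -f "$db.out" "$db.new" || return 1
    "$bindir/db_verify" "$db.new" || return 1

    # Only now move the old file aside and put the rebuilt one in place.
    mv "$db" "$db.old" && mv "$db.new" "$db"
}
```

Run from inside the spooldb directory, the call matching the commands
above would be:  rebuild_bdb "$SGE_ROOT/utilbin/lx24-amd64" sge
If any step fails, the function returns non-zero and the original file
is left where it was.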


Orlando Richards wrote:
> Orlando Richards wrote:
>> skip at pobox.com wrote:
>>>     Orlando> I'm guessing that there is a corrupt entry in the
>>>     Orlando> database - is there any simple way to repair it?
>>>
>>> If you installed Berkeley DB from source (or perhaps installed a dev
>>> RPM on a Linux system) you should have a program called db_recover
>>> (or maybe dbXY_recover where X and Y are the major and minor version
>>> numbers).  If your database is corrupt that will quite possibly fix it.
>>>
>>> Longer term the more important question to answer is how it got
>>> corrupted in the first place.  Make sure multi-program access to the
>>> database is either mediated by a single program your clients
>>> communicate with or that your programs use some sort of file locking
>>> scheme to prevent multiple simultaneous accesses.
>>>
>>
>> Thanks for that Skip - I've installed the db4-utils package on our 
>> RedHat box that includes db_recover. However - we're not running a 
>> separate Berkeley DB server, but instead the file-based option that 
>> SGE has "bundled" up with it, so I'm a bit confused as to how to 
>> recover the database. From what I can tell, db_recover will replay log 
>> files to rebuild a database. Unfortunately, the corruption seems to 
>> have happened some time ago (around 3 months ago from what we can tell 
>> from the logs), and the database logs don't exist for anything before 
>> the immediate past.
>>
>> Do you know if there's a way for the integrity of the database to be 
>> checked and/or repaired? I'm slightly suspicious that the database 
>> entries might look fine to an unaware integrity checker, and only SGE 
>> is able to tell (or a human if we could see the contents) that there 
>> is a problem.
>>
>> We suspect that the corruption was down to an unclean SGE qmaster 
>> crash (or several of them) - we had a spate of crashes due to a memory 
>> leak and some oddly formatted user jobs. We use GPFS for the file
>> system storing the database files, which implements fully
>> POSIX-compliant locking.
>>
>>
> 
> A bit more on this - running db_verify (both the one from RedHat and
> the one in $SGE_ROOT/utilbin/lx24-amd64) against a copy of the
> $SGE_ROOT/$SGE_CELL/spool/spooldb folder gives:
> 
> [root@eddie01 spooldb]# $SGE_ROOT/utilbin/lx24-amd64/db_verify sge
> db_verify: Page 65: item 17 of unrecognizable type
> db_verify: Page 65: item 18 of unrecognizable type
> db_verify: Page 65: item 19 of unrecognizable type
> db_verify: Page 65: item 20 of unrecognizable type
> db_verify: Page 65: item 21 of unrecognizable type
> db_verify: Page 65: item 22 of unrecognizable type
> db_verify: Page 65: item 23 of unrecognizable type
> db_verify: Page 65: item 24 of unrecognizable type
> db_verify: Page 65: item 25 of unrecognizable type
> db_verify: Page 65: item 26 of unrecognizable type
> db_verify: Page 65: item 27 of unrecognizable type
> db_verify: Page 65: item 28 of unrecognizable type
> db_verify: Page 65: item 29 of unrecognizable type
> db_verify: Page 65: item 30 of unrecognizable type
> db_verify: Page 65: item 31 of unrecognizable type
> db_verify: Page 65: item 32 of unrecognizable type
> db_verify: Page 65: item 33 of unrecognizable type
> db_verify: Page 65: item 34 of unrecognizable type
> db_verify: Page 65: item 35 of unrecognizable type
> db_verify: Page 65: item 36 of unrecognizable type
> db_verify: Page 65: gap between items at offset 14216
> db_verify: Page 65: item order check unsafe: skipping
> db_verify: sge: DB_VERIFY_BAD: Database verification failed
> 
> This isn't fixed by db_recover, which produces the following output:
> 
> [root@eddie01 spooldb]# $SGE_ROOT/utilbin/lx24-amd64/db_recover -v
> Finding last valid log LSN: file: 11038 offset 2860929
> Recovery starting from [11038][2860801]
> Recovery complete at Thu Oct 16 14:40:01 2008
> Maximum transaction ID 800595b1 Recovery checkpoint [11038][2860929]
> 
> 
> Any ideas where to go from here? A re-install to fix the database will 
> be very painful...
> 
> 
> -- 
> Orlando.
> 


-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.
