[GE users] Fixing a broken Berkeley database?

Orlando Richards orlando.richards at ed.ac.uk
Thu Oct 16 14:44:12 BST 2008


Orlando Richards wrote:
> skip at pobox.com wrote:
>>     Orlando> I'm guessing that there is a corrupt entry in the database - is
>>     Orlando> there any simple way to repair it?
>>
>> If you installed Berkeley DB from source (or perhaps installed a dev RPM on
>> a Linux system) you should have a program called db_recover (or maybe
>> dbXY_recover where X and Y are the major and minor version numbers).  If
>> your database is corrupt that will quite possibly fix it.
>>
>> Longer term the more important question to answer is how it got corrupted in
>> the first place.  Make sure multi-program access to the database is either
>> mediated by a single program your clients communicate with or that your
>> programs use some sort of file locking scheme to prevent multiple
>> simultaneous accesses.
>>
> 
> Thanks for that Skip - I've installed the db4-utils package on our 
> RedHat box that includes db_recover. However - we're not running a 
> separate Berkeley DB server, but instead the file-based option that SGE 
> has "bundled" up with it, so I'm a bit confused as to how to recover the 
> database. From what I can tell, db_recover will replay log files to 
> rebuild a database. Unfortunately, the corruption seems to have happened 
> some time ago (around 3 months ago from what we can tell from the logs), 
> and the database logs don't exist for anything before the immediate past.
> 
> Do you know if there's a way for the integrity of the database to be 
> checked and/or repaired? I'm slightly suspicious that the database 
> entries might look fine to an unaware integrity checker, and only SGE is 
> able to tell (or a human if we could see the contents) that there is a 
> problem.
> 
> We suspect that the corruption was down to an unclean SGE qmaster crash 
> (or several of them) - we had a spate of crashes due to a memory leak 
> and some oddly formatted user jobs. The database files are stored on 
> GPFS, which implements fully POSIX-compliant locking.
> 
> 

A bit more on this - running db_verify (both the one from RedHat and 
the one in $SGE_ROOT/utilbin/lx24-amd64) against a copy of the 
$SGE_ROOT/$SGE_CELL/spool/spooldb folder gives:

[root@eddie01 spooldb]# /exports/applications/sge/utilbin/lx24-amd64/db_verify sge
db_verify: Page 65: item 17 of unrecognizable type
db_verify: Page 65: item 18 of unrecognizable type
db_verify: Page 65: item 19 of unrecognizable type
db_verify: Page 65: item 20 of unrecognizable type
db_verify: Page 65: item 21 of unrecognizable type
db_verify: Page 65: item 22 of unrecognizable type
db_verify: Page 65: item 23 of unrecognizable type
db_verify: Page 65: item 24 of unrecognizable type
db_verify: Page 65: item 25 of unrecognizable type
db_verify: Page 65: item 26 of unrecognizable type
db_verify: Page 65: item 27 of unrecognizable type
db_verify: Page 65: item 28 of unrecognizable type
db_verify: Page 65: item 29 of unrecognizable type
db_verify: Page 65: item 30 of unrecognizable type
db_verify: Page 65: item 31 of unrecognizable type
db_verify: Page 65: item 32 of unrecognizable type
db_verify: Page 65: item 33 of unrecognizable type
db_verify: Page 65: item 34 of unrecognizable type
db_verify: Page 65: item 35 of unrecognizable type
db_verify: Page 65: item 36 of unrecognizable type
db_verify: Page 65: gap between items at offset 14216
db_verify: Page 65: item order check unsafe: skipping
db_verify: sge: DB_VERIFY_BAD: Database verification failed
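
For anyone wanting to repeat the check, the copy-then-verify step can be sketched roughly as below. This is only an illustration: the function names, the DB_VERIFY variable and the scratch directory are made up here and are not part of SGE or Berkeley DB. The sketch assumes the database file is called "sge", as in the run above, and it condenses the output to one line per damaged page.

```shell
#!/bin/sh
# Sketch only: verify a scratch copy of the spool database and condense the
# output. DB_VERIFY and both function names are illustrative assumptions.
DB_VERIFY="${DB_VERIFY:-db_verify}"

# Count db_verify complaints per page; reads db_verify output on stdin.
summarize_verify() {
    awk '/^db_verify: Page [0-9]+:/ {
             page = $3; sub(":", "", page); count[page]++
         }
         END {
             for (p in count)
                 printf "page %s: %d problem line(s)\n", p, count[p]
         }' | sort
}

# Copy the spool directory to a scratch location and verify the copy there,
# so the live database files are never touched.
verify_copy() {
    src="$1"
    dbfile="${2:-sge}"
    scratch=$(mktemp -d) || return 1
    cp -Rp "$src/." "$scratch/" || return 1
    # Fold stderr into the pipe in case the findings are reported there.
    (cd "$scratch" && "$DB_VERIFY" "$dbfile" 2>&1) | summarize_verify
    echo "scratch copy kept in $scratch"
}
```

It would be called as: verify_copy "$SGE_ROOT/$SGE_CELL/spool/spooldb", with DB_VERIFY pointed at whichever db_verify binary is being tested.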

This isn't fixed by db_recover, which produces the following output:

[root@eddie01 spooldb]# /exports/applications/sge/utilbin/lx24-amd64/db_recover -v
Finding last valid log LSN: file: 11038 offset 2860929
Recovery starting from [11038][2860801]
Recovery complete at Thu Oct 16 14:40:01 2008
Maximum transaction ID 800595b1 Recovery checkpoint [11038][2860929]
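
For reference, the other standard Berkeley DB route when log replay does not help is a salvage dump followed by a reload into a fresh file. The following is only a sketch and has not been run against this database: the function name, the DB_DUMP/DB_LOAD variables and the file names are placeholders, it should only ever be pointed at a copy of the spool directory, and whether SGE would accept the rebuilt file is a separate question.

```shell
#!/bin/sh
# Sketch only: salvage readable records from a damaged file into a fresh one.
# DB_DUMP, DB_LOAD and salvage_db are illustrative names, not SGE tooling.
DB_DUMP="${DB_DUMP:-db_dump}"
DB_LOAD="${DB_LOAD:-db_load}"

salvage_db() {
    damaged="$1"
    rebuilt="$2"
    dump="${rebuilt}.dump"
    # -r asks db_dump to salvage what it can from a possibly corrupt file.
    # (-R is the more aggressive variant and may also return deleted or
    # partial records.)
    "$DB_DUMP" -r -f "$dump" "$damaged"
    status=$?
    # A non-zero status is tolerated here, since a damaged file may still
    # yield a usable dump; an empty or missing dump is treated as failure.
    [ -s "$dump" ] || { echo "nothing salvaged" >&2; return 1; }
    "$DB_LOAD" -f "$dump" "$rebuilt" || return 1
    echo "db_dump exit status: $status; rebuilt database: $rebuilt"
}
```

It would be called from inside the copied directory as: salvage_db sge sge.rebuilt, and the result could then be checked again with db_verify before going any further.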


Any ideas where to go from here? A re-install to fix the database will 
be very painful...


--
Orlando.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

