[GE users] DB failure

Chris Dagdigian dag at sonsorol.org
Thu May 11 14:47:58 BST 2006


I know that the 'way of the future' for SGE is Berkeley DB based
spooling, especially once we get all the replication stuff from the
Berkeley product, but for many of the clusters I've worked on, the
performance gain is simply not worth the inconvenience of having all
the spool data in a non-human-readable binary storage format. On
more than one occasion we've also lost entire SGE configurations when
(for instance) buggy Apple XSAN software decided to take a random
unscheduled coffee break. Recovering corrupt Berkeley DB files is not
fun -- for SVN repositories or SGE configurations.

Our rule of thumb now is classic spooling for all clusters of fewer
than 32 nodes, even for people with high job throughput. We tell the
people with high job volumes to get used to Grid Engine for a while
and, when they are ready to start another round of optimization and
performance tuning, to add a switch to Berkeley DB spooling to the
list of potential options. Scheduler tuning, filesystem performance
and end-user workflows seem to have far more impact on performance
and throughput than the underlying spooling technology.

This opinion, of course, is colored by experience with lots of small
systems (2 to 10 nodes on average, it seems) rather than a few massive
installations, so other people's experiences could and probably do
differ from mine.

I was thinking though that it may be a good idea to write up an RFE
for the installation scripts -- maybe a bit more text in the spooling
choice screen that tells people they may want to choose classic mode
if their system is under N nodes in size. The way the docs and
installation scripts look now, a new user will probably always choose
berkeleydb simply because classic is presented in a way that makes
it look (to a new user) either "outdated" or
"not-a-best-practice-anymore".
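
As a rough sketch of what that extra hint could look like: the snippet
below is purely hypothetical and is not taken from the SGE install
scripts (which are shell, not Python). The function name spooling_hint
and the default threshold are illustrative assumptions; the value 32 is
just borrowed from the rule of thumb above, since the RFE leaves N open.

```python
# Hypothetical sketch only: nothing here exists in the real SGE installer.
# The threshold of 32 comes from the rule of thumb in this message; the
# proposed RFE deliberately leaves N unspecified.
DEFAULT_NODE_THRESHOLD = 32


def spooling_hint(node_count, threshold=DEFAULT_NODE_THRESHOLD):
    """Return (suggested_method, message) for a spooling choice screen.

    suggested_method is "classic" for clusters below the threshold and
    None otherwise, meaning no particular suggestion is made.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    if node_count < threshold:
        return (
            "classic",
            "Clusters under %d nodes may want to choose classic spooling: "
            "the spool data stays human-readable and is easier to "
            "maintain." % threshold,
        )
    return (
        None,
        "No specific suggestion; either spooling method may be appropriate.",
    )


if __name__ == "__main__":
    for nodes in (8, 64):
        method, message = spooling_hint(nodes)
        print(nodes, method, message)
```

The point is only that the installer would offer a size-based nudge
instead of implying that classic is obsolete; the actual wording and
cutoff would be for the RFE to settle.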

-Chris






On May 11, 2006, at 9:26 AM, Rayson Ho wrote:

> If no one has a better solution, then you can at least start by
> reading the install script and see how the DB got initialized...
>
> BTW, if your cluster is small, or the volume of jobs is low, you should
> use "classic spooling" - way easier to maintain than BDB...
>
> Rayson
>
>
>
>
>
> On 5/11/06, Ari P Seitsonen <ari.p.seitsonen at iki.fi> wrote:
>>
>> Dear experts on SGE,
>>
>>   The main disc of our small Opteron cluster ran full the other day, and
>> thus SGE (v6.0u7_1, compiled from the source code) crashed. Now I'm trying
>> to restart it again, but all I get is
>>
>> # 05/10/2006 13:29:00|qmaster|curienite|E|couldn't open database environment for server "local spooling", directory "/opt/software/sge/v6.0u7_1-target/default/spool/spooldb": (-30974) DB_RUNRECOVERY: Fatal error, run database recovery
>> # 05/10/2006 13:29:00|qmaster|curienite|E|startup of rule "default rule" in context "berkeleydb spooling" failed
>> # 05/10/2006 13:29:00|qmaster|curienite|C|setup failed
>>
>> when I try to run
>>
>> './default/common/sgemaster start'
>>
>>   It doesn't help even if I do 'db_recover' or 'db_recover -c' before
>> that. Does anyone have an idea what to do? At least how to create a new
>> database; the users are impatiently waiting...
>>
>>     Thanks and greetings,
>>
>>        apsi
>>
>> -=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=- 
>> =*=-=*=-=*=-
>>   Ari P Seitsonen / Ari.P.Seitsonen at iki.fi / http://www.iki.fi/~apsi/
>>   GSM: +33-6-6736 3820
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
