[GE users] Berkeley DB crash

ywan ywan at ed.ac.uk
Mon Feb 23 10:33:50 GMT 2009


I'm running SGE 6.1u4 on my cluster, using Berkeley DB to store the 
configuration info. The database crashed early this morning, which caused 
the job slots on the slave nodes to fail. sgemaster and the scheduler were 
migrated automatically after the failure, but the database is still in an 
error state.

I manually restarted sgemaster and the scheduler on the shadow master node 
this morning, which got the queues working again.

Can you help me find out why Berkeley DB crashed? Here is the log from the 
master node (eddie04).

==============================================================================

02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (schedd:1) on host "eddie04.beowulf.cluster"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "schedd" with id 1 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:21667) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 17 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:41741) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 15 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:45881) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 11 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:45790) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 18 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:45969) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 27 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:45972) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 30 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:45978) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 31 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:3968) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 35 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:3969) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 36 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:42044) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 14 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:27364) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 13 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:11694) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 12 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:14425) on host "frontend01.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 16 
deregistered
02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
seconds for event client (qsub:25338) on host "frontend02.ecdf.ed.ac.uk"
02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 19 
deregistered
02/23/2009 04:26:00|qmaster|eddie04|W|gethostbyname(frontend01.ecdf.ed.ac.uk) 
took 17 seconds and returns success
02/23/2009 04:26:10|qmaster|eddie04|E|Corrupted database detected. Freeing 
all resources to prepare for a reconnect with recovery.
02/23/2009 04:26:10|qmaster|eddie04|E|error ending a transaction: (-30974) 
DB_RUNRECOVERY: Fatal error, run database recovery
02/23/2009 04:26:10|qmaster|eddie04|W|rule "default rule" in spooling 
context "berkeleydb spooling" failed writing an object
02/23/2009 04:26:10|qmaster|eddie04|E|got max. unheard timeout for target 
"execd" on host "node137.beowulf.cluster", can't delivering job "3117577"
02/23/2009 04:26:10|qmaster|eddie04|W|rescheduling job 3117577.1

=============================================================================
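For anyone sifting through a log like the one above, here is a minimal Python sketch that rejoins the records wrapped by the mail client and splits each one into fields, so the E-level entries around the DB_RUNRECOVERY error are easier to pick out. The field layout (date and time, daemon, host, level, message, separated by "|") is inferred from the excerpt only, and the names join_wrapped, parse_record and SAMPLE are hypothetical helpers for illustration, not part of SGE.

```python
import re

# A record starts with a date such as 02/23/2009; anything else is treated
# as the continuation of a wrapped record (assumption based on the excerpt).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4}\b")

# Two records copied from the log above, wrapped as they appear in the mail.
SAMPLE = """\
02/23/2009
04:26:00|qmaster|eddie04|W|gethostbyname(frontend01.ecdf.ed.ac.uk) took 17
seconds and returns success
02/23/2009 04:26:10|qmaster|eddie04|E|error ending a transaction: (-30974)
DB_RUNRECOVERY: Fatal error, run database recovery
"""


def join_wrapped(lines):
    """Rejoin records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)          # start of a new record
        elif records:
            records[-1] += " " + line     # continuation of the previous one
        # Lines before the first record (e.g. "=====") are skipped.
    return records


def parse_record(record):
    """Split 'date time|daemon|host|level|message' into a dict."""
    timestamp, daemon, host, level, message = record.split("|", 4)
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


if __name__ == "__main__":
    for rec in join_wrapped(SAMPLE.splitlines()):
        entry = parse_record(rec)
        if entry["level"] == "E":         # show only error-level entries
            print(entry["timestamp"], entry["message"])
```

Run against the full messages file, the same two functions could be used to list everything logged in the minutes before the first "Corrupted database detected" entry.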




--Yuan


Yuan Wan
----
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: ywan at ed.ac.uk

2012 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=112562
