[GE users] Berkeley DB crash

crei crei at sun.com
Fri Feb 27 09:37:54 GMT 2009


Hi,

This might be related to network problems, because hostname resolution
took 17 seconds about ten seconds before the database error was logged:
04:26:00|qmaster|eddie04|W|gethostbyname(frontend01.ecdf.ed.ac.uk) took 17
seconds and returns success
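
To see whether lookups are still slow from the master host, something like
the following could be run there. This is only a minimal sketch in Python and
not part of SGE; the function name time_lookup and the injectable resolver
argument are illustrative assumptions. It times a single call to the system
resolver, which is roughly what the qmaster warning above reports.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.gethostbyname):
    """Return (elapsed_seconds, address_or_None) for one hostname lookup.

    The resolver argument is a hypothetical hook so the function can be
    exercised without touching the network; by default the system
    resolver is used.
    """
    start = time.monotonic()
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        address = None
    elapsed = time.monotonic() - start
    return elapsed, address


if __name__ == "__main__":
    # Host names taken from the log excerpt in this thread.
    for host in ("frontend01.ecdf.ed.ac.uk", "frontend02.ecdf.ed.ac.uk"):
        elapsed, address = time_lookup(host)
        print(f"{host}: {address} in {elapsed:.2f}s")
```

Lookups that regularly take more than a second or so would point towards a
DNS or name service problem rather than towards Berkeley DB itself.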

Is BDB running on a local file system or NFS 4?
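
If it is not obvious which file system the spooling database lives on, a
sketch like this can help. It assumes a Linux master host where /proc/mounts
is available; the helper name fs_type_for_path is an illustrative assumption,
and the spool directory has to be passed in because the actual path depends
on the installation. It picks the longest mount point that contains the given
path and reports its file system type (for example ext3, nfs or nfs4).
Symbolic links are not resolved.

```python
import os
import sys


def fs_type_for_path(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Match whole path components only, so /opt/sge does not
        # accidentally match /opt/sgeX.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    # Usage: python check_fs.py <path to the BDB spool directory>
    spool_dir = sys.argv[1]
    with open("/proc/mounts") as handle:
        print(fs_type_for_path(spool_dir, handle.read()))
```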

Regards,

Christian





On 02/23/09 11:33, ywan wrote:
> I'm running SGE 6.1u4 on my cluster, using Berkeley DB to store the
> configuration info. The database crashed early this morning and failed
> the job slots on slave nodes. sgemaster and the scheduler were migrated
> automatically after the failure, but the database is still in error.
> 
> I manually rebooted sgemaster and the scheduler on the shadow master node
> this morning, which made the queue work again.
> 
> Can you help find out why Berkeley DB crashed? Here is the log from the
> master node (eddie04).
> 
> ==============================================================================
> 
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (schedd:1) on host "eddie04.beowulf.cluster"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "schedd" with id 1 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:21667) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 17 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:41741) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 15 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:45881) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 11 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:45790) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 18 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:45969) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 27 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:45972) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 30 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:45978) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 31 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:3968) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 35 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:3969) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 36 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:42044) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 14 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:27364) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 13 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:11694) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 12 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:14425) on host "frontend01.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 16 
> deregistered
> 02/23/2009 04:25:31|qmaster|eddie04|E|acknowledge timeout after 600 
> seconds for event client (qsub:25338) on host "frontend02.ecdf.ed.ac.uk"
> 02/23/2009 04:25:31|qmaster|eddie04|I|event client "qsub" with id 19 
> deregistered
> 02/23/2009 04:26:00|qmaster|eddie04|W|gethostbyname(frontend01.ecdf.ed.ac.uk)
> took 17 seconds and returns success
> 02/23/2009 04:26:10|qmaster|eddie04|E|Corrupted database detected. Freeing 
> all resources to prepare for a reconnect with recovery.
> 02/23/2009 04:26:10|qmaster|eddie04|E|error ending a transaction: (-30974) 
> DB_RUNRECOVERY: Fatal error, run database recovery
> 02/23/2009 04:26:10|qmaster|eddie04|W|rule "default rule" in spooling 
> context "berkeleydb spooling" failed writing an object
> 02/23/2009 04:26:10|qmaster|eddie04|E|got max. unheard timeout for target 
> "execd" on host "node137.beowulf.cluster", can't delivering job "3117577"
> 02/23/2009 04:26:10|qmaster|eddie04|W|rescheduling job 3117577.1
> 
> =============================================================================
> 
> 
> 
> 
> --Yuan
> 
> 
> Yuan Wan
> ----
> Unix Section
> Information Services Infrastructure Division
> University of Edinburgh
> 
> tel: 0131 650 4985
> email: ywan at ed.ac.uk
> 
> 2012 Computing Services, JCMB
> The King's Buildings,
> Edinburgh, EH9 3JZ
> 
> 

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=115952
