[GE users] couldn't open berkeley database "sge_job"

weicfa awei1231 at yahoo.com.tw
Sun Feb 21 08:42:54 GMT 2010


Hi,
Our SGE runs in an HA environment built on Veritas Cluster software (VCS).
I upgraded SGE from 6.2u1 to 6.2u5 six days ago. It looked good until I started
$SGE_ROOT/$SGE_CELL/common/sgemaster on the redundant host while SGE was still running on the active host.
SGE hung right away, and I could not stop it with "sgemaster stop" on the active host. Finally I used "kill -9" to kill the SGE processes. Then I tried to start SGE again, but I got the following messages and could not start SGE anymore:

02/21/2010 00:41:43|  main|gridsvr1_1|E|couldn't open berkeley database "sge_job": (22) Invalid argument
02/21/2010 00:41:43|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed
02/21/2010 00:41:43|  main|gridsvr1_1|C|setup failed
02/21/2010 00:42:19|  main|gridsvr1_1|E|couldn't open berkeley database "sge_job": (22) Invalid argument
02/21/2010 00:42:19|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed
02/21/2010 00:42:19|  main|gridsvr1_1|C|setup failed
02/21/2010 00:45:28|  main|gridsvr1_1|E|couldn't open berkeley database "sge_job": (22) Invalid argument
02/21/2010 00:45:28|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed
02/21/2010 00:45:28|  main|gridsvr1_1|C|setup failed
02/21/2010 00:50:48|  main|gridsvr1_1|E|couldn't open berkeley database "sge_job": (22) Invalid argument
02/21/2010 00:50:48|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed
02/21/2010 00:50:48|  main|gridsvr1_1|C|setup failed

=============================================
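
For anyone reading a long qmaster messages file like the one above, a small script can collapse the repeated entries so the distinct errors stand out. This is only a sketch: it assumes every line has the five "|"-separated fields seen above (timestamp, thread, host, level, text), and the names parse_line and summarize are made up for illustration; they are not part of SGE.

```python
from collections import Counter


def parse_line(line):
    """Split one messages line into (timestamp, thread, host, level, text).

    Assumes the five '|'-separated fields seen in the log excerpt above.
    Returns None for lines that do not match that shape (e.g. blank lines).
    """
    parts = line.rstrip("\n").split("|", 4)  # keep any '|' inside the text
    if len(parts) != 5:
        return None
    return tuple(p.strip() for p in parts)


def summarize(lines):
    """Count how often each (level, thread, text) combination occurs."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, thread, _, level, text = parsed
        counts[(level, thread, text)] += 1
    return counts


# Sample lines copied from the log excerpt above.
sample = [
    '02/21/2010 00:41:43|  main|gridsvr1_1|E|couldn\'t open berkeley database "sge_job": (22) Invalid argument',
    '02/21/2010 00:41:43|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed',
    '02/21/2010 00:41:43|  main|gridsvr1_1|C|setup failed',
    '02/21/2010 00:42:19|  main|gridsvr1_1|E|couldn\'t open berkeley database "sge_job": (22) Invalid argument',
    '02/21/2010 00:42:19|  main|gridsvr1_1|E|startup of rule "default rule" in context "berkeleydb spooling" failed',
    '02/21/2010 00:42:19|  main|gridsvr1_1|C|setup failed',
]

for (level, thread, text), n in summarize(sample).most_common():
    print(f"{n:3d}  {level}  {thread:6s}  {text}")
```

To run it against a real file, pass the open file object to summarize() instead of the sample list.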

I didn't have a backup taken after I upgraded SGE to 6.2u5, so I copied my old 6.2u1 spooldb
directory over the crashed 6.2u5 spooldb,
because the BDB version seemed to be the same (I guess).

Then I started SGE and it looked good.

Here are my questions:

1. What is the correct action to take when this problem happens?

2. Even though I can start SGE now, I still get some strange messages in the qmaster messages file:


========================================
02/21/2010 01:52:29|schedu|gridsvr1_1|E|callback function for event "1. EVENT MOD EXECHOST cnl64lnx3" failed
02/21/2010 01:52:29|schedu|gridsvr1_1|E|can't find cluster queue all.q for update in function qinstance_update_cqueue_list
02/21/2010 01:52:29|schedu|gridsvr1_1|E|callback function for event "2. EVENT MOD QUEUE INSTANCE all.q at di64f007" failed
02/21/2010 01:52:29|schedu|gridsvr1_1|E|can't find cluster queue heavyq for update in function qinstance_update_cqueue_list
02/21/2010 01:52:29|schedu|gridsvr1_1|E|callback function for event "3. EVENT MOD QUEUE INSTANCE heavyq at di64f007" failed
02/21/2010 01:52:29|schedu|gridsvr1_1|E|element "di64f007" does not exist

02/21/2010 01:52:29|schedu|gridsvr1_1|E|callback function for event "4. EVENT MOD EXECHOST di64f007" failed
02/21/2010 01:54:07|listen|gridsvr1_1|E|commlib error: got read error (closing "di64f007/execd/1")

============================================

I am worried about these messages even though SGE seems to be running normally. Should I upgrade SGE again? I still have my 6.2u1 backup configuration directory.

It's very urgent!
I would very much appreciate any help.
Thanks!

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245316
