[GE issues] [Issue 2879] Berkeley DB corruption

crei crei at sun.com
Mon Feb 2 13:03:50 GMT 2009


Magawake,

you wrote:
...
The issue is very hard to reproduce. It subsided when we implemented locking on our filesystem. We are using
ClusterFS (clusterfs.com). Our professor suggested using a localfile-like feature, and after that the database
corruption stopped.
...

I suggest using "classic spooling" when installing qmaster on ClusterFS.
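
To see which spooling method an existing installation was set up with, a
minimal sketch along the following lines may help. It assumes the method is
recorded as a "spooling_method" line in $SGE_ROOT/$SGE_CELL/common/bootstrap
and that the cell name defaults to "default"; please verify both against your
own installation. The function name is hypothetical and not part of Grid Engine.

```python
import os
from pathlib import Path
from typing import Optional


def read_spooling_method(bootstrap_text: str) -> Optional[str]:
    """Return the value of the spooling_method key, or None if it is absent.

    Assumes a bootstrap file made of whitespace-separated "key value" lines,
    with '#' starting a comment line.
    """
    for raw_line in bootstrap_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spooling_method":
            return parts[1].strip()
    return None


if __name__ == "__main__":
    # Assumed location of the bootstrap file; adjust for your site.
    sge_root = os.environ.get("SGE_ROOT")
    sge_cell = os.environ.get("SGE_CELL", "default")
    if sge_root is None:
        print("SGE_ROOT is not set; cannot locate the bootstrap file")
    else:
        bootstrap = Path(sge_root) / sge_cell / "common" / "bootstrap"
        print(read_spooling_method(bootstrap.read_text()))
```

If this reports "berkeleydb" while the spool directory sits on ClusterFS, that
would match the setup discussed in this thread.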

Regards,

Christian



On 01/31/09 05:21, magawake wrote:
> Thanks.
> 
> Yes, I am using 6.1 U5. I will turn off schedd_job_info, which is currently
> set to true.
> 
> 
> On Fri, Jan 30, 2009 at 8:55 AM, crei <crei at sun.com> wrote:
>> Hi,
>>
>> Sorry, this does not help too much. Since you said that the database
>> problem is gone when using local spool directories, I assume you had no
>> NFSv4-mounted directory. BDB spooling on a shared directory needs
>> NFSv4.
>>
>> You're using 6.1u5, right?
>> Have you installed all recommended patches on your qmaster host?
>> Do you have schedd_job_info set to "true" (qconf -ssconf)?
>> If yes, does it help to set it to "false"?
>>
>> Regards,
>>
>> Christian
>>
>>
>> magawake wrote:
>>> Exactly.
>>>
>>> I managed to find a log and a strace.
>>>
>>> See this in the logs:
>>> 01/28/2009 10:15:25|qmaster|qmstr-host|I|event client "qsub" with id 16 deregistered
>>> 01/28/2009 10:15:29|qmaster|qmstr-host|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "qmstr-host.engr.unc.edu"
>>> 01/28/2009 10:15:29|qmaster|qmstr-host|I|event client "schedd" with id 1 deregistered
>>>
>>>
>>> Here is the strace:
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>> futex(0xb7bc314, FUTEX_WAKE_OP, 1, 1, 0xb7bc310, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}) = 1
>>> futex(0xb7bc2b0, FUTEX_WAKE, 1)         = 1
>>>
>>> clock_gettime(CLOCK_REALTIME, {1233156119, 551724000}) = 0
>>> futex(0xb7bb9e4, FUTEX_WAIT, 256979, {0, 999980000}) = -1 ETIMEDOUT (Connection timed out)
>>> futex(0xb7bb980, FUTEX_WAKE, 1)         = 0
>>> clock_gettime(CLOCK_REALTIME, {1233156120, 553347000}) = 0
>>> futex(0xb7bb9e4, FUTEX_WAIT, 256981, {0, 999894000}) = -1 ETIMEDOUT (Connection timed out)
>>> futex(0xb7bb980, FUTEX_WAKE, 1)         = 0
>>> clock_gettime(CLOCK_REALTIME, {1233156121, 555635000}) = 0
>>> futex(0xb7bb9e4, FUTEX_WAIT, 256983, {0, 999531000}) = -1 ETIMEDOUT (Connection timed out)
>>> futex(0xb7bb980, FUTEX_WAKE, 1)         = 0
>>> clock_gettime(CLOCK_REALTIME, {1233156122, 556880000}) = 0
>>> futex(0xb7bb9e4, FUTEX_WAIT, 256985, {0, 999977000}) = -1 ETIMEDOUT (Connection timed out)
>>> futex(0xb7bb980, FUTEX_WAKE, 1)         = 0
>>> clock_gettime(CLOCK_REALTIME, {1233156123, 559724000}) = 0
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=100300
>>>
>>> To unsubscribe from this discussion, e-mail: [issues-unsubscribe at gridengine.sunsource.net].
>>>
>>
>> --
>> Sun Microsystems GmbH             Christian Reissmann
>> Dr.-Leo-Ritter-Str. 7             Software Engineer
>> D-93049 Regensburg                Phone: +49 (0)941 3075 112
>> Germany                           Fax:   +49 (0)941 3075 222
>> http://www.sun.de                 mailto: Christian.Reissmann at sun.com
>>                                  http://www.sun.com/gridengine
>> Sitz der Gesellschaft:
>> Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
>> Amtsgericht Muenchen: HRB 161028
>> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
>> Vorsitzender des Aufsichtsrates: Martin Haering
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=100588
>>
>>
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=100757
> 


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=101350



