[GE issues] [Issue 3198] Deleting several simple sleeper jobs with qdel can cause queues going to ERROR or corrupt BDB server

pollinger harald.pollinger at sun.com
Mon Nov 30 19:19:01 GMT 2009


User pollinger changed the following:

                What    |Old value                 |New value
                  Status|NEW                       |RESOLVED
              Resolution|                          |FIXED

------- Additional comments from pollinger at sunsource.net Mon Nov 30 11:18:58 -0800 2009 -------
This is not a classical deadlock. Sifting through the in-core data of the RPC server process and the database, I found that we have two
cursor operations running:
(i) One cursor is open for qmaster and is used to delete records. That makes sense, since we are deleting jobs. This cursor is wrapped in a
transaction. Fine.

(ii) A second cursor operation is active for the spooledit process. This cursor is not wrapped in a transaction. Not fine.

The problem here is that BDB can't cope with a mix of transaction-protected and non-transaction-protected cursor operations when they
run in the context of a single thread of control. And that is the case for the BDB RPC server.

Transaction-protected cursor operations need to hold their locks until the transaction commits.
Hence, when a non-transactional cursor comes in, there is a good chance of a lock conflict. In our case, the non-transactional cursor
operation blocks. This renders the RPC server useless, and with it qmaster, and prevents us from pushing the transaction forward. We are stuck.
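
The stall described above can be illustrated with a small toy model. This is only a sketch and not the Berkeley DB API: the names
ToyLockTable and serve, the locker ids and the record key "job.1" are all hypothetical. It assumes a server that handles requests strictly
one after another, and it reports the request it would block on instead of actually hanging.

```python
# Toy model only: none of these names come from Berkeley DB or Grid Engine.

class ToyLockTable:
    """Exclusive write locks on record keys, each owned by a locker id."""

    def __init__(self):
        self.write_locks = {}  # record key -> locker id holding the write lock

    def acquire_write(self, key, locker):
        owner = self.write_locks.get(key)
        if owner is not None and owner != locker:
            return False  # someone else holds the lock
        self.write_locks[key] = locker
        return True

    def can_read(self, key, locker):
        owner = self.write_locks.get(key)
        return owner is None or owner == locker

    def release_all(self, locker):
        # A commit releases every lock held by that transaction.
        self.write_locks = {
            k: o for k, o in self.write_locks.items() if o != locker
        }


def serve(requests):
    """Handle requests strictly in order, like a single thread of control.

    Returns (handled, stuck_on). stuck_on is the request the server would
    block on forever, or None if every request completed.
    """
    locks = ToyLockTable()
    handled = []
    for request in requests:
        client, op, key = request
        if op == "delete" and not locks.acquire_write(key, client):
            return handled, request
        if op == "read" and not locks.can_read(key, client):
            # The only thread is now waiting, so the commit queued behind
            # this request is never processed and the lock is never freed.
            return handled, request
        if op == "commit":
            locks.release_all(client)
        handled.append(request)
    return handled, None


requests = [
    ("qmaster-txn", "delete", "job.1"),  # transactional cursor, keeps its lock
    ("spooledit", "read", "job.1"),      # non-transactional cursor
    ("qmaster-txn", "commit", None),     # would release the lock
]
handled, stuck_on = serve(requests)
print("handled: ", handled)
print("stuck on:", stuck_on)
```

In this model the server stops at the spooledit read, and the qmaster commit that would free the lock is never reached. If the commit is
moved ahead of the read, all three requests complete.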

I checked the spooledit code. It appears that all operations (with the exception of the list option) are wrapped in transactions. I'll
build a test version with transaction protection for the list option and see what happens. Should work!

