Database outage Sat 6th Feb 2010

February 7, 2010
by David Mytton

Yesterday we experienced an outage of our server monitoring service, Server Density, lasting approximately 45 minutes. The problem was resolved and we are now implementing a number of changes to prevent a recurrence.

Technical Details

We are currently investigating what appears to be a bug within MongoDB that causes indexes to become corrupt on a single collection within a database. This only affects the collection storing the process list, and it means we are unable to delete items from it. End-user impact is therefore nil, but it affects our backend processes such as data retention. The MongoDB developers are trying to reproduce the bug; in the meantime, the interim fix is to drop the indexes and recreate them.
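As an illustration of the interim fix, here is a minimal sketch of dropping and recreating the indexes on one collection. It is not the exact procedure we ran: the function name `rebuild_indexes` is hypothetical, and it assumes a collection object exposing `index_information()`, `drop_index()` and `create_index()` as the current PyMongo driver does.

```python
def rebuild_indexes(collection):
    """Drop every secondary index on `collection`, then recreate it.

    Assumes a PyMongo-style collection object (hypothetical helper,
    not the procedure used during the incident).
    """
    # index_information() maps index name -> {"key": [(field, direction), ...], ...}
    saved = {
        name: info
        for name, info in collection.index_information().items()
        if name != "_id_"  # the default _id index cannot be dropped
    }

    # Drop first, so the rebuild starts from a clean state.
    for name in saved:
        collection.drop_index(name)

    rebuilt = []
    for name, info in saved.items():
        # Carry over only the options this sketch knows about.
        options = {k: v for k, v in info.items() if k in ("unique", "sparse")}
        collection.create_index(list(info["key"]), name=name, **options)
        rebuilt.append(name)
    return rebuilt
```

On a large collection the recreate step is where the blocking occurs, so it is worth running this against one collection at a time.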

Yesterday, this was done on a single but large collection. The expected impact was a short period of blocking (affecting a small percentage of users due to our split database structure) while the indexes were rebuilt. This completed as expected, but immediately afterwards the entire MongoDB instance started throwing “too many files open” errors. This caused the instance to go down.
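Errors of this kind generally mean a process has reached its open file descriptor limit. As a rough sketch of how that headroom could be checked on Linux (this is an assumption about tooling, not something we had in place; the function names are hypothetical), the `/proc` filesystem exposes both the descriptors in use and the limit:

```python
import os


def parse_soft_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: Max open files <soft> <hard> <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid="self"):
    """Return (descriptors_in_use, soft_limit) for a process, Linux only.

    Returns None if /proc is unavailable or the process cannot be read.
    """
    try:
        in_use = len(os.listdir(f"/proc/{pid}/fd"))
        with open(f"/proc/{pid}/limits") as fh:
            soft = parse_soft_limit(fh.read())
    except OSError:
        return None
    return in_use, soft
```

Passing the database process's PID instead of the default `"self"` would report on that process, provided the caller has permission to read its `/proc` entries.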

We have experienced this error before and it resolved itself after a few minutes. However, this time it persisted long enough for us to decide to fail over to our off-site replicated slave. Unfortunately this was further delayed because we were waiting for the primary MongoDB instance to quit, but a bug in the network layer prevented this from happening. Further, once we had failed over, our checks picked up some inconsistencies with the replicated database, which meant that data for the previous 24-48 hours was unavailable for some users.

In the meantime, working with the MongoDB emergency support team, we resolved the problem on the database master and decided to bring that back online instead of remaining on the problematic failover.

Subsequent actions

We have submitted a full incident report to the MongoDB team, and a number of changes will be implemented immediately:

  • Previously, our failover mechanism was entirely manual. This has been replaced with the MongoDB replica pair system. Our servers in multiple datacentres now continually communicate so that in the event of an issue, failover will occur immediately and automatically. This was not implemented sooner because it requires a period of downtime, so this outage presented a good opportunity to set it up.
  • Non-blocking index building is now available in the latest MongoDB development code and we will be deploying this once it is released in the stable branch.
  • The bug affecting closing the network layer has been fixed and will also be deployed once it is released as stable.
  • Investigation will continue into the index corruption issue.
  • We are also in the process of planning a new server architecture to incorporate more automated failover.
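To illustrate the general idea behind automated failover, here is a minimal sketch of the decision logic: fail over only after several consecutive failed health checks, so that a single blip does not trigger it. This is not how MongoDB replica pairs work internally; the class name `FailoverMonitor` and the threshold of 3 are assumptions made for the example.

```python
class FailoverMonitor:
    """Track health-check results and signal when failover is warranted.

    Hypothetical illustration only; the threshold is an arbitrary choice.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result; return True if failover should start."""
        if healthy:
            # Any successful check resets the count.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

Requiring consecutive failures trades a slightly slower reaction for fewer unnecessary failovers, which matters when the replica itself may be lagging or inconsistent.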

I’d like to apologise for this outage. If you have any questions then please get in touch.
