About this week’s downtime

Written By Juan Carlos Muriente on January 30th, 2010

This past week Tweetboard experienced a downtime period that extended over four days. Given the service level we have committed to provide, we believe that an explanation is owed to our loyal users.

For a long time we have been planning to upgrade our database to use a table partitioning approach, which would help us scale our storage hardware better. We scheduled the upgrade for this past weekend.

The first stage went without an issue: the database was updated quickly and easily, with less than 5 minutes of downtime while all services were restarted. On Saturday, however, the partitioning work required us to drop the primary key index on our messages table and recreate it as a regular index. During that step, myisamchk left us with a zero-byte MYI file for the messages table. This was a major problem, as it meant that all of the messages (over 200 million posts) that we had built up over the past 7 months were gone. The same file was also corrupted in our database backup, which was being moved at the same time due to space constraints.

A professional data recovery procedure would have required either physically removing the hardware and sending it out, or a remote recovery attempt, which would have cost $4K to $9K. Our best option was to attempt to recover the data ourselves.

We then tried various “disk forensic” file recovery tools, including “Foremost”, “TestDisk”, “PhotoRec”, “ext3undelete”, “sleuthkit” and others, but none were able to fully recover the data, because the file was so large, fragmented, and missing its header/footer information.

We turned to manual analysis of the partition and even took the time to learn the low-level ins and outs of the ext3 filesystem. We had unmounted the drive as soon as we found out about the corruption, which kept most of the blocks from being overwritten. That allowed us to reconstruct the chain of blocks, since we knew the location, size, and last deletion time of the file.
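To give a rough idea of why knowing the file size helps with this kind of reconstruction, here is a minimal sketch of the block arithmetic for ext3's classic indirect-block layout (12 direct pointers in the inode, then single, double, and triple indirect blocks holding 32-bit block numbers). It is an illustration only, not the procedure we actually ran: the function name `block_layout` is made up for this example, and the 4 KiB block size is an assumption that should be checked against the real filesystem.

```python
# Illustrative sketch only: how many blocks a file of a given size occupies
# under ext3's classic (non-extent) block map. Names here are hypothetical.

PTR_SIZE = 4      # ext3 stores block numbers as 32-bit values
DIRECT_PTRS = 12  # direct block pointers held in the inode itself


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def block_layout(file_size, block_size=4096):
    """Return (data_blocks, indirect_blocks) for a file of file_size bytes.

    block_size=4096 is an assumption; ext3 also supports 1 KiB and 2 KiB.
    """
    ptrs = block_size // PTR_SIZE            # pointers per indirect block
    data_blocks = ceil_div(file_size, block_size)
    remaining = max(0, data_blocks - DIRECT_PTRS)
    indirect_blocks = 0

    # Single indirect: one block of pointers to data blocks.
    single = min(remaining, ptrs)
    if single:
        indirect_blocks += 1
    remaining -= single

    # Double indirect: one top block plus one pointer block per `ptrs` data blocks.
    double = min(remaining, ptrs * ptrs)
    if double:
        indirect_blocks += 1 + ceil_div(double, ptrs)
    remaining -= double

    # Triple indirect: one top block, a middle level, and a leaf level.
    triple = min(remaining, ptrs ** 3)
    if triple:
        leaf = ceil_div(triple, ptrs)        # blocks pointing at data
        middle = ceil_div(leaf, ptrs)        # blocks pointing at leaf blocks
        indirect_blocks += 1 + middle + leaf
    remaining -= triple

    if remaining:
        raise ValueError("file too large for this block size")
    return data_blocks, indirect_blocks


if __name__ == "__main__":
    # A 13-block file needs its 13th block reached through one indirect block.
    print(block_layout(13 * 4096))
```

With numbers like these in hand, a partially recovered chain can be sanity-checked: the count of data blocks and pointer blocks found on disk should not exceed what the known file size implies.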

By Monday morning we had managed to recover 93% of the corrupt file, which in turn meant that approximately 90% of the actual database could be recovered. This was brilliant news, but there was still a lot of work to be done, and there was still a possibility that the database would not validate.

The team continued to press forward, and yesterday (Thursday, 28th Jan) the recovery process was completed, so we could resume the original task of partitioning the database and updating the Tweetboard scripts to work with it. Finally, yesterday Tweetboard was brought back up and there was only one task left to do: write this blog post and tell you our story.

What do we expect moving forward? Well, the database is back up, as is Tweetboard, but we did still lose that 7-10%, so we will be running background scripts over the next few days to recover all tweets that were lost.

We thank you all for your support and we do apologize for the downtime.

  • crysisnews
    Tweetboard, I requested an invitation for my web site, crysisnews.com, but the invitation has not arrived yet. Can it be accepted, please? :/ http://www.crysisnews.com/#tboard
  • Hi,

    Invites can take between 1 and 14 days to be accepted. I can see your request has been recorded, and you will be contacted when you are accepted.

    Thank you

    Ceri
  • juicycolors
    Cool, I am waiting for the confirmation!
  • sisterstalk
    It appears you're still having problems. My Tweetboard hasn't updated in over 21 hours.
  • Pulling of new tweets is unrelated to the downtime. The problem you are having seems to be specific to your account and related to Twitter auth codes. As advised via Tweetboard, please make sure you refresh the page, then log in and authorise the Tweetboard app using your main Twitter username; without this we cannot pull new tweets.
  • Corey Ward
    Ouch. Note to TweetBoard: back up your database before making schema changes. I like to keep another db server ready and synced at all times; then, if changes break something, the application can roll over to the mirrored server with hardly any data loss while the primary database is recovered.
  • Due to space constraints, the backup was done simultaneously with the first step, as we expected any errors to come later on in the process. Lesson learned, and we will always assume the worst with every change from now on.
  • Irwin
    Seriously, you should move your database to a cloud service like Google App Engine or S3 to avoid such issues. Why do you store this in an old-fashioned SQL instance? It's just more expensive, hard to maintain and prone to errors. If more people use your service, you really need to change your infrastructure.
  • It has been considered but we couldn't get the same speed or I/O as we do with the SSD drives.
  • Uncle_Unix
    Good save! Been there, done that a few too many times for customers.

    Thanks for all
  • Sorry for the delay in replying to you; Disqus had moderated your comment for some reason. It was a *fun* ordeal, but it is one we have learned from.
  • Well, I guess we have to thank you for all the hard work in getting everything back up and running again.
  • It is our pleasure. We just thought we owed you all an explanation, and hopefully the post above does just that.
  • Bill Consley
    Thank you for sharing this with us, I can see it was a very hard process and I am glad to have tweetboard back.