S3 glitch causes Tarsnap outage
Important details first: Between 2010-09-16 11:03:42 and 14:03:14 UTC the Tarsnap online backup service was unavailable. Attempts to use the tarsnap client software during that period resulted in "Too many network failures" / "Error communicating with server" failures. No data was lost; it merely became temporarily inaccessible.

This was caused by a glitch in Amazon S3, which the Tarsnap server code uses as backing storage for a log-structured filesystem. While performing routine "cleaning" (a necessary operation in any log-structured filesystem), the Tarsnap server code attempted to read from S3 an object which was stored on August 21st (26 days ago); and received an erroneous "404 Not Found" response from S3.
Because the Tarsnap server code knew perfectly well that the object it requested from S3 should have been there, an assertion failed and the Tarsnap server was automatically shut down -- a basic preventative "if things seem to be broken, make sure we don't make anything worse until we're sure we know what's going on" measure. As it turns out, the object was still on S3, and nothing was broken except (apparently) the S3 node which sent the 404 responses.
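To make the fail-fast logic concrete, here is a minimal sketch in C of the kind of check involved -- invented for illustration, not the actual Tarsnap source; s3_get() and the object key are stand-ins for the real request path:

```c
/* Hypothetical sketch of a fail-fast read path; s3_get() and the key
 * name are stand-ins, not the real Tarsnap internals. */
#include <stdio.h>
#include <stdlib.h>

/* Stub: a real version would perform the S3 GET and return the HTTP
 * status code (200, 404, 500, ...). */
static int
s3_get(const char *key)
{
	(void)key;
	return (404);	/* simulate the erroneous "Not Found" */
}

static void
read_known_object(const char *key)
{
	int status = s3_get(key);

	if (status == 200)
		return;
	/*
	 * This object was written earlier and never deleted; if S3 says
	 * it is gone, either S3 or our own metadata is broken.  Refuse
	 * to continue rather than risk compounding the damage.
	 */
	fprintf(stderr, "expected object %s: got HTTP %d; shutting down\n",
	    key, status);
	exit(1);
}

int
main(void)
{
	read_known_object("example-object-key");
	return (0);
}
```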
After diagnosing the problem, I brought the Tarsnap service back online; and I have patched the Tarsnap server code to treat similar errors as transient rather than fatal, in order to reduce the probability of such an S3 glitch causing problems for Tarsnap in the future. While the original "panic and shut down" response turned out to be an over-reaction in this case, I'm glad that I erred in that direction rather than the opposite: When backups are at issue, I will always pick durability over availability.
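The patched behaviour amounts to bounding how hard the server insists before it panics. As a hypothetical drop-in replacement for read_known_object() in the sketch above (the retry count and delay are invented values, not the real Tarsnap settings):

```c
/* Hypothetical sketch of the patched read path: an unexpected 404 is
 * retried a few times before being treated as fatal. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_RETRIES	5	/* attempts before declaring failure */
#define RETRY_DELAY	10	/* seconds between attempts */

static void
read_known_object(const char *key)
{
	int i, status;

	for (i = 0; i < MAX_RETRIES; i++) {
		status = s3_get(key);
		if (status == 200)
			return;
		/*
		 * A 404 for an object we know we stored may just be a
		 * confused S3 frontend; a 5xx or connection error is an
		 * ordinary transient failure.  Either way, wait and try
		 * again, ideally over a fresh connection.
		 */
		sleep(RETRY_DELAY);
	}

	/* Still missing after repeated attempts: fall back to fail-fast. */
	fprintf(stderr, "object %s persistently missing; shutting down\n", key);
	exit(1);
}
```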
I also sent details concerning this glitch to Amazon, and the AWS Service Health Dashboard now states that:
Between 3:45am and 4:06am PDT this morning, a subset of requests to our East Coast facilities returned 'Internal Server Error' or 'Not Found' error responses. These responses were generated due to an issue in the Amazon S3 frontend. Object storage was at no time affected by this issue. By following the best practice of re-resolving DNS and re-connecting to Amazon S3 the objects could be retrieved. The service is now operating normally.
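In practice, the "re-resolve DNS and re-connect" advice means not reusing a cached address for a misbehaving frontend: the S3 hostname maps to many frontends, so a fresh lookup and a fresh connection will often land somewhere healthy. A rough sketch of what that looks like (the function name and error handling are mine, not anything from the S3 documentation):

```c
/* Hypothetical sketch: open a fresh TCP connection to the named host,
 * resolving it anew rather than reusing a cached address.
 * Returns a connected socket, or -1 on failure. */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
reconnect(const char *host, const char *port)
{
	struct addrinfo hints, *res, *ai;
	int s = -1;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	/* Re-resolve: a new lookup may return a different frontend. */
	if (getaddrinfo(host, port, &hints, &res) != 0)
		return (-1);
	for (ai = res; ai != NULL; ai = ai->ai_next) {
		s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
		if (s == -1)
			continue;
		if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0)
			break;
		close(s);
		s = -1;
	}
	freeaddrinfo(res);
	return (s);
}
```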
Moral of the story: Trust but verify. No matter how fabulous the designers of a system are, you probably shouldn't trust it to get things right 100% of the time -- not even if it claims to have lost your data!