The S3 SLA is broken (and how to fix it)
At approximately 2009-01-14 05:26, Amazon's Simple Storage Service suffered some form of internal failure, resulting in a sharp increase in the rate of request failures. According to Amazon, there were "increased error rates"; according to my logs, 100% of the PUT requests the tarsnap server made to S3 failed. For somewhat more than half an hour (I don't know the exact duration) it was impossible for the tarsnap server to store any data to S3, effectively putting it out of service as far as storing backups was concerned; and presumably other S3 users met a similar fate.
At approximately 2009-01-16 15:20, the S3 PUT error rate jumped from its usual level of less than 0.1% up to roughly 1%; and as I write this, the error rate remains at that elevated level. However, the tarsnap server, like all well-designed S3-using applications, retries failed PUTs, so aside from a very slight increase in effective request latency, this prolonged period of elevated error rates has had no effect on tarsnap whatsoever; nor, presumably, has it had any significant impact on any other well-designed S3-using applications.
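(Since retrying is exactly what makes the second kind of failure harmless, here's a minimal sketch of the sort of retry loop I mean, in Python; the do_put callable and the backoff constants are hypothetical stand-ins, not tarsnap's actual code.)

```python
import random
import time

def put_with_retries(do_put, max_attempts=8):
    """Attempt an S3 PUT, retrying on failure with exponential backoff.

    do_put is a hypothetical callable which performs a single PUT and raises
    an exception if S3 returns an error; the constants are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return do_put()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Back off for an exponentially growing, jittered interval: a 1%
            # error rate costs only a little extra latency, but a sustained
            # 100% error rate exhausts any finite number of retries.
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
```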
According to the S3 Service Level Agreement, these two outages -- one which rendered applications inoperative for half an hour, and the other which had little or no impact -- are equal in severity.
This peculiar situation is caused by the overly simplistic form which the SLA takes: It provides a guarantee on the average Error Rate, completely neglecting to consider the fact that -- given that applications can retry failures -- the impact of errors is a very non-linear function of the error rate. I observed no outages in S3 during December 2008, yet even without using tricks which can be used to artificially raise the computed error rate, the occasional failures which result from S3's nature as a distributed system -- failures which occur by design -- were enough that the error rate I experienced (as computed in accordance with the SLA) was 0.098% -- just barely short of the 0.1% which would have triggered a refund. At the same time, 0.1% of a month is 40-44 minutes (depending on the number of days in the month), so if S3 failed completely for 30 minutes but every request made outside of that interval succeeded, nobody would get a refund under the SLA.
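To put numbers on that last point, here's the back-of-the-envelope arithmetic in Python (assuming, for simplicity, a 30-day month and a constant request rate, so that averaging per-interval error rates reduces to averaging over time):

```python
MINUTES_PER_MONTH = 30 * 24 * 60    # 43200 minutes in a 30-day month

# Scenario 1: a complete outage (100% errors) for 30 minutes; every request
# made outside that interval succeeds.
outage = 30 * 1.00 / MINUTES_PER_MONTH
print("30-minute total outage: average error rate = %.3f%%" % (100 * outage))
# -> 0.069%, below the 0.1% threshold: no refund.

# Scenario 2: a (harmless, given retries) 1% error rate sustained all month.
elevated = 0.01
print("Month-long 1%% error rate: average error rate = %.3f%%" % (100 * elevated))
# -> 1.000%, well over the threshold: refund.
```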
Put simply, the design of the SLA results in refunds being given in response to harmless failures, yet not being given in response to harmful failures: The wrong people get refunds.
If I were in charge at Amazon, I would adjust the S3 SLA as follows:
Definitions
- "Failed Request" means: A request for which S3 returned either an "InternalError" or a "ServiceUnavailable" error status.
- "Non-GET Request" means: A request other than a GET request, e.g., a PUT, COPY, POST, LIST, or DELETE request.
- "Severely Errored Interval" for an S3 account means: A five-minute period, starting at a multiple of 5 minutes past an hour, during which either
- At least 5 GET requests associated with the account are Failed Requests, and the number of GET requests associated with the account which are Failed Requests is more than 0.5% of the total number of GET requests associated with the account; or
- At least 5 Non-GET Requests associated with the account are Failed Requests, and the number of Non-GET Requests associated with the account which are Failed Requests is more than 5% of the total number of Non-GET Requests associated with the account.
- "Monthly Uptime Percentage" means: 100% minus the number of Severely Errored Intervals divided by the total number of five-minute periods in the billing cycle (i.e., 288 times the number of days).
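To make the proposal concrete, here is a minimal sketch in Python of how these definitions could be evaluated; the request-log format (a list of (is_get, failed) pairs per five-minute interval, already bucketed by interval and account) is a hypothetical stand-in for whatever Amazon's billing systems actually record:

```python
def is_severely_errored(requests):
    """Decide whether one five-minute interval is a Severely Errored Interval.

    requests is a hypothetical list of (is_get, failed) boolean pairs for a
    single account's requests within the interval.
    """
    get_total = sum(1 for is_get, _ in requests if is_get)
    get_failed = sum(1 for is_get, failed in requests if is_get and failed)
    nonget_total = len(requests) - get_total
    nonget_failed = sum(1 for is_get, failed in requests if not is_get and failed)

    # At least 5 failed GETs AND more than 0.5% of GETs failed, or
    # at least 5 failed non-GETs AND more than 5% of non-GETs failed.
    return ((get_failed >= 5 and get_failed > 0.005 * get_total) or
            (nonget_failed >= 5 and nonget_failed > 0.05 * nonget_total))

def monthly_uptime_percentage(intervals, days_in_cycle):
    """100% minus the fraction of the cycle's 288-per-day five-minute
    intervals which were Severely Errored."""
    bad = sum(1 for interval in intervals if is_severely_errored(interval))
    return 100.0 * (1.0 - bad / (288.0 * days_in_cycle))

# Example: 4 failed GETs out of 200 is a 2% error rate, but with fewer than
# 5 failures the interval is not Severely Errored -- which is what keeps
# low-volume trickery from manufacturing SLA credits.
assert not is_severely_errored([(True, True)] * 4 + [(True, False)] * 196)
```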
Three notes are in order here:
- The use of Severely Errored Intervals as a metric in place of simply computing the average Error Rate would distinguish the low baseline rate of errors which result from S3's design (and are mostly harmless) from the exceptional periods where S3's error rate spikes upwards (often, but not always, to 100%). In so doing, this change would make it possible to increase the guaranteed Monthly Uptime Percentage without increasing the number of SLA credits given.
- I distinguish between GET failures and non-GET failures for two simple reasons: First, GET failures are far less common, so it wouldn't hurt Amazon to offer a strengthened guarantee for GETs; and second, in many situations a GET failure is more problematic than a PUT failure -- not least because web browsers downloading public files from S3 don't automatically retry failed requests.
- The dual requirement that more than 5% (or 0.5% for GETs) of requests fail AND that there be at least 5 failed requests makes it extremely unlikely that tricks for inflating the Error Rate could be used to artificially push an interval across the threshold required to qualify as Severely Errored.
Now, I don't expect Amazon to adopt this suggestion overnight, and I suspect that even if they are inspired to fix the SLA they'll do it in such a way that the result is at most barely recognizable as being related to what I've posted here; but I hope this will at least spark some discussions about making the set of people who receive SLA credits better reflect the set of people affected by outages.
And Amazonians -- I know you're going to be reading this, since I logged hundreds of you reading my last post about the S3 SLA -- if this does open your eyes a bit, could you let me know? It's always a bit unsettling to see a deluge of traffic coming from an organization but not to hear anything directly. :-)
Tarsnap news
It has been just over two months since I opened the tarsnap beta to the public, and I've been busy making improvements to tarsnap -- a few bug fixes, but mostly added features. While some of the features I've added to tarsnap recently resulted from suggestions I've received in the past two months, the majority are things I've had planned for a long time; but even with those, input I've received from tarsnap beta testers has been useful, as many of the features I've added were much lower priorities until I started getting lots of emails asking for them.
A detailed listing of the improvements in individual versions of the tarsnap client code is available from the tarsnap beta client download page (and an even more detailed listing can be produced using diff), but in my eyes the most important changes in the past two months are:
- Tarsnap now accepts a new --dry-run option when operating in -c (archive creation) mode. This allows you to see how much space and bandwidth would be used to archive some files without tarsnap ever contacting the server (and thus, before you spend any money); it can also be useful for verifying that --include and --exclude options are correct.
- Tarsnap also accepts new --maxbw-rate, --maxbw-rate-up, and --maxbw-rate-down options for limiting the bandwidth used, in bytes per second. I think this was the most commonly requested tarsnap feature -- mostly by people with slow internet connections, who want to make sure that tarsnap doesn't overly congest their uplink.
- A bug in the handling of hardlinked files is now fixed. Under certain conditions, when creating an archive containing a hardlinked file, tarsnap would exit with an error message of "Skip length too long". This was a very hard bug to track down: It was first reported back in June, but at the time neither I nor the reporter could manage to reproduce it; and it was only two weeks ago that it was triggered for the second time, by someone else (at which point we did manage to reproduce it).
- Tarsnap is now far more portable. Two months ago, tarsnap had been tested on FreeBSD, OS X, and a small number of Linux distributions; since then, I have made changes necessary to get tarsnap running on NetBSD, OpenSolaris, and Cygwin, and confirmed that tarsnap works on OpenBSD and a much-increased range of Linux distributions.
- Tarsnap payments are now processed automatically -- tarsnap users no longer need to wait for me to manually handle their payments. I think this was the second most often requested feature, after bandwidth rate limiting; and while mucking about with PayPal's Instant Payment Notification protocol left me feeling rather unclean, I'm certainly glad that this is automated now. As part of automating this process, the method by which tarsnap users make payments has changed slightly; instead of sending me money directly via PayPal, users must log into the tarsnap account management interface and follow links from there. Payments are still handled via PayPal -- that part of things hasn't changed -- but by having people log in to their tarsnap account first, I avoid the "where does this money belong, exactly?" head-scratching which would make automatic payment processing impossible.
- Finally, a matter of security -- at least for those of us who are particularly paranoid. Between a Certificate Authority selling bogus SSL certificates, the creation of a rogue Certificate Authority certificate by exploiting an MD5 collision, and an OpenSSL bug which could result in invalid certificates being treated as valid, there are reasons to question the security of SSL -- which immediately implies questioning the integrity of code distributed over SSL. As a result, starting with tarsnap 1.0.19, I am distributing a GPG-signed file with the SHA256 hashes of the tarsnap source code tarballs. I have created a special-purpose tarsnap code signing key for this, which is itself signed with the new GPG key which I created a few days ago. If you don't already use GPG (or PGP), I recommend that you start now.
While tarsnap is much better now than it was two months ago, I still have a long list of improvements waiting to be made -- and I'm sure there is an even longer list of improvements which I haven't thought of and nobody has asked for yet. So go get started with tarsnap and then send me your ideas for how to make it better!
New GPG key
After using the same GPG key for over five years, I decided that it was time to create a new key. I still hold the private part of my old GPG key, but I will not be using it to sign anything in the future; and when it is necessary for people to send me encrypted email, I would prefer that they use my new GPG key.
My new GPG key is signed by my old key and by the FreeBSD Security Officer key, and can be downloaded here.
I have also generated a revocation certificate for my old key, which can be downloaded here.