S3 glitch causes Tarsnap outage
Important details first: Between 2010-09-16 11:03:42 and 14:03:14 UTC the Tarsnap online backup service was unavailable. Attempts to use the tarsnap client software during that period resulted in "Too many network failures" / "Error communicating with server" failures. No data was lost; it merely became temporarily inaccessible.
This was caused by a glitch in Amazon S3, which the Tarsnap server code uses as backing storage for a log-structured filesystem. While performing routine "cleaning" (a necessary operation in any log-structured filesystem), the Tarsnap server code attempted to read from S3 an object which was stored on August 21st (26 days ago); and received an erroneous "404 Not Found" response from S3.
Because the Tarsnap server code knew perfectly well that the object it requested from S3 should have been there, an assertion failed and the Tarsnap server was automatically shut down -- a basic preventative "if things seem to be broken, make sure we don't make anything worse until we're sure we know what's going on" measure. As it turns out, the object was still on S3, and nothing was broken except (apparently) the S3 node which sent the 404 responses.
After diagnosing the problem, I brought the Tarsnap service back online; and I have patched the Tarsnap server code to allow it to treat similar errors as transient rather than fatal in order to reduce the probability of such a glitch in S3 causing problems for Tarsnap in the future. While the original "panic and shut down" response turned out to be an over-reaction in this case, I'm glad that I erred in that direction rather than the opposite: When backups are at issue, I will always pick durability over availability.
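To make the "transient rather than fatal" idea concrete, here is a minimal Python sketch. It is not the actual Tarsnap server patch, and every name in it (`ObjectMissing`, `read_expected_object`, the `fetch` callable, the retry count and delays) is a hypothetical chosen for illustration. It retries a read of an object which our own records say must exist, and only gives up, preserving the original "stop and investigate" behaviour, if the object is still reported missing after every attempt.

```python
import time


class ObjectMissing(Exception):
    """Raised when the store claims an object does not exist."""


def read_expected_object(fetch, key, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Read an object which our own records say must exist.

    `fetch(key)` is a hypothetical callable which returns the object's
    bytes or raises ObjectMissing. A "not found" answer is treated as
    possibly transient and retried with exponential backoff.
    """
    for attempt in range(attempts):
        try:
            return fetch(key)
        except ObjectMissing:
            if attempt == attempts - 1:
                # Still missing after every retry: stop and let a human look.
                raise
            sleep(base_delay * 2 ** attempt)
```

The final `raise` matters: retrying reduces the chance that a brief frontend glitch takes the service down, but a persistent "not found" still halts things rather than letting the server carry on as if nothing were wrong.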
I also sent details concerning this glitch to Amazon, and the AWS Service Health Dashboard now states that:
Between 3:45am and 4:06am PDT this morning, a subset of requests to our East Coast facilities returned 'Internal Server Error' or 'Not Found' error responses. These responses were generated due to an issue in the Amazon S3 frontend. Object storage was at no time affected by this issue. By following the best practice of re-resolving DNS and re-connecting to Amazon S3 the objects could be retrieved. The service is now operating normally.
Moral of the story: Trust but verify. No matter how fabulous the designers of a system are, you probably shouldn't trust it to get things right 100% of the time -- not even if it claims to have lost your data!
In Defence of Facts
I've long been a believer in the idea of voluntary public service, and one of the ways I attempt to serve my community is as one of four members of the Senate of Simon Fraser University who are elected by and from the university alumni. The Senate, taking a role dating back to medieval guild Masters, is the senior body of academic governance of the university; it carries ultimate responsibility for deciding who should be admitted to the university and to whom credentials should be granted.
On Monday evening, I was on the losing side of a vigorous debate over the adoption of a new admissions policy; the precise details are not relevant to this post, save for the observations that it was highly politically charged, that there were very strong opinions on both sides, and that I wasn't particularly surprised to find myself in the minority. The debate was civil, and most of the speakers made good points (even if I happen to disagree with their ultimate conclusions); but I was absolutely shocked by the comments from one speaker.
I opened the debate by speaking against the motion. Having spent the weekend doing research into the issue, I presented data from the university's internal statistics; from the provincial ministry of advanced education; and from Canada's federal statistical agency to show that the problem the policy was intended to solve in fact didn't exist.
Now, I knew it would be too much to hope that this would put an end to the question; I did hope, however, that it would sway a few votes. Surely, I thought, policy decisions must be guided by facts. Not so. Instead, I was castigated for "bringing numbers and percentages into the debate"!
Stephen Colbert famously invented the word "truthiness" to refer to the "truths" which a person claims to know "from the gut" without regard to evidence or logic. He applied it to George Bush's decision to invade Iraq in spite of the total absence of evidence linking Iraq to Al Qaeda or the 9/11 attacks; and we all pointed and laughed.
It's bad enough when the President of the United States of America, armed with thousands of nuclear weapons, makes decisions without regard to facts. When truthiness enters the debates of the senior academic governing body of a university, I fear for our society.
My bank stole 9 cents
At the end of each month I spend about half an hour doing accounting for my Tarsnap online backup service. I record the number and total amount of incoming payments, the fees charged by PayPal, the amount of backup usage which Tarsnap users were charged for, the website hosting costs, et cetera. A few days later, when Amazon Web Services finishes its monthly accounting I record that number as well, at which point I know how much profit Tarsnap made in the month. Today I took some extra time to compare numbers, and I came to an unsettling realization: My bank stole my money -- 0.09 US dollars of it, to be precise.
For obscure reasons involving PayPal and international banking, I withdraw money first from my PayPal account to an account at Harris Bank, then write myself a cheque (yes, on paper) which I deposit into a US dollar account at the Royal Bank of Canada, and then finally pay myself by converting this money into Canadian dollars (ideally when the exchange rate is good). On April 27 the second step failed.
On that day, I was depositing two US dollar cheques: One was a cheque for $X.19 from Tarsnap, and the other was a cheque for $Y.00 for some consulting I was doing. The automated teller machines at RBC branches are only able to handle Canadian dollar transactions, so I waited in line for an available bank teller and then handed over the two cheques asking that they be deposited into my US dollar account. She typed in the amounts of the two cheques; passed them through a machine which printed some information on the back; stamped them; and then entered the total deposit value into her computer as $(X+Y).10 -- 9 cents less than the correct total. I glanced at the receipt, thanked her, and left -- and didn't notice the discrepancy until 4 months later.
Several things went wrong here.
First, I should have checked the receipt more closely -- but I was distracted with thinking about key-value data stores and at first glance the receipt looked correct. Second, the bank teller should have input the correct total. I don't know if this was a typographical error or an error in reading the cheque -- my writing certainly can be difficult to read at times -- but there was certainly an error at some point.
Most importantly, however, the bank's computer systems should have refused to accept a deposit where the total value did not match the sum of the individual elements! If I were depositing cheques at an ATM, it certainly wouldn't have allowed me to enter a total different from the sum of the values I entered for the individual cheques; but I can see why this validation might have been missing in this case.
US dollar transactions are considered to be "foreign currency transactions" (which, of course, they are) and thus are subject to exchange rates (which isn't necessarily appropriate). If you transfer money between two US dollar bank accounts at RBC, the transaction will be shown as "foreign exchange" at a USD/USD exchange rate of 1.000000. If the items being deposited are in a different currency than the account they are being deposited into, the deposit value will almost certainly not match the sum of the values of the items being deposited, so skipping such a validation step is reasonable; but the bank's computer systems treat a transaction as foreign exchange if any non-Canadian dollars are involved, rather than treating it as foreign exchange only if there is more than one currency involved.
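The validation rule argued for above can be sketched in a few lines of Python. This is purely illustrative and assumes nothing about how RBC's systems actually work: the function name, the `(currency, amount)` record layout and the example amounts are all made up. The check is skipped only when more than one currency is genuinely involved; otherwise the entered total must equal the sum of the items, compared as exact decimals rather than floating-point values.

```python
from decimal import Decimal


def validate_deposit(account_currency, items, entered_total):
    """Reject a deposit whose total does not match the sum of its items.

    `items` is a list of (currency, amount) pairs with amounts given as
    strings; this layout is hypothetical.
    """
    currencies = {currency for currency, _ in items} | {account_currency}
    if len(currencies) > 1:
        # Genuine foreign exchange: the totals legitimately differ.
        return
    expected = sum((Decimal(amount) for _, amount in items), Decimal("0"))
    if Decimal(entered_total) != expected:
        raise ValueError(
            f"entered total {entered_total} does not match item sum {expected}"
        )
```

With two made-up US dollar cheques of 100.19 and 250.00 going into a US dollar account, an entered total of 350.10 raises an error instead of quietly losing 9 cents, while the same cheques deposited into a Canadian dollar account skip the check.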
As a computer scientist who abhors even the slightest of errors, I hope RBC fixes its systems and correctly validates deposits in the future; but as a busy software developer, I'm going to mark this down in my accounting as "bank error in bank's favour" and make sure I check receipts more closely in the future.