Why Tarsnap doesn't use Glacier
Two weeks ago, Amazon announced its new Glacier storage service, providing "archival" storage for as little as $0.01 per GB per month. Since I run an online backup service, this is naturally of interest to me, and in the day following Amazon's announcement I had about two dozen tweets and emails asking me if Tarsnap would be using Glacier. The answer? No. Not yet. But maybe some day in the future — if people want to use what a Glacierized Tarsnap would end up being.

On the surface, Tarsnap sounds like the perfect use case for Glacier. Out of every TB of data stored on Tarsnap, in any given month approximately 100 GB of data will be deleted, while only 33 GB of data will be downloaded; in other words, Tarsnap is very much a "write once, read maybe" storage system. Tarsnap's largest operational expense is S3 storage, which is roughly ten times as expensive as Glacier (most of the rest is the EC2 instances which run the Tarsnap server code), and a large majority of Tarsnap revenue is the per-GB storage pricing. If I could switch Tarsnap over to a cheaper storage back-end, I could correspondingly reduce the prices I charge Tarsnap users, which — in spite of some of my friends advising me that Tarsnap is too cheap already — is definitely something I'd like to do. The downside to Glacier, and the reason that it is much cheaper than S3, is that retrieval is slow — you have to request data and then come back four hours later — but there are many Tarsnap users who would be willing to wait a few hours extra to retrieve their backups if it meant they could store ten times as much data for the same price.
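For a rough sense of the gap, here is the storage bill per TB-month; the Glacier price is the announced $0.01 per GB, while the S3 price is simply assumed to be ten times that, per the approximate ratio above rather than a quoted rate:

```python
# Back-of-the-envelope storage cost per TB-month. GLACIER_PER_GB_MONTH comes
# from the announcement; S3_PER_GB_MONTH is *assumed* to be exactly 10x, per
# the rough ratio above, not a quoted price.
GLACIER_PER_GB_MONTH = 0.01
S3_PER_GB_MONTH = 10 * GLACIER_PER_GB_MONTH

TB = 1000  # GB
print(f"Glacier: ${TB * GLACIER_PER_GB_MONTH:.2f} per TB-month")   # $10.00
print(f"S3:      ${TB * S3_PER_GB_MONTH:.2f} per TB-month")        # $100.00
```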
As usual, the devil is in the details; in this case, there's one detail which makes things particularly devilish: Deduplication. Taking my personal laptop as an example: Every hour, Tarsnap generates an archive of 38 GB of files; I currently have about 1500 such archives stored. Instead of uploading the entire 38 GB — which would require a 100 Mbps uplink, far beyond what Canadian residential ISPs provide — Tarsnap splits this 38 GB into somewhere around 700,000 blocks, and for each of these blocks, Tarsnap checks if the data was uploaded as part of an earlier archive. Typically, there are around 300 new blocks which need to be uploaded; the rest are simply handled by storing pointers to the previous blocks and incrementing reference counters. (The reference counters are needed so that when an archive is deleted, Tarsnap knows which blocks are still being used by other archives.)
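As a minimal sketch of that bookkeeping (the in-memory dictionaries and SHA-256 block names below are illustrative stand-ins, not Tarsnap's actual client-side-encrypted format), deduplication plus reference counting looks roughly like this:

```python
# A minimal sketch of deduplicated storage with reference counting. The
# in-memory dicts and SHA-256 block names are illustrative stand-ins; real
# Tarsnap splits files into variable-size blocks and encrypts them client-side.
import hashlib
from collections import defaultdict

block_store = {}             # block name -> block data (standing in for S3)
refcount = defaultdict(int)  # block name -> number of archives referencing it

def store_archive(blocks):
    """Store an archive, uploading only blocks not already present."""
    manifest = []
    for data in blocks:
        name = hashlib.sha256(data).hexdigest()
        if name not in block_store:   # deduplication: skip known blocks
            block_store[name] = data
        refcount[name] += 1
        manifest.append(name)
    return manifest

def delete_archive(manifest):
    """Delete an archive; free a block only when nothing else references it."""
    for name in manifest:
        refcount[name] -= 1
        if refcount[name] == 0:
            del block_store[name], refcount[name]

m1 = store_archive([b"chunk A", b"chunk B"])
m2 = store_archive([b"chunk A", b"chunk C"])   # "chunk A" is deduplicated
delete_archive(m1)                             # frees "chunk B", keeps "chunk A"
print(len(block_store), "blocks stored")       # 2
```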
As a result, extracting an archive isn't simply a matter of downloading a single 38 GB blob; it involves making 700,000 separate block read requests. Retrieving data from Glacier isn't cheap: Like S3, Glacier has a per-request fee; but while S3's per-GET fee is $1 per million requests, Glacier's per-RETRIEVAL fee is $50 per million requests. I would pay $5.13 to extract that archive from Tarsnap right now; if it were stored on Glacier, it would cost Tarsnap $35 just for the Glacier retrievals alone, and Tarsnap's pricing for downloads would have to increase dramatically as a result.
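The request-fee arithmetic for a 700,000-block archive, using the per-request prices quoted above and ignoring storage and bandwidth charges, works out as follows:

```python
# Request-fee arithmetic for a ~700,000-block archive, using the per-request
# prices quoted above; storage and bandwidth charges are ignored.
BLOCKS = 700_000
S3_PER_GET = 1.00 / 1_000_000              # $1 per million GETs
GLACIER_PER_RETRIEVAL = 50.00 / 1_000_000  # $50 per million retrievals

print(f"S3 request fees:      ${BLOCKS * S3_PER_GET:.2f}")             # $0.70
print(f"Glacier request fees: ${BLOCKS * GLACIER_PER_RETRIEVAL:.2f}")  # $35.00
```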
It gets worse. Tarsnap doesn't merely deduplicate blocks of data; it also deduplicates blocks containing lists of blocks, and blocks containing lists of blocks of lists of blocks. This is important for reducing Tarsnap's bandwidth and storage usage — the amount of data Tarsnap uploads from my laptop each hour is less than what it would take just to list the 700,000 blocks which make up an hourly 38 GB archive — but it makes Glacier's four-hour round trip from requesting data to being able to read it much worse, since you would need to read some blocks — and wait four hours — before knowing which other blocks you need to read. Clearly, reading Tarsnap archives directly out of Glacier is not feasible.
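A toy example makes the dependency problem concrete: the block tree below is invented, but each level of indirection is a retrieval that cannot be issued until the previous one has returned, and with Glacier every one of those rounds costs about four hours.

```python
# A toy block tree with two levels of indirection. Which children to fetch is
# only known after the parent block has been retrieved, so the retrievals are
# sequential; with Glacier, each round is roughly a four-hour wait. The layout
# is invented for illustration and is not Tarsnap's actual format.
GLACIER_DELAY_HOURS = 4

blocks = {
    "root": {"kind": "index", "children": ["idx1"]},
    "idx1": {"kind": "index", "children": ["d1", "d2"]},
    "d1":   {"kind": "data"},
    "d2":   {"kind": "data"},
}

def sequential_rounds(name):
    """Count dependent (non-batchable) retrieval rounds needed to restore."""
    node = blocks[name]                       # one retrieval round trip
    if node["kind"] == "index":
        return 1 + max(sequential_rounds(c) for c in node["children"])
    return 1

rounds = sequential_rounds("root")
print(f"{rounds} dependent retrievals, roughly {rounds * GLACIER_DELAY_HOURS} hours of waiting")
```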
But maybe reading archives which are stored in Glacier is something we can avoid. After all, when you need to restore your backups, you usually want your most recent backup. Sure, there are cases when you realize that you need an important file which you deleted two months ago, and that's why it's important to keep some older backups as well; but could Tarsnap save money by offloading those older rarely-needed backups to Glacier? Here too Tarsnap's deduplication gets in the way: Of the aforementioned 700,000 blocks comprising my latest hourly backup, only a few thousand were uploaded earlier today; the vast majority were uploaded weeks or months ago. The Tarsnap server can't simply offload "old" data to Glacier, since many of the oldest blocks of data are still included in the newest archives — and it's the most recent archives which are likely to be retrieved, not just the most recent blocks. Backup systems which work with a "full plus incrementals" approach have an advantage here: Since extracting a recent archive is never going to need data from prior to the last complete backup, older archives can be placed into "cold storage"... of course, that is counterbalanced by the fact that such a system will end up performing many "full" backups over its lifetime, dramatically increasing the amount of bandwidth and storage space used.
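To put the deduplication problem in concrete terms: a block is only safe to glaciate if no recent archive references it, and because new archives reuse old blocks, very few blocks pass that test. A tiny made-up example:

```python
# Made-up archives sharing blocks, as deduplication produces. A block can be
# moved to Glacier only if no recent archive references it; here that leaves
# almost nothing to offload, because the newest archive reuses old blocks.
archives = {
    "2012-06-01": {"b1", "b2", "b3"},
    "2012-07-01": {"b1", "b2", "b4"},
    "2012-08-27": {"b1", "b2", "b4", "b5"},   # most recent backup
}
recent = {"2012-08-27"}

hot_blocks = set().union(*(archives[a] for a in recent))
all_blocks = set().union(*archives.values())
cold_blocks = all_blocks - hot_blocks          # safe to glaciate

print("cold blocks:", sorted(cold_blocks))     # only ['b3']
```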
So it isn't feasible for the Tarsnap server to move old archives out from "fast" S3 storage to "slow" Glacier storage; but what about the Tarsnap client? Could it potentially tell the server "here's a list of blocks I don't expect to need any time soon"? It turns out that a problem arises there too — not with archive retrieval, but instead with archive creation. Consider what happens if a block of data is in Glacier, but Tarsnap's deduplication code decides that block is needed in a new archive. If the block is referenced in its location in Glacier, you would have a situation where immediately after uploading an archive, you have to wait four hours before you can download it again. Could the Tarsnap client re-upload the block so that it can be stored in S3 without waiting for it to be fetched out of Glacier? Yes, at the expense of using extra bandwidth — but if the Tarsnap server code accepted that, it would be allowing a block to be overwritten by new data, which would violate Tarsnap's security requirement that archives are immutable until they are deleted.
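As a sketch of that constraint (the names and structure here are hypothetical, not the actual Tarsnap protocol), the server's write path simply refuses to overwrite an existing block, so there is no path by which a client can move a glaciated block back into S3 by re-sending it:

```python
# A hypothetical server-side write path that enforces immutability: once a
# block name exists (even if its data lives in Glacier), a new write with the
# same name is rejected, so a client cannot "promote" the block back into S3
# by re-uploading it. Names and structure are invented for illustration.
class ImmutableBlockStore:
    def __init__(self):
        self.blocks = {}   # block name -> (storage tier, data)

    def put(self, name, data, tier="s3"):
        if name in self.blocks:
            # Accepting this write would allow a block to be replaced with
            # different data, breaking the rule that archives are immutable.
            raise PermissionError(f"block {name} already exists; refusing to overwrite")
        self.blocks[name] = (tier, data)

store = ImmutableBlockStore()
store.put("blk-123", b"old data", tier="glacier")   # uploaded months ago
try:
    store.put("blk-123", b"old data")               # client re-uploads to avoid the wait
except PermissionError as e:
    print("rejected:", e)
```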
There is only one feasible way I can see for Tarsnap to use Glacier; in a sense, it's the simplest and most obvious one. Rather than operating at a fine-grained level with some archives being in "cold storage" and others being "warm", Tarsnap could support "glaciation" at a per-machine level. If a machine was "glaciated", it would be possible to create more archives (after all, glaciers can grow by having more snow fall on top of them!) and the storage would cost significantly less than it currently does (probably around 3-5 cents per GB per month), but you would not be able to read or delete any data. To do either of those, you would need to "deglaciate" the machine — which would take several hours and cost somewhere around 15-20 cents per GB of stored data — after which all the stored data would be back in S3 and accessible for normal random accesses. You would then be charged Tarsnap's normal storage pricing until you "glaciated" the machine again (which would most likely be free of charge).
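As a rough sketch of how that model would look from the user's side (the class and method names are my own invention, and the prices are the tentative estimates above, not a commitment):

```python
# A two-state sketch of per-machine glaciation, using the tentative prices
# from the text (roughly $0.03-0.05/GB-month stored, $0.15-0.20/GB to
# deglaciate). Class and method names are my own invention, not a planned API.
class Machine:
    def __init__(self, stored_gb):
        self.stored_gb = stored_gb
        self.glaciated = False

    def create_archive(self, gb):
        self.stored_gb += gb              # writing is allowed in either state

    def read_or_delete(self):
        if self.glaciated:
            raise RuntimeError("machine is glaciated; deglaciate first (takes hours)")

    def glaciate(self):
        self.glaciated = True             # assumed free of charge
        return 0.0

    def deglaciate(self, per_gb=0.18):    # mid-range of the 15-20 cent estimate
        self.glaciated = False
        return self.stored_gb * per_gb    # one-off cost to bring data back to S3

m = Machine(stored_gb=500)
m.glaciate()
m.create_archive(5)                       # still fine while glaciated
print(f"deglaciation cost: ${m.deglaciate():.2f}")   # about $91 for ~505 GB
```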
Is this a useful model? I'm not sure. It's not a model which is going to happen in the near future, since migrating data between S3 and Glacier in this way would involve a significant amount of careful design and coding; but if enough people are interested, it's a goal I can move towards. So tell me, dear readers: If Tarsnap allowed you to "glaciate" a machine, temporarily losing the ability to read or delete archives, but significantly reducing its monthly storage bill, would you do it?