Why Tarsnap won't use DynamoDB
When I heard last Wednesday that Amazon was launching DynamoDB I was immediately excited. The "hard" server-side work for my Tarsnap online backup service consists mostly of two really big key-value maps, and I've spent most of the past two years trying to make those faster and more scalable. Having a key-value datastore service would make my life much simpler, I thought. Unfortunately, upon reading into the details, I've decided that DynamoDB, at least in its present form, isn't something I want to use for Tarsnap.
As I've blogged about before, the Tarsnap server code synthesizes a log-structured filesystem on top of Amazon S3. As a result of this design, for each block of data stored on Tarsnap, the server code needs to keep track of two key-value pairs: the first maps a 64-bit machine number and a 256-bit block ID to a 64-bit log entry number and a 32-bit block length; the second maps the 64-bit log entry number to a 64-bit S3 object ID and an offset within that object (Tarsnap aggregates multiple blocks into each S3 object in order to amortize the S3 PUT cost). The first of these key-value pairs is 53 bytes long and stays fixed until a block is deleted; the second key-value pair is 24 bytes long and needs to be updated every time the log cleaner reads the block from S3 and writes it back in a new position.
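As a rough illustration of the two maps described above, here is a minimal Python sketch. The class and function names are hypothetical and are not taken from the Tarsnap server code, and in-memory dicts stand in for the real key-value stores; the exact on-disk encoding is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockKey:
    machine: int     # 64-bit machine number
    block_id: bytes  # 256-bit block ID (32 bytes)


@dataclass(frozen=True)
class BlockInfo:
    log_entry: int   # 64-bit log entry number
    length: int      # 32-bit block length


@dataclass(frozen=True)
class LogLocation:
    s3_object: int   # 64-bit S3 object ID
    offset: int      # position of the block inside that S3 object


# Map 1: fixed from the time a block is stored until it is deleted.
blocks = {}
# Map 2: rewritten whenever the log cleaner moves a block.
locations = {}


def store(key, length, log_entry, where):
    """Record a newly stored block in both maps."""
    blocks[key] = BlockInfo(log_entry, length)
    locations[log_entry] = where


def locate(key):
    """Follow both maps to find where a block currently lives in S3."""
    info = blocks[key]
    return locations[info.log_entry], info.length


def relocate(log_entry, new_where):
    """What a cleaning pass does: only the second map changes."""
    locations[log_entry] = new_where
```

The point of the indirection through the log entry number is that cleaning touches only the small second pair, while the larger first pair stays put.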
Time for some numbers. The average block size as seen on the Tarsnap server is 33 kB; the Tarsnap client code generates blocks of 64 kB on average, but it deflates blocks by roughly a factor of two on average (obviously some data doesn't compress at all, while other data gets compressed far more). DynamoDB costs $1/GB per month for storage, including a 100 byte overhead per key-value pair; and $0.01/hour for quanta of 10 writes/second or 50 reads/second. I want to perform a complete cleaning run every 14 days in order to avoid spending too much money storing deleted blocks of data; and due to inconsistent loads, I want to have reserved capacity at least double my average throughput.
For each TB of data stored, this gives me 30,000,000 blocks requiring 60,000,000 key-value pairs; these occupy 2.31 GB, but for DynamoDB pricing purposes, they count as 8.31 GB, or $8.31 per month. That's about 2.7% of Tarsnap's gross revenues (30 cents per GB per month, or $300 per TB); significant, but manageable. However, each of those 30,000,000 blocks needs to go through log cleaning every 14 days, a process which requires a read (to check that the block hasn't been marked as deleted) and a write (to update the map to point at the new location in S3). That's an average rate of 25 reads and 25 writes per second, so I'd need to reserve 50 reads and 50 writes per second of DynamoDB capacity. The reads cost $0.01 per hour while the writes cost $0.05 per hour, for a total cost of $0.06 per hour, or $44 per month. That's 14.6% of Tarsnap's gross revenues; together with the storage cost, DynamoDB would eat up 17.3% of Tarsnap's revenue, slightly over $0.05 from every $0.30/GB I take in.
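The arithmetic above can be reproduced with a short Python sketch. All inputs are the figures quoted in the text, except the 730 hours per month, which is an assumption used here to convert the hourly rate into a monthly one; the names are hypothetical.

```python
import math

BLOCKS_PER_TB = 30_000_000       # blocks per TB, as quoted in the text
PAIR_BYTES = 53 + 24             # the two key-value pairs kept per block
OVERHEAD_BYTES = 100             # DynamoDB overhead per key-value pair
STORAGE_PRICE_PER_GB = 1.00      # dollars per GB per month
UNIT_PRICE_PER_HOUR = 0.01       # dollars per throughput quantum per hour
WRITES_PER_UNIT = 10             # writes/second per quantum
READS_PER_UNIT = 50              # reads/second per quantum
CLEANING_PERIOD_DAYS = 14
HEADROOM = 2                     # reserve double the average throughput
HOURS_PER_MONTH = 730            # assumption: 365 * 24 / 12
REVENUE_PER_TB = 0.30 * 1000     # dollars per TB per month


def dynamodb_cost_per_tb():
    """Return (storage, throughput) cost in dollars per TB per month."""
    # Two pairs per block, each billed with the per-pair overhead.
    billed_bytes = BLOCKS_PER_TB * (PAIR_BYTES + 2 * OVERHEAD_BYTES)
    storage = billed_bytes / 1e9 * STORAGE_PRICE_PER_GB

    # One read and one write per block per cleaning run.
    average_rate = BLOCKS_PER_TB / (CLEANING_PERIOD_DAYS * 86400)
    reserved_rate = average_rate * HEADROOM
    read_units = math.ceil(reserved_rate / READS_PER_UNIT)
    write_units = math.ceil(reserved_rate / WRITES_PER_UNIT)
    hourly = (read_units + write_units) * UNIT_PRICE_PER_HOUR
    return storage, hourly * HOURS_PER_MONTH


storage, throughput = dynamodb_cost_per_tb()
share = (storage + throughput) / REVENUE_PER_TB
print(f"storage ${storage:.2f}, throughput ${throughput:.2f}, "
      f"{share:.1%} of revenue")
```

Small differences from the percentages in the text come from rounding the intermediate values (25 per second, $44 per month) before dividing.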
Of course, if that was the only cost Tarsnap had, I'd be overjoyed; but remember, this is just for managing metadata. I'd still have the cost of S3 storage for the data itself, and I'd still have to pay for EC2 instances. Would an extra 5 cents per GB make Tarsnap unprofitable? No; but it would definitely hurt the bottom line. Instead of using DynamoDB, I'm continuing with the project I started a couple of years ago: My Kivaloo data store.
Do I think Kivaloo is better than DynamoDB? Not for most people. But there are important differences which make Kivaloo more suitable for me. For a start, DynamoDB is fast. Very fast. It's built on SSDs and delivers "single-digit millisecond" read and write latencies. But backups don't need low latencies; if you're backing up several GB of data you're hardly going to notice a few milliseconds. Kivaloo targets high throughput instead, which is to say, lower cost for a given throughput. Even more importantly, Tarsnap has a peculiar access pattern: Almost all the writes occur in sequential regions. DynamoDB, because it aims for consistent general-purpose performance, needs to be good at handling the worst-case scenario of random accesses; Kivaloo, in contrast, uses a B+Tree which is ideal for handling bulk ordered inserts and updates.
In the end, it comes down to the numbers: On an EC2 c1.medium instance (costing $0.17/hour) Kivaloo can perform 130,000 inserts per second and 60,000 updates per second with Tarsnap's access pattern. On DynamoDB, that would cost $60-$130/hour. True, Kivaloo isn't replicated yet, and when I add that functionality it will increase the cost and decrease the throughput; but even then it will have a price/performance advantage of two orders of magnitude over DynamoDB.
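The DynamoDB figure above follows from the write pricing quoted earlier ($0.01/hour per quantum of 10 writes/second). A minimal sketch of that arithmetic, using a hypothetical function name and ignoring replication entirely:

```python
import math

WRITES_PER_UNIT = 10         # writes/second per DynamoDB quantum
UNIT_PRICE_PER_HOUR = 0.01   # dollars per quantum per hour
C1_MEDIUM_PER_HOUR = 0.17    # dollars per hour for the EC2 instance


def dynamodb_write_cost_per_hour(writes_per_second):
    """Hourly cost of reserving enough write capacity for the given rate."""
    units = math.ceil(writes_per_second / WRITES_PER_UNIT)
    return units * UNIT_PRICE_PER_HOUR


# 60,000 updates/second and 130,000 inserts/second, as quoted in the text.
for rate in (60_000, 130_000):
    cost = dynamodb_write_cost_per_hour(rate)
    ratio = cost / C1_MEDIUM_PER_HOUR
    print(f"{rate} writes/s: ${cost:.2f}/hour, {ratio:.0f}x the c1.medium price")
```

With these inputs the ratio comes out between roughly 350 and 765 before any replication overhead is accounted for.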
I think DynamoDB is a great service, and I'd encourage everybody to explore its possible uses. But as far as Tarsnap is concerned, DynamoDB may be the world's best hammer, but what I really need is a screwdriver.