Adding durability to the kivaloo data store
365 days ago, I announced my kivaloo data store here. Architected to maximize performance for the particular workload my online backup service has, it provides a much better cost:performance ratio than Amazon's DynamoDB; but as a single-machine data store it had some limitations:
- It wrote data to a single disk, so its durability was limited to the durability of "local (fsynced) disk";
- It ran on a single machine, so its availability was limited to the availability of a single machine and its network connection; and
- It was mostly single-threaded (I/O had separate threads, but all the "real" work was done in a single thread), so its performance was limited to roughly 100,000 operations per second on 80-byte key-value pairs.
One of my design principles for kivaloo from the start has been that it should be as modular as possible. The core B+Tree code, for example, takes key-value operations and builds a log-structured tree; but it doesn't write that tree directly to disk. Rather, it hands off the blocks to a separate daemon whose only role is to sit in front of the underlying storage device and push blocks back and forth. As such, replacing "storage on local disk" with "storage somewhere other than on local disk" becomes easy: Just write a new storage daemon.
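To make that division of labour concrete, here is a minimal C sketch of the sort of interface the B+Tree code could see. The names and signatures are illustrative assumptions, not kivaloo's actual protocol:

```c
/* Hypothetical interface between the B+Tree layer and a storage daemon.
 * These names are illustrative; they are not kivaloo's real wire protocol. */
#include <stddef.h>
#include <stdint.h>

struct blockstore {
	/* Read one block, identified by its 64-bit block number. */
	int (*get)(void *cookie, uint64_t blkno, uint8_t *buf, size_t blklen);

	/* Append a group of blocks; the daemon assigns the next block
	 * numbers and reports the first one back via *start_blkno. */
	int (*append)(void *cookie, const uint8_t *bufs, size_t nblks,
	    size_t blklen, uint64_t *start_blkno);

	/* Tell the daemon that blocks below the given number are no longer
	 * needed and may be garbage-collected. */
	int (*free_below)(void *cookie, uint64_t blkno);

	/* Opaque state belonging to the particular storage backend. */
	void *cookie;
};
```

Any backend which can answer these requests (read a block, append blocks, free old blocks) can sit behind the B+Tree, whether it keeps the blocks on a local disk or somewhere else entirely.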
Since the most critical limitation of kivaloo for my purposes was the lack of multi-disk durability, I decided to replace the storage daemon with one which uses the most durable storage I know: Amazon S3. In keeping with kivaloo's design, this involved writing two daemons: One which implements log-structured storage on top of S3 (taking care to avoid "eventual consistency" chaos by making careful use of the read-after-create consistency offered in all of the non-US-Standard S3 regions), and a second daemon which acts merely as an S3 proxy, picking which S3 IP address to use (and performing the messy but necessary stream of DNS requests in a helper process), signing the requests, and forwarding them along.
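The reason read-after-create consistency suffices is that a log-structured store never modifies an existing object: every write lands in a brand-new, uniquely named S3 object. A rough C sketch of that idea follows; s3_put is a hypothetical stand-in for a signed request forwarded through the proxy daemon, not kivaloo's actual API:

```c
/* Sketch: store each batch of appended blocks as a brand-new S3 object.
 * Because block numbers only ever increase, the object name chosen below
 * has never existed before, so reading it back relies only on S3's
 * read-after-create consistency.  s3_put() is a hypothetical helper. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int s3_put(const char *key, const uint8_t *buf, size_t len);	/* hypothetical */

static int
log_append(uint64_t start_blkno, const uint8_t *bufs, size_t nblks,
    size_t blklen)
{
	char key[64];

	/* Encode the first block number of this batch into the object name. */
	snprintf(key, sizeof(key), "blks_%016" PRIx64, start_blkno);

	/* Write the batch as one new object; old objects are never rewritten. */
	return s3_put(key, bufs, nblks * blklen);
}
```

Since nothing is ever overwritten, the only remaining housekeeping is to garbage-collect old objects once the blocks they hold are no longer needed.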
Since this is no longer using local storage but instead writing all the key-value data over the network — and to a storage service better known for bulk data storage than for fast transactional performance — you might expect to see lower performance from this, and judging by my first (and very preliminary, performed less than two hours after I finished squashing bugs) benchmarks you'd be right. Instead of about 125,000 key-value pair writes per second, I'm seeing only about 30,000 — on an EC2 c1.medium instance.
But for many applications 99.999999999% durable storage will easily trump a slightly higher request rate; and even with the S3 backend, kivaloo is still two orders of magnitude ahead of DynamoDB in its cost:performance ratio... for my workload, at least. More of a problem is availability: while there isn't any local state except for the parts of the B+Tree which are cached in RAM, if the EC2 instance running kivaloo with the S3 backend dies, it could take as much as 60 seconds to launch a new EC2 instance and be ready to accept requests again.
Let's see if I can get replication done before this time next year!