The kivaloo data store
Just over a year ago, I sat down to a late breakfast with Patrick Collison to discuss his latest startup. At some point over the next couple of hours, we started talking about my online backup service, Tarsnap, and I mentioned that I was keeping my eye on some server-side scalability issues. "I'm OK for the next year at the current growth rate, but then I'll need to get a more sophisticated data store in place to handle block metadata; right now I'm using a very simple, obviously correct, but rather slow data structure."
"I'm impressed with what the rethinks are doing, but it feels like they're doing too much -- my data store needs are very minimal," I continued. "Maybe I should just write my own data store; it can't take more than a few months."
I'm very pleased to finally announce the availability of version 1.0.0 of the kivaloo data store as BSD-licensed open source software.
To be fair, kivaloo has been almost-released for a long time. Five months ago I wrote here that the first release of my data store would "be soon" and asked people to suggest possible names for it (the winner: Tim Fletcher, who suggested "kivalu", pronounced "key value"; I changed the spelling slightly, but the essential idea was his). Shortly after writing that blog post, however, I was distracted by porting FreeBSD to EC2 and then by a critical bug in Tarsnap, so it's only recently that I've had a chance to do the last few bits of work needed before kivaloo could be released.
So what is kivaloo? It is a durable, consistent, high-performance key-value data store built out of a background-garbage-collected log-structured B+Tree. Perhaps it's easier to describe what kivaloo isn't:
- It is not a key-blob store like memcached or a key-object store like Redis; each key has a single value of at most 255 bytes associated with it.
- It does not support queries like SimpleDB does; the basic operations are SET, GET, DELETE, and RANGE (which returns all key-value pairs between two specified keys).
- It does not support transactions like ACID databases do; if you want transactions, you can synthesize them at a higher level (kivaloo provides atomic compare-and-set and compare-and-delete operations, so locking can be performed through it).
- It does not provide fast random updates for large data stores like Cassandra does; those are incompatible with a high ratio of data store size to RAM size, and I wanted efficient storage of cold data.
- It is not merely "eventually" consistent like many NoSQL data stores; kivaloo does not acknowledge a request until data has been written to disk and fsync has returned.
- It doesn't have any concept of users or authorization like most databases do; if you can connect to a listening socket, you can issue requests. (UNIX sockets and firewalls are your friends!)
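To make the operation set above concrete, here is a minimal in-memory sketch in Python. It is not kivaloo's code or its wire protocol: the class `ToyKVStore`, the helpers `acquire_lock` and `release_lock`, the `lock/` key prefix, the half-open RANGE boundaries, and the use of `None` to mean "key absent" are all assumptions made for illustration. It only shows how SET, GET, DELETE, and RANGE fit together, and how a lock might be synthesized at a higher level from compare-and-set and compare-and-delete.

```python
import bisect


class ToyKVStore:
    """In-memory stand-in for the operations listed above (not kivaloo's API)."""

    MAX_VALUE_LEN = 255  # the post says a value is at most 255 bytes

    def __init__(self):
        self._keys = []  # keys kept sorted so RANGE is a simple slice
        self._data = {}  # key -> value

    def set(self, key, value):
        if len(value) > self.MAX_VALUE_LEN:
            raise ValueError("value longer than 255 bytes")
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent (an assumption of this sketch).
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def range(self, start, end):
        """Return pairs with start <= key < end (boundary choice is assumed)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]

    def compare_and_set(self, key, expected, value):
        """Store value only if the current value equals expected."""
        if self._data.get(key) != expected:
            return False
        self.set(key, value)
        return True

    def compare_and_delete(self, key, expected):
        """Delete key only if it exists and its value equals expected."""
        if key not in self._data or self._data[key] != expected:
            return False
        self.delete(key)
        return True


def acquire_lock(store, name, owner):
    # Hypothetical helper: the lock is free when its key is absent.
    return store.compare_and_set(b"lock/" + name, None, owner)


def release_lock(store, name, owner):
    # Hypothetical helper: only the current owner can remove the lock key.
    return store.compare_and_delete(b"lock/" + name, owner)
```

With this sketch, a second `acquire_lock(store, b"x", b"worker2")` returns `False` while `b"worker1"` holds the lock, and only `release_lock(store, b"x", b"worker1")` removes it. A real client would also have to deal with crashed lock holders, which this toy ignores.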
In short, kivaloo is designed to be exactly what I need for Tarsnap. This is open source software in the time-honoured tradition of scratching an itch; I hope other people will find kivaloo useful and possibly even contribute back, but even if nobody else ever uses kivaloo, it will make a big difference to the Tarsnap server performance and scalability.
Take a look and let me know what you think.
FreeBSD/EC2 cluster compute
A few months ago, I announced experimental FreeBSD/EC2 support, and for the past four weeks FreeBSD 8.2-RELEASE AMIs have been available on Amazon EC2; but unfortunately these have been limited to "t1.micro" instances. It's impressive how much can be done with a fraction of a CPU and 600 MB of RAM; but sometimes you really need something a bit more powerful. I'm pleased to announce that, thanks to support from SegPub and vtalk, FreeBSD is now available on cc1.4xlarge instances.
For those of you unfamiliar with the wide range of virtual machines available from EC2, perhaps the best way to put it is this: cc1.4xlarge instances are as big as t1.micro instances are small. They have 8 cores of 2.93 GHz Nehalem, 23 GB of RAM, two 840 GB ephemeral disks (plus as many EBS volumes as you want to create, of course), and 10 Gbps network connectivity. The name "cluster compute" suggests one way of using these instances, but Amazon would have been perfectly justified in calling these "do anything you want and still have power to spare" instances.
One of the things I've heard a lot of EC2 users say they want over the past few months is ZFS support. Linux, of course, doesn't support ZFS (userland kludges and license violations notwithstanding); and with Oracle apparently doing its best to kill OpenSolaris, FreeBSD has rapidly become the de facto standard operating system for ZFS. Unfortunately, ZFS wasn't designed for 32-bit systems with 600 MB of RAM, so attempting to run it on t1.micro instances is a very good way to cause a kernel panic; but it works beautifully on cc1.4xlarge instances.
Because cc1.4xlarge instances run "hardware virtualized" rather than "paravirtualized" Xen, they avoid most of the hard work which was needed to get t1.micro instances working. Indeed, bringing FreeBSD to cc1.4xlarge instances only required one significant bugfix (actually a workaround -- there's a bug in the Xen serial port emulation and I had to modify FreeBSD's UART code to be compatible with it) and the rest of the work was packaging and wrangling startup scripts. As a result, I have absolutely no hesitation in saying that FreeBSD is production-ready on cluster compute instances. (8.2-RELEASE on t1.micro is in "use with caution" territory -- so far it seems very stable, but it's too early to be confident about its stability.)
What's next for FreeBSD/EC2? Probably improving FreeBSD 9.0-CURRENT stability on t1.micro instances. Right now there are pmap locking (or rather, lack-of-locking) bugs in FreeBSD's paravirtualized Xen code which make 9-CURRENT far less stable than 8.2-RELEASE. With some luck I should be able to get this done before 9-STABLE branches in a few months so that it can be tested in the lead-up to 9.0-RELEASE.
But this depends in large part on what FreeBSD users find that they need in EC2. Go launch some instances and let me know what you think.