Dissecting SimpleDB BoxUsage
Billing for usage of a database server which is shared between many customers is hard. You can't just measure the size of databases, since a heavily used 1 GB database is far more resource-intensive than a lightly used 100 GB database; you can't just count queries, since some queries require far more CPU time -- or disk accesses -- than others; and you can't even time how long queries take, since modern databases can handle several queries in parallel, overlapping one query's CPU time with another query's disk time. When Amazon launched their SimpleDB service, it looked like they had found a solution in BoxUsage: As the website states, "Amazon SimpleDB measures the machine utilization of each request and charges based on the amount of machine capacity used to complete the particular request [...]", and it reports back a BoxUsage value in every response returned by SimpleDB. Sadly, this "measurement" is fictitious: With the possible exception of Query requests, BoxUsage values returned by SimpleDB are entirely synthetic.
Take creating a domain, for example. Issue a CreateDomain request, and SimpleDB will report back to you that it took 0.0055590278 machine hours -- never 0.0055590277 or 0.0055590279 hours, always exactly 0.0055590278 machine hours. Deleting a domain? Exactly the same: Whether the domain is empty or contains lots of items -- for that matter, even if the domain doesn't exist -- the BoxUsage reported will be exactly 0.0055590278 hours. Listing the domains you have? That costs 0.0000071759 hours -- again, never even a tenth of a nano-hour more or less.
So much for domains; what about storing, retrieving, and deleting data? Issue a PutAttributes call with one attribute, and it will cost 0.0000219909 hours -- no matter if the item already exists or not, no matter if the item name, attribute name, and value are one character long or 100 characters long. Issue a PutAttributes call with two attributes, and it will cost 0.0000219923 hours. Three attributes costs 0.0000219961 hours. Four attributes costs 0.0000220035 hours. See the pattern yet? If not, don't worry -- it took me a while to figure this one out, mostly because it was so surprising: A PutAttributes call with N attributes costs 0.0000219907 + 0.0000000002 N^3 hours. Yes, that's right: The cost is cubic in the number of attributes -- and I can't imagine any even remotely sane algorithm which would end up with an O(N^3) cost.
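Here's a minimal sketch of that reconstructed formula, with the observed values from above as a sanity check; the constants are my inferences from watching BoxUsage, not anything Amazon publishes:

```python
# A minimal sketch of the reconstructed PutAttributes formula; the constants
# are inferred from the observations above, not anything Amazon publishes.

def put_attributes_box_usage(n_attributes):
    """BoxUsage in machine-hours for a PutAttributes call with N attributes."""
    return 0.0000219907 + 0.0000000002 * n_attributes ** 3

# Check against the values quoted above for 1 through 4 attributes.
for n, observed in [(1, 0.0000219909), (2, 0.0000219923),
                    (3, 0.0000219961), (4, 0.0000220035)]:
    assert abs(put_attributes_box_usage(n) - observed) < 1e-12
```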
Retrieving stored data is cheaper: A GetAttributes call which returns N attribute-value pairs costs 0.0000093202 + 0.0000000020 N^2 hours (since the pricing depends on the number of values returned, not the number of values in the item in question, there's good incentive to specify which attributes you're interested in when you send a GetAttributes request). Deleting stored data? Back to cubic again: A DeleteAttributes call with N attributes specified costs 0.0000219907 + 0.0000000002 N^3 hours -- exactly the same as a PutAttributes call with the same number of attributes. Of course, DeleteAttributes has the advantage that you can specify just the item name and not provide any attribute names, in which case all of the attributes associated with the item will be deleted -- and if you do this, the reported BoxUsage is 0.0000219907 hours, just like the formula predicts with N = 0.
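The other two data operations work out the same way; here's a sketch under the same assumption that my inferred constants are right:

```python
# The same idea for the other two data operations; again, inferred constants.
# DeleteAttributes reuses the PutAttributes formula, with N = 0 when only the
# item name is given and every attribute is deleted.

def get_attributes_box_usage(n_values):
    """BoxUsage in machine-hours for a GetAttributes call returning N values."""
    return 0.0000093202 + 0.0000000020 * n_values ** 2

def delete_attributes_box_usage(n_attributes):
    """BoxUsage in machine-hours for a DeleteAttributes call naming N attributes."""
    return 0.0000219907 + 0.0000000002 * n_attributes ** 3

assert abs(delete_attributes_box_usage(0) - 0.0000219907) < 1e-15
```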
The last type of SimpleDB request is a Query: "Tell me the names of items matching the following criteria". Here SimpleDB might actually be measuring machine utilization -- but I doubt it. More likely, the formula just happens to be sufficiently complicated that I haven't been able to work it out. What I can say is that a Query of the form [ 'foo' = 'bar' ] -- that is, "Tell me the names of the items which have the value 'bar' associated with the attribute 'foo'" -- costs 0.0000140000 + 0.0000000080 N hours, where N is the number of matching items; and that even for the more complicated queries which I tried, the cost was always a multiple of 0.0000000040 hours.
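For what it's worth, here's that simplest observed Query cost written out as code; since more complicated queries cost more, treat it as a lower bound rather than a general formula:

```python
# The simplest observed Query cost, for a single-predicate query returning N
# items; more complicated queries cost more, always in steps of 0.0000000040
# hours, so this is a lower bound rather than a general formula.

def simple_query_box_usage(n_items):
    """BoxUsage in machine-hours for a ['foo' = 'bar'] Query returning N items."""
    return 0.0000140000 + 0.0000000080 * n_items
```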
Now, there are a lot of odd-looking numbers here -- the variable costs are all small multiples of a tenth of a nano-hour, and the overhead cost of a Query is 14 micro-hours, but the others look rather strange. Convert them to seconds and apply rational reconstruction, however, and they make a bit more sense (a quick check in code follows the list):
- 0.0055590278 hours = 4803 / 240 seconds.
- 0.0000071759 hours = (31/5) / 240 seconds.
- 0.0000219907 hours = 19 / 240 seconds.
- 0.0000093202 hours = (153/19) / 240 seconds.
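To double-check those reconstructions, here's a short script which converts each candidate fraction back to hours and compares it against the reported BoxUsage figure:

```python
from fractions import Fraction

# Convert each candidate fraction (in seconds) back to hours and make sure it
# matches the BoxUsage figure SimpleDB actually reports, to within rounding.
candidates = [
    (0.0055590278, Fraction(4803, 240)),      # CreateDomain / DeleteDomain
    (0.0000071759, Fraction(31, 5) / 240),    # ListDomains
    (0.0000219907, Fraction(19, 240)),        # PutAttributes / DeleteAttributes
    (0.0000093202, Fraction(153, 19) / 240),  # GetAttributes
]
for reported_hours, seconds in candidates:
    assert abs(float(seconds / 3600) - reported_hours) < 5e-11
```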
Putting this all together, here's what SimpleDB requests cost (at least right now); μ$ means millionths of a dollar (or dollars per million requests):
| Request Type | BoxUsage (hours) | BoxUsage (seconds) | Overhead Cost (μ$) | Variable Cost (μ$) |
|---|---|---|---|---|
| CreateDomain, DeleteDomain | 0.0055590278 | 4803 / 240 | 778.264 | |
| ListDomains | 0.0000071759 | (6 + 1/5) / 240 | 1.005 | |
| PutAttributes (N attributes specified), DeleteAttributes (N attributes specified) | 0.0000219907 + 0.0000000002 N^3 | 19 / 240 + 0.00000072 N^3 | 3.079 | 0.000028 N^3 |
| GetAttributes (N values returned) | 0.0000093202 + 0.0000000020 N^2 | (8 + 1/19) / 240 + 0.00000720 N^2 | 1.305 | 0.000280 N^2 |
| Query (N items returned) | 0.0000140000 + 0.0000000080 N or more | 0.0504 + 0.00002880 N or more | 1.960 | 0.001120 N or more |
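The dollar columns are just the BoxUsage figures multiplied by the $0.14 per machine-hour rate implied by the overhead costs; for instance:

```python
# Converting BoxUsage into the dollar figures above, assuming the $0.14 per
# machine-hour rate implied by the overhead-cost column.

RATE = 0.14  # dollars per machine-hour

def microdollars(box_usage_hours):
    return box_usage_hours * RATE * 1e6

print(microdollars(0.0055590278))   # CreateDomain: ~778.264 μ$
print(microdollars(0.0000219909))   # PutAttributes, one attribute: ~3.079 μ$
```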
What can we conclude from this? First, if you want to Put 53 or more attributes associated with a single item, it's cheaper to use two or more requests due to the bizarre cubic cost formula. Second, if you want to Get attributes and expect to have more than 97 values returned, it's cheaper to make two requests, each of which asks for a subset of the attributes. Third, if you have an item with only one attribute, and your read:write ratio is more than 22:1, it's cheaper to use S3 instead of SimpleDB -- even ignoring the storage cost -- since S3's 1 μ$ per GET is cheaper than SimpleDB's 1.305 μ$ per GetAttributes request. Fourth, someone at Amazon was smoking something interesting, since there's no way that a PutAttributes call should have a cost which is cubic in the number of attributes being stored.
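The first two break-even points can be re-derived from the formulas above; here's a quick sketch, assuming the reconstructed constants and a request split as evenly as possible across two calls:

```python
# Re-deriving the first two break-even points, assuming the formulas above and
# splitting a request as evenly as possible into two halves.

def put_cost(n):
    return 0.0000219907 + 0.0000000002 * n ** 3

def get_cost(n):
    return 0.0000093202 + 0.0000000020 * n ** 2

def break_even(cost):
    """Smallest N at which two half-sized requests beat one big request."""
    n = 1
    while cost(n) <= cost(n // 2) + cost(n - n // 2):
        n += 1
    return n

print(break_even(put_cost))  # 53 attributes in one PutAttributes call
print(break_even(get_cost))  # 97 values returned, assuming an even split
```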
And finally, given that all of these costs are repeatable down to a fraction of a microsecond: Someone at Amazon may well have determined that these formulas provide good estimates of the amount of machine capacity needed to service requests; but these are most definitely not measurements of anything.