A very valuable vulnerability
While I very firmly wear a white hat, it is useful to be able to consider things from the perspective of the bad guys, in order to assess the likelihood of a vulnerability being exploited and its potential impact. For the subset of bad guys who exploit security vulnerabilities for profit — as opposed to selling them to spy agencies, for example — I imagine that there are some criteria which would tend to make a vulnerability more valuable:
- the vulnerability can be exploited remotely, over the internet;
- the attack cannot be blocked by firewalls;
- the attack can be carried out without any account credentials on the system being attacked;
- the attack yields money (as opposed to, say, credit card details which need to be separately monetized);
- once successfully exploited, there is no way for a victim to reverse or mitigate the damage; and
- the attack can be performed without writing a single line of code.
The vulnerability — which has since been fixed, or else I would not be writing about it publicly — was in Stripe's bitcoin payment functionality. Some background for readers not familiar with this: Stripe provides payment processing services, originally for credit cards but now also supporting ACH, Apple Pay, Alipay, and Bitcoin, and was designed to be the payment platform which developers would want to use; in very much the way that Amazon fixed the computing infrastructure problem with S3 and EC2 by presenting storage and compute functionality via simple APIs, Stripe fixed the "getting money from customers online" problem. I use Stripe at my startup, Tarsnap, and was in fact the first user of Stripe's support for Bitcoin payments: Tarsnap has an unusually geeky and privacy-conscious user base, so this functionality was quite popular among Tarsnap users.
Despite being eager to accept Bitcoin payments, I don't want to actually handle bitcoins; Tarsnap's services are priced in US dollars, and that's what I ultimately want to receive. Stripe abstracts this away for me: I tell Stripe that I want $X, and it tells me how many bitcoins my customer should send and to what address; when the bitcoin turns up, I get the US dollars I asked for. Naturally, since the exchange rate between dollars and bitcoins fluctuates, Stripe can't guarantee the exchange rate forever; instead, they guarantee the rate for 10 minutes (presumably they figured out that the exchange rate volatility is low enough that they won't lose much money over the course of 10 minutes). If the "bitcoin receiver" isn't filled within 10 minutes, incoming coins are converted at the current exchange rate.
For a variety of reasons, it is sometimes necessary to refund bitcoin transactions: For example, a customer cancelling their order; accidentally sending in the wrong number of bitcoins; or even sending in the correct number of bitcoins, but not within the requisite time window, resulting in their value being lower than necessary. Consequently, Stripe allows for bitcoin transactions to be refunded — with the caveat that, for obvious reasons, Stripe refunds the same value of bitcoins, not the same number of bitcoins. (This is analogous to currency exchange issues with credit cards — if you use a Canadian dollar credit card to buy something in US dollars and then get a refund later, the equal USD amount will typically not translate to an equal number of CAD refunded to your credit card.)
The vulnerability lay in the exchange rate handling. As I mentioned above, Stripe guarantees an exchange rate for 10 minutes; if the requisite number of bitcoins arrive within that window, the exchange rate is locked in. So far so good; but what Stripe did not intend was that the exchange rate was locked in permanently — and applied to any future bitcoins sent to the same address.
This made a very simple attack possible:
- Pay for something using bitcoin.
- Wait until the price of bitcoin drops.
- Send more bitcoins to the address used for the initial payment.
- Ask for a refund of the excess bitcoin.
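To make the mechanics concrete, here is a minimal sketch of the flawed logic. This is not Stripe's code; the class, field, and method names are all invented purely to show the difference between locking a rate for one 10-minute payment window and locking it to an address forever.

    # Minimal illustration of the bug; all names are hypothetical.
    import time

    GUARANTEE_SECONDS = 600        # the 10-minute exchange-rate guarantee

    class BitcoinReceiver:
        def __init__(self, usd_due, guaranteed_rate):
            self.usd_due = usd_due                  # amount the merchant asked for, in USD
            self.guaranteed_rate = guaranteed_rate  # USD per BTC, fixed at creation time
            self.created = time.time()
            self.credited_usd = 0.0

        def credit_buggy(self, btc_amount, current_rate):
            # BUG: the locked-in rate is applied to every coin ever sent to
            # this address, no matter how late it arrives.
            self.credited_usd += btc_amount * self.guaranteed_rate

        def credit_fixed(self, btc_amount, current_rate):
            # The guarantee only covers coins arriving within the window;
            # anything later is converted at the prevailing rate.
            if time.time() - self.created <= GUARANTEE_SECONDS:
                rate = self.guaranteed_rate
            else:
                rate = current_rate
            self.credited_usd += btc_amount * rate

With the buggy path, an attacker who locked in a rate of, say, $600/BTC and later sends in 1 BTC after the price has fallen to $400/BTC is credited with $600 of value; a refund of that $600, paid out in bitcoins at the new $400/BTC rate, comes to 1.5 BTC, a free 0.5 BTC for the attacker.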
Needless to say, I reported this to Stripe immediately. Fortunately, their website includes a GPG key and advertises a vulnerability disclosure reward (a.k.a. bug bounty) program; these are two things I recommend that every company do, because they advertise that you take security seriously and help to ensure that when people stumble across vulnerabilities, they'll let you know. (As it happens, I had Stripe security's public GPG key already and like them enough that I would have taken the time to report this even without a bounty; but it's important to maximize the odds of receiving vulnerability reports.) Since it was late on a Friday afternoon and I was concerned about how easily this could be exploited, I also hopped onto Stripe's IRC channel to ask one of the Stripe employees there to relay a message to their security team: "Check your email before you go home!"
Stripe's handling of this issue was exemplary. They responded promptly to confirm that they had received my report and reproduced the issue locally; and a few days later followed up to let me know that they had tracked down the code responsible for this misbehaviour and that it had been fixed. They also awarded me a bug bounty — one significantly in excess of the $500 they advertise, too.
As I remarked six years ago, Isaac Asimov's remark that in science "Eureka!" is less exciting than "That's funny..." applies equally to security vulnerabilities. I didn't notice this issue because I was looking for ways to exploit bitcoin exchange rates; I noticed it because a Tarsnap customer accidentally sent bitcoins to an old address and the number of coins he got back when I clicked "refund" was significantly less than what he had sent in. (Stripe has corrected this "anti-exploitation" of the vulnerability.) It's important to keep your eyes open; and it's important to encourage your customers to keep their eyes open, which is the largest advantage of bug bounty programs — and why Tarsnap's bug bounty program offers rewards for all bugs, not just those which turn out to be vulnerabilities.
And if you have code which handles fluctuating exchange rates... now might be a good time to double-check that you're always using the right exchange rates.
EC2's most dangerous feature
As a FreeBSD developer — and someone who writes in C — I believe strongly in the idea of "tools, not policy". If you want to shoot yourself in the foot, I'll help you deliver the bullet to your foot as efficiently and reliably as possible. UNIX has always been built around the idea that systems administrators are better equipped to figure out what they want than the developers of the OS, and it's almost impossible to prevent foot-shooting without also limiting useful functionality. The most powerful tools are inevitably dangerous, and often the best solution is to simply ensure that they come with sufficient warning labels attached; but occasionally I see tools which not only lack important warning labels, but are also designed in a way which makes them far more dangerous than necessary. Such a case is IAM Roles for Amazon EC2.
A review for readers unfamiliar with this feature: Amazon IAM (Identity and Access Management) is a service which allows for the creation of access credentials which are limited in scope; for example, you can have keys which can read objects from Amazon S3 but cannot write any objects. IAM Roles for EC2 are a mechanism for automatically creating such credentials and distributing them to EC2 instances; you specify a policy and launch an EC2 instance with that Role attached, and magic happens making time-limited credentials available via the EC2 instance metadata. This simplifies the task of creating and distributing credentials and is very convenient; I use it in my FreeBSD AMI Builder AMI, for example. Despite being convenient, there are two rather scary problems with this feature which severely limit the situations where I'd recommend using it.
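Before getting to those problems, here is a sketch of what the convenience looks like in practice, assuming the boto3 library and a Role which permits S3 reads; the bucket and object names are placeholders. Code running on such an instance never needs to see a key pair at all:

    # On an EC2 instance launched with an IAM Role attached, boto3 finds
    # temporary credentials automatically via the instance metadata service;
    # nothing needs to be configured on the instance itself.
    import boto3

    s3 = boto3.client("s3")        # no access key, no secret, no config file
    s3.download_file("example-bucket", "example-object", "/tmp/example-object")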
The first problem is one of configuration: The language used to specify IAM Policies is not sufficient to allow for EC2 instances to be properly limited in their powers. For example, suppose you want to allow EC2 instances to create, attach, detach, and delete Elastic Block Store volumes automatically — useful if you want to have filesystems automatically scaling up and down depending on the amount of data which they contain. The obvious way to do this would be to "tag" the volumes belonging to an EC2 instance and provide a Role which can only act on volumes tagged to the instance where the Role was provided; while the second part of this (limiting actions to tagged volumes) seems to be possible, there is no way to require specific API call parameters on all permitted CreateVolume calls, as would be necessary to require that a tag is applied to any new volumes being created by the instance. (There also seems to be a gap in the CreateVolume API call, in that it is documented as returning a list of tags assigned to the newly created volume, but does not advertise support for assigning tags as part of the process of creating a volume; but that at least could be easily fixed, and I'm not even sure if this is a failing in the API call or in the documentation of the API call.) The difficulty, and sometimes impossibility, of writing appropriately fine-grained IAM Policies results in a proliferation of overbroad policies; for example, the pre-written Policy which Amazon provides for allowing the Simple Systems Manager agent to make the API calls it requires (AmazonEC2RoleforSSM) permits it to GET and PUT any object in S3 — a terrifying proposition if you store data in S3 which you don't want to see made available to all of your "managed" EC2 instances.
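To make the gap concrete, here is a sketch of the sort of Policy I would like to be able to attach, written with boto3. The role and tag names are invented, and the condition key reflects my reading of the IAM documentation rather than anything I have verified: the destructive operations can apparently be scoped to tagged volumes, but nothing can compel CreateVolume to apply the tag in the first place.

    import json
    import boto3

    ROLE_NAME = "ebs-autoscaler-role"       # hypothetical role name
    TAG_KEY, TAG_VALUE = "managed-by", "ebs-autoscaler"

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # This part seems to be expressible: attach, detach, and
                # delete only volumes carrying our tag.
                "Effect": "Allow",
                "Action": ["ec2:AttachVolume", "ec2:DetachVolume",
                           "ec2:DeleteVolume"],
                "Resource": "arn:aws:ec2:*:*:volume/*",
                "Condition": {
                    "StringEquals": {"ec2:ResourceTag/" + TAG_KEY: TAG_VALUE}
                },
            },
            {
                # This part is the problem: CreateVolume has to be allowed
                # unconditionally, because there is no way to require that
                # the tag be applied to the volumes being created.
                "Effect": "Allow",
                "Action": ["ec2:CreateVolume"],
                "Resource": "*",
            },
        ],
    }

    boto3.client("iam").put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="ebs-autoscaling",
        PolicyDocument=json.dumps(policy),
    )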
As problematic as the configuration is, a far larger problem with IAM Roles for Amazon EC2 is access control — or, to be more precise, the lack thereof. As I mentioned earlier, IAM Role credentials are exposed to EC2 instances via the EC2 instance metadata system: In other words, they're available from http://169.254.169.254/. (I presume that the "EC2ws" HTTP server which responds is running in another Xen domain on the same physical hardware, but that implementation detail is unimportant.) This makes the credentials easy for programs to obtain... unfortunately, too easy for programs to obtain. UNIX is designed as a multi-user operating system, with multiple users and groups and permission flags and often even more sophisticated ACLs — but there are very few systems which control the ability to make outgoing HTTP requests. We write software which relies on privilege separation to reduce the likelihood that a bug will result in a full system compromise; but if a process which is running as user nobody and chrooted into /var/empty is still able to fetch AWS keys which can read every one of the objects you have stored in S3, do you really have any meaningful privilege separation? To borrow a phrase from Ted Unangst, the way that IAM Roles expose credentials to EC2 instances makes them a very effective exploit mitigation mitigation technique.
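As a sketch of what that means in practice: the script below assumes it starts with root privileges, the way a privilege-separating daemon would, and then drops them; the metadata paths are the standard EC2 ones, and the Role is whatever happens to be attached to the instance.

    # Even a process which chroots into an empty directory and drops to an
    # unprivileged user can still fetch the Role credentials, because the
    # only "access control" is the ability to make an HTTP request.
    import os
    import urllib.request

    BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

    os.chroot("/var/empty")        # no filesystem access beyond an empty dir
    os.chdir("/")
    os.setgid(65534)               # drop to the unprivileged nobody ids
    os.setuid(65534)

    role = urllib.request.urlopen(BASE).read().decode().strip()
    creds = urllib.request.urlopen(BASE + role).read().decode()
    print(creds)                   # JSON with AccessKeyId, SecretAccessKey, Token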
To make it worse, exposing credentials — and other metadata, for that matter — via HTTP is completely unnecessary. EC2 runs on Xen, which already has a perfectly good key-value data store for conveying metadata between the host and guest instances. It would be absolutely trivial for Amazon to place EC2 metadata, including IAM credentials, into XenStore; and almost as trivial for EC2 instances to expose XenStore as a filesystem to which standard UNIX permissions could be applied, providing IAM Role credentials with the full range of access control functionality which UNIX affords to files stored on disk. Of course, there is a lot of code out there which relies on fetching EC2 instance metadata over HTTP, and trivial or not it would still take time to write code for pushing EC2 metadata into XenStore and exposing it via a filesystem inside instances; so even if someone at AWS reads this blog post and immediately says "hey, we should fix this", I'm sure we'll be stuck with the problems in IAM Roles for years to come.
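To illustrate what the XenStore-based alternative could look like from inside an instance, here is a purely hypothetical sketch; no such interface exists today, and the path and file format are invented, but once credentials are exposed as a root-owned file, ordinary UNIX permissions do the rest.

    # Hypothetical: if EC2 pushed IAM credentials into XenStore and the guest
    # exposed them as a file, standard file permissions would control which
    # users can read them.  The path and format below are invented.
    import json
    import os
    import stat

    CRED_PATH = "/var/run/ec2-metadata/iam-credentials.json"   # invented path

    st = os.stat(CRED_PATH)
    if st.st_uid != 0 or stat.S_IMODE(st.st_mode) != 0o600:
        raise RuntimeError("credentials file is not root-only; refusing to use it")

    with open(CRED_PATH) as f:
        creds = json.load(f)       # a process running as "nobody" gets EACCES instead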
So consider this a warning label: IAM Roles for EC2 may seem like a gun which you can use to efficiently and reliably shoot yourself in the foot; but in fact it's more like a gun which is difficult to aim and might be fired by someone on the other side of the room snapping his fingers. Handle with care!
FreeBSD/EC2 11.0-RELEASE
FreeBSD 11.0-RELEASE is just around the corner, and it will be bringing a long list of new features and improvements — far too many for me to list here. But as part of the release process, the FreeBSD release engineering team has built images for Amazon EC2, and as semi-official maintainer of that platform (I've never been appointed to this role, but I've been doing it for years and nobody has told me to stop...) I think there are some improvements in FreeBSD 11.0 which are particularly noteworthy for EC2 users.
First, the EC2 Console Screenshot functionality now works with FreeBSD. This provides a "VGA" output as opposed to the traditional "serial port" which EC2 has exposed as "console output" for the past decade, and is useful largely because the "VGA" output becomes available immediately whereas the "serial port" output can lag by several minutes. This improvement is a simple configuration change — older releases didn't waste time writing to a non-serial console because it didn't go anywhere until Amazon added support on their side — and can be enabled on older FreeBSD releases by changing the line console="comconsole" to boot_multicons="YES" in /boot/loader.conf.
The second notable change is support for EC2 "Enhanced Networking" using Intel 82599 hardware; on the C3, C4, R3, I2, D2, and M4 (excluding m4.16xlarge) families, this provides increased network throughput and reduced latency and jitter, since it allows FreeBSD to talk directly to the networking hardware rather than via a Xen paravirtual interface. Getting this working took much longer than I had hoped, but the final problem turned out not to be in FreeBSD at all — we were tickling an interrupt-routing bug in a version of Xen used in EC2. Unfortunately FreeBSD does not yet have support for the new "Elastic Network Adapter" enhanced networking used in P2 and X1 instance families and the m4.16xlarge instance type; I'm hoping that we'll have a driver for that before FreeBSD 11.1 arrives.
The third notable change is an improvement in EC2 disk throughput. This comes thanks to enabling indirect segment I/Os in FreeBSD's blkfront driver; while the support was present in 10.3, I had it turned off by default due to performance anomalies on some EC2 instances. (Those EC2 performance problems have been resolved, and disk I/O performance in EC2 on FreeBSD 10.3 can now be safely improved by removing the line hw.xbd.xbd_enable_indirect="0" from /boot/loader.conf.)
Finally, FreeBSD now supports all 128 CPUs in the x1.32xlarge instance type. This improvement comes thanks to two changes: The FreeBSD default kernel was modified in 2014 to support up to 256 CPUs (up from 64), but that change resulted in a fixed-size region of preallocated memory being exhausted early in the boot process on systems with 92 or more CPUs; a few months ago I changed that value to be tuned automatically, so FreeBSD can now boot out of the box on such large systems without immediately panicking.
I think FreeBSD/EC2 users will be very happy with FreeBSD 11.0-RELEASE; but I'd like to end with an important reminder: No matter what you might see on FTP servers, in EC2, or available via freebsd-update, the new release has not been released until you see a GPG-signed email from the release engineer. This is not just a theoretical point: In my time as a FreeBSD developer I've seen multiple instances of last-minute release re-rolls happening due to problems being discovered very late, so the fact that you can see bits doesn't necessarily mean that they are ready to be downloaded. I hope you're looking forward to 11.0-RELEASE, but please be patient.