Some thoughts on Spectre and Meltdown
By now I imagine that all of my regular readers, and a large proportion of the rest of the world, have heard of the security issues dubbed "Spectre" and "Meltdown". While there have been some excellent technical explanations of these issues from several sources — I particularly recommend the Project Zero blog post — I have yet to see anyone really put these into a broader perspective; nor have I seen anyone make a serious attempt to explain these at a level suited for a wide audience. While I have not been involved with handling these issues directly, I think it's time for me to step up and provide both a wider context and a more broadly understandable explanation.

The story of these attacks starts in late 2004. I had submitted my doctoral thesis and had a few months before flying back to Oxford for my defense, so I turned to some light reading: Intel's latest "Optimization Manual", full of tips on how to write faster code. (Eking out every last nanosecond of performance has long been an interest of mine.) Here I found an interesting piece of advice: On Intel CPUs with "Hyper-Threading", a common design choice (aligning the top of thread stacks on page boundaries) should be avoided because it would result in some resources being overused and others being underused, with a resulting drop in performance. This started me thinking: If two programs can hurt each other's performance by accident, one should be able to measure whether its performance is being hurt by the other; if it can measure whether its performance is being hurt by people not following Intel's optimization guidelines, it should be able to measure whether its performance is being hurt by other patterns of resource usage; and if it can measure that, it should be able to make deductions about what the other program is doing.
It took me a few days to convince myself that information could be stolen in this manner, but within a few weeks I was able to steal an RSA private key from OpenSSL. Then started the lengthy process of quietly notifying Intel and all the major operating system vendors; and on Friday the 13th of May 2005 I presented my paper describing this new attack at BSDCan 2005 — the first attack of this type exploiting how a running program causes changes to the microarchitectural state of a CPU. Three months later, the team of Osvik, Shamir, and Tromer published their work, which showed how the same problem could be exploited to steal AES keys. (Note that there were side channel attacks discovered over the preceding years which relied on microarchitectural details; but in those cases information was being revealed by the time taken by the cryptographic operation in question. My work was the first to demonstrate that information could leak from a process into the microarchitectural state and then be extracted from there by another process.)
Over the following years there have been many attacks which exploit different aspects of CPU design — exploiting L1 data cache collisions, exploiting L1 code cache collisions, exploiting L2 cache collisions, exploiting the TLB, exploiting branch prediction, etc. — but they have all followed the same basic mechanism: A program does something which interacts with the internal state of a CPU, and either we can measure that internal state (the more common case) or we can set up that internal state before the program runs in a way which makes the program faster or slower. These new attacks use the same basic mechanism, but exploit an entirely new angle. But before I go into details, let me go back to basics for a moment.
Understanding the attacks
These attacks exploit something called a "side channel". What's a side channel? It's when information is revealed as an inadvertent side effect of what you're doing. For example, in the movie 2001, Bowman and Poole enter a pod to ensure that the HAL 9000 computer cannot hear their conversation — but fail to block the optical channel which allows HAL to read their lips. Side channels are related to a concept called "covert channels": Where side channels are about stealing information which was not intended to be conveyed, covert channels are about conveying information which someone is trying to prevent you from sending. The famous case of a Prisoner of War blinking the word "TORTURE" in Morse code is an example of using a covert channel to convey information.

Another example of a side channel — and I'll be elaborating on this example later, so please bear with me if it seems odd — is as follows: I want to know when my girlfriend's passport expires, but she won't show me her passport (she complains that it has a horrible photo) and refuses to tell me the expiry date. I tell her that I'm going to take her to Europe on vacation in August and watch what happens: If she runs out to renew her passport, I know that it will expire before August; while if she doesn't get her passport renewed, I know that it will remain valid beyond that date. Her desire to ensure that her passport would be valid inadvertently revealed to me some information: Whether its expiry date was before or after August.
Over the past 12 years, people have gotten reasonably good at writing programs which avoid leaking information via side channels; but as the saying goes, if you make something idiot-proof, the world will come up with a better idiot; in this case, the better idiot is newer and faster CPUs. The Spectre and Meltdown attacks make use of something called "speculative execution". This is a mechanism whereby, if a CPU isn't sure what you want it to do next, it will speculatively perform some action. The idea here is that if it guessed right, it will save time later — and if it guessed wrong, it can throw away the work it did and go back to doing what you asked for. As long as it sometimes guesses right, this saves time compared to waiting until it's absolutely certain about what it should be doing next. Unfortunately, as several researchers recently discovered, it can accidentally leak some information during this speculative execution.
Going back to my analogy: I tell my girlfriend that I'm going to take her on vacation in June, but I don't tell her where yet; however, she knows that it will either be somewhere within Canada (for which she doesn't need a passport, since we live in Vancouver) or somewhere in Europe. She knows that it takes time to get a passport renewed, so she checks her passport and (if it was about to expire) gets it renewed just in case I later reveal that I'm going to take her to Europe. If I tell her later that I'm only taking her to Ottawa — well, she didn't need to renew her passport after all, but in the meantime her behaviour has already revealed to me whether her passport was about to expire. This is what Google refers to as "variant 1" of the Spectre vulnerability: Even though she didn't need her passport, she made sure it was still valid just in case she was going to need it.
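For the technically-inclined, the code pattern behind variant 1 is tiny. The sketch below is in the spirit of the published examples — the array names and sizes here are invented, and a real attack would also need to flush the cache beforehand and time accesses to array2 afterwards to read out the leaked value — but it shows both the vulnerable shape and one common software mitigation, index masking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "victim" data and a probe array with one cache line
 * (64 bytes here) per possible byte value. */
static uint8_t array1[16] = {1, 2, 3, 4};
static size_t array1_size = 16;
static uint8_t array2[256 * 64];

/* Vulnerable pattern: while the bounds check is still being resolved, the
 * CPU may speculatively execute the body with an out-of-bounds x, leaving
 * a secret-dependent footprint in the cache via array2. */
uint8_t victim_vulnerable(size_t x)
{
	if (x < array1_size)
		return (array2[array1[x] * 64]);
	return (0);
}

/* Mitigated pattern: clamp the index with a branchless mask so that even
 * a mispredicted branch cannot produce an out-of-bounds address.  (A real
 * implementation may also need compiler barriers to keep the mask from
 * being optimized away.) */
uint8_t victim_masked(size_t x)
{
	if (x < array1_size) {
		/* All-ones when x < array1_size, all-zeroes otherwise. */
		size_t mask = (size_t)0 - (size_t)(x < array1_size);

		return (array2[array1[x & mask] * 64]);
	}
	return (0);
}
```

Even when the branch is mispredicted, the masked version can only ever reach array1[0]; this is the same idea as the index-masking mitigations being deployed in kernels and browsers.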
"Variant 2" of the Spectre vulnerability also relies on speculative execution but in a more subtle way. Here, instead of the CPU knowing that there are two possible execution paths and choosing one (or potentially both!) to speculatively execute, the CPU has no idea what code it will need to execute next. However, it has been keeping track and knows what it did the last few times it was in the same position, and it makes a guess — after all, there's no harm in guessing since if it guesses wrong it can just throw away the unneeded work. Continuing our analogy, a "Spectre version 2" attack on my girlfriend would be as follows: I spend a week talking about how Oxford is a wonderful place to visit and I really enjoyed the years I spent there, and then I tell her that I want to take her on vacation. She very reasonably assumes that — since I've been talking about Oxford so much — I must be planning on taking her to England, and runs off to check her passport and potentially renew it... but in fact I tricked her and I'm only planning on taking her to Ottawa.
This "version 2" attack is far more powerful than "version 1" because it can be used to exploit side channels present in many different locations; but it is also much harder to exploit and depends intimately on details of CPU design, since the attacker needs to train the CPU into making precisely the (wrong) guess the attacker wants about where it will be going next.
Now we get to the third attack, dubbed "Meltdown". This one is a bit weird, so I'm going to start with the analogy here: I tell my girlfriend that I want to take her to the Korean peninsula. She knows that her passport is valid for long enough; but she immediately runs off to check that her North Korean visa hasn't expired. Why does she have a North Korean visa, you ask? Good question. She doesn't — but she runs off to check its expiry date anyway! Because she doesn't have a North Korean visa, she (somehow) checks the expiry date on someone else's North Korean visa, and then (if it is about to expire) runs out to renew it — and so by telling her that I want to take her to Korea for a vacation I find out something she couldn't have told me even if she wanted to. If this sounds like we're falling down a Dodgsonian rabbit hole... well, we are. The most common reaction I've heard from security people about this is "Intel CPUs are doing what???", and it's not by coincidence that one of the names suggested for an early Linux patch was Forcefully Unmap Complete Kernel With Interrupt Trampolines (FUCKWIT). (For the technically-inclined: Intel CPUs continue speculative execution through faults, so the fact that a page of memory cannot be accessed does not prevent it from, well, being accessed.)
How users can protect themselves
So that's what these vulnerabilities are all about; but what can regular users do to protect themselves? To start with, apply the damn patches. For the next few months there are going to be patches to operating systems; patches to individual applications; patches to phones; patches to routers; patches to smart televisions... if you see a notification saying "there are updates which need to be installed", install the updates. (However, this doesn't mean that you should be stupid: If you get an email saying "click here to update your system", it's probably malware.) These attacks are complicated, and need to be fixed in many ways in many different places, so each individual piece of software may have many patches as the authors work their way through from fixing the most easily exploited vulnerabilities to the more obscure theoretical weaknesses.

What else can you do? Understand the implications of these vulnerabilities. Intel caught some undeserved flak for stating that they believe "these exploits do not have the potential to corrupt, modify or delete data"; in fact, they're quite correct in a direct sense, and this distinction is very relevant. A side channel attack inherently reveals information, but it does not by itself allow someone to take control of a system. (In some cases side channels may make it easier to take advantage of other bugs, however.) As such, it's important to consider what information could be revealed: Even if you're not working on top secret plans for responding to a ballistic missile attack, you've probably accessed password-protected websites (Facebook, Twitter, Gmail, perhaps your online banking...) and possibly entered your credit card details somewhere today. Those passwords and credit card numbers are what you should worry about.
Now, in order for you to be attacked, some code needs to run on your computer. The most likely vector for such an attack is through a website — and the more shady the website the more likely you'll be attacked. (Why? Because if the owners of a website are already doing something which is illegal — say, selling fake prescription drugs — they're far more likely to agree if someone offers to pay them to add some "harmless" extra code to their site.) You're not likely to get attacked by visiting your bank's website; but if you make a practice of visiting the less reputable parts of the World Wide Web, it's probably best to not log in to your bank's website at the same time. Remember, this attack won't allow someone to take over your computer — all they can do is get access to information which is in your computer's memory at the time they carry out the attack.
For greater paranoia, avoid accessing suspicious websites after you handle any sensitive information (including accessing password-protected websites or entering your credit card details). It's possible for this information to linger in your computer's memory even after it isn't needed — it will stay there until it's overwritten, usually because the memory is needed for something else — so if you want to be safe you should reboot your computer in between.
For maximum paranoia: Don't connect to the internet from systems you care about. In the industry we refer to "airgapped" systems; this is a reference back to the days when connecting to a network required wires, so if there was a literal gap with just air between two systems, there was no way they could communicate. These days, with ubiquitous wifi (and in many devices, access to mobile phone networks) the terminology is in need of updating; but if you place devices into "airplane" mode it's unlikely that they'll be at any risk. Mind you, they won't be nearly as useful — there's almost always a tradeoff between security and usability, but if you're handling something really sensitive, you may want to consider this option. (For my Tarsnap online backup service I compile and cryptographically sign the packages on a system which has never been connected to the Internet. Before I turned it on for the first time, I opened up the case and pulled out the wifi card; and I copy files on and off the system on a USB stick. Tarsnap's slogan, by the way, is "Online backups for the truly paranoid".)
How developers can protect everyone
The patches being developed and distributed by operating systems — including microcode updates from Intel — will help a lot, but there are still steps individual developers can take to reduce the risk of their code being exploited.

First, practice good "cryptographic hygiene": Information which isn't in memory can't be stolen this way. If you have a set of cryptographic keys, load only the keys you need for the operations you will be performing. If you take a password, use it as quickly as possible and then immediately wipe it from memory. This isn't always possible, especially if you're using a high level language which doesn't give you access to low level details of pointers and memory allocation; but there's at least a chance that it will help.
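As an illustration of the "wipe it as soon as you're done" point, here's a minimal C sketch of a best-effort wipe routine; a plain memset() at the end of a function is liable to be optimized away by the compiler as a "dead store", which is exactly what you don't want:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Best-effort wipe: writing through a volatile pointer makes it much
 * harder for the compiler to elide the stores.  Where available, prefer
 * a dedicated primitive such as explicit_bzero() (FreeBSD, recent glibc)
 * or memset_s() (C11 Annex K, optional). */
static void
secure_wipe(void *buf, size_t len)
{
	volatile unsigned char *p = buf;

	while (len--)
		*p++ = 0;
}
```

Usage is the obvious one: copy the password in, use it, and call secure_wipe() on the buffer the moment it's no longer needed.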
Second, offload sensitive operations — especially cryptographic operations — to other processes. The security community has become more aware of privilege separation over the past two decades; but we need to go further than this, to separation of information — even if two processes need exactly the same operating system permissions, it can be valuable to keep them separate in order to avoid information from one process leaking via a side channel attack against the other.
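Here's a deliberately toy C sketch of that kind of separation. The "MAC" is just an XOR standing in for a real cryptographic operation, the key is a made-up constant, and a real design would keep one long-lived key-holding process rather than forking per request — but the point survives: only the child process ever has the secret in its address space.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy stand-in for a real cryptographic operation using a secret key. */
static uint8_t
toy_mac(uint8_t secret, uint8_t msg)
{

	return ((uint8_t)(secret ^ msg));
}

/* Fork a short-lived "key holder" and ask it to MAC one byte.  A side
 * channel attack against the parent can't leak the key, because the key
 * never exists in the parent's address space. */
static int
query_keyholder(uint8_t msg, uint8_t *out)
{
	int sv[2];
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	if ((pid = fork()) < 0)
		return (-1);
	if (pid == 0) {
		/* Child: the only process holding the (hypothetical) key. */
		const uint8_t secret = 0x5a;
		uint8_t m, r;

		close(sv[0]);
		if (read(sv[1], &m, 1) == 1) {
			r = toy_mac(secret, m);
			(void)write(sv[1], &r, 1);
		}
		_exit(0);
	}
	close(sv[1]);
	if (write(sv[0], &msg, 1) != 1 || read(sv[0], out, 1) != 1) {
		close(sv[0]);
		return (-1);
	}
	close(sv[0]);
	waitpid(pid, NULL, 0);
	return (0);
}
```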
One common design paradigm I've seen recently is to "TLS all the things", with a wide range of applications gaining understanding of the TLS protocol layer. This is something I've objected to in the past as it results in unnecessary exposure of applications to vulnerabilities in the TLS stacks they use; side channel attacks provide another reason, namely the unnecessary exposure of the TLS stack to side channels in the application. If you want to add TLS to your application, don't add it to the application itself; rather, use a separate process to wrap and unwrap connections with TLS, and have your application take unencrypted connections over a local (unix) socket or a loopback TCP/IP connection.
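As a sketch of what this looks like in practice, here's a hypothetical stunnel configuration (the paths and ports are invented for illustration) which terminates TLS in a separate process and hands the application plaintext over a loopback connection:

```ini
; TLS lives in the stunnel process, not in the application.
cert = /usr/local/etc/ssl/server.pem
key = /usr/local/etc/ssl/server.key

[wrap-https]
; Accept TLS connections from the outside world...
accept = 443
; ...and pass the decrypted stream to the application on loopback.
connect = 127.0.0.1:8080
```

The application then listens only on 127.0.0.1:8080 and never links against a TLS stack at all.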
Separating code into multiple processes isn't always practical, however, for reasons of both performance and practical matters of code design. I've been considering (since long before these issues became public) another form of mitigation: Userland page unmapping. In many cases programs have data structures which are "private" to a small number of source files; for example, a random number generator will have internal state which is only accessed from within a single file (with appropriate functions for inputting entropy and outputting random numbers), and a hash table library would have a data structure which is allocated, modified, accessed, and finally freed only by that library via appropriate accessor functions. If these memory allocations can be corralled into a subset of the system address space, and the pages in question only mapped upon entering those specific routines, it could dramatically reduce the risk of information being revealed as a result of vulnerabilities which — like these side channel attacks — are limited to leaking information but cannot be (directly) used to execute arbitrary code.
Finally, developers need to get better at providing patches: Not just to get patches out promptly, but also to get them into users' hands and to convince users to install them. That last part requires building up trust; as I wrote last year, one of the worst problems facing the industry is the mixing of security and non-security updates. If users are worried that they'll lose features (or gain "features" they don't want), they won't install the updates you recommend; it's essential to give users the option of getting security patches without worrying about whether anything else they rely upon will change.
What's next?
So far we've seen three attacks demonstrated: Two variants of Spectre and one form of Meltdown. Get ready to see more over the coming months and years. Off the top of my head, there are four vulnerability classes I expect to see demonstrated before long:

- Attacks on p-code interpreters. Google's "Variant 1" demonstrated an attack where a conditional branch was mispredicted resulting in a bounds check being bypassed; but the same problem could easily occur with mispredicted branches in a switch statement resulting in the wrong operation being performed on a valid address. On p-code machines which have an opcode for "jump to this address, which contains machine code" (not entirely unlikely in the case of bytecode machines which automatically transpile "hot spots" into host machine code), this could very easily be exploited as a "speculatively execute attacker-provided code" mechanism.
- Structure deserializing. This sort of code handles attacker-provided inputs which often include the lengths or numbers of fields in a structure, along with bounds checks to ensure the validity of the serialized structure. This is prime territory for a CPU to speculatively reach past the end of the input provided if it mispredicts the layout of the structure.
- Decompressors, especially in HTTP(S) stacks. Data decompression inherently involves a large number of steps of "look up X in a table to get the length of a symbol, then adjust pointers and perform more memory accesses" — exactly the sort of behaviour which can leak information via cache side channels if a branch mispredict results in X being speculatively looked up in the wrong table. Add attacker-controlled inputs to HTTP stacks and the fact that services speaking HTTP are often required to perform request authentication and/or include TLS stacks, and you have all the conditions needed for sensitive information to be leaked.
- Remote attacks. As far as I'm aware, all of the microarchitectural side channels demonstrated over the past 14 years have made use of "attack code" running on the system in question to observe the state of the caches or other microarchitectural details in order to extract the desired data. This makes attacks far easier, but should not be considered to be a prerequisite! Remote timing attacks are feasible, and I am confident that we will see a demonstration of "innocent" code being used for the task of extracting the microarchitectural state information before long. (Indeed, I think it is very likely that certain people are already making use of such remote microarchitectural side channel attacks.)
Final thoughts on vulnerability disclosure
The way these issues were handled was a mess; frankly, I expected better of Google, I expected better of Intel, and I expected better of the Linux community. When I found that Hyper-Threading was easily exploitable, I spent five months notifying the security community and preparing everyone for my announcement of the vulnerability; but when the embargo ended at midnight UTC and FreeBSD published its advisory a few minutes later, the broader world was taken entirely by surprise. Nobody knew what was coming aside from the people who needed to know; and the people who needed to know had months of warning.

Contrast that with what happened this time around. Google discovered a problem and reported it to Intel, AMD, and ARM on June 1st. Did they then go around contacting all of the operating systems which would need to work on fixes for this? Not even close. FreeBSD was notified the week before Christmas, over six months after the vulnerabilities were discovered. Now, FreeBSD can occasionally respond very quickly to security vulnerabilities, even when they arise at inconvenient times — on November 30th 2009 a vulnerability was reported at 22:12 UTC, and on December 1st I provided a patch at 01:20 UTC, barely over 3 hours later — but that was an extremely simple bug which needed only a few lines of code to fix; the Spectre and Meltdown issues are orders of magnitude more complex.
To make things worse, the Linux community was notified and couldn't keep their mouths shut. Standard practice for multi-vendor advisories like this is that an embargo date is set, and nobody does anything publicly prior to that date. People don't publish advisories; they don't commit patches into their public source code repositories; and they definitely don't engage in arguments on public mailing lists about whether the patches are needed for different CPUs. As a result, despite an embargo date being set for January 9th, by January 4th anyone who cared knew about the issues and there was code being passed around on Twitter for exploiting them.
This is not the first time I've seen people get sloppy with embargoes recently, but it's by far the worst case. As an industry we pride ourselves on the concept of responsible disclosure — ensuring that people are notified in time to prepare fixes before an issue is disclosed publicly — but in this case there was far too much disclosure and nowhere near enough responsibility. We can do better, and I sincerely hope that next time we do.