Some notes on userspace routing
For reasons which will be immediately apparent to anyone who has read my earlier blog post about the EC2 Instance Metadata Service (and its use by IAM Roles), I recently decided that I wanted to intercept outgoing IP packets which had a destination of 169.254.169.254; in some cases I want to redirect or block them, and in other cases I want them to proceed unimpeded. To make things harder, I had two more constraints:
- I don't want to write any new kernel code, since venturing into the kernel introduces a much wider range of potential adverse outcomes if my code is buggy, and
- I don't want to make use of firewalls, since users might have their own firewall rulesets which could conflict with EC2 IMDS-filtering rules; also, traversing a firewall, even one with a trivial ruleset, has a cost which can become nontrivial for the sort of high-bandwidth applications which FreeBSD excels at. (For the same reasons, I'm less than enthusiastic about the suggestion in Amazon's documentation that users consider using local firewall rules to restrict access to the Instance Metadata Service.)
While I like to consider myself an experienced FreeBSD developer, networking is not my area of expertise; so I spent a significant amount of time flailing wildly and reading often wildly-out-of-date documentation while trying to figure this out. In the hope of helping the next person who wants to do something like this, here are some notes about what worked and what didn't.
Redirecting traffic
I want outgoing network traffic to 169.254.169.254 to land at a daemon of my choice rather than passing through the external network interface, so my first thought was to redirect traffic into a FreeBSD jail with a virtualized network stack; this stack could use network address translation to rewrite the packets to send traffic to a daemon bound to a convenient IP address (say, 1.1.1.1). How do I get that traffic into a jail, though?
The immediately obvious answer here was to use an epair; after all, it is "a pair of network interfaces, connected with a virtual network cable". What I didn't realize for far too long is that since an epair is a pair of ethernet interfaces, it behaves like ethernet: it's a layer 2 connection, not a layer 3 connection, so you can't simply say "route these IP packets into that connection". Instead, you need to say "route these IP packets via that gateway which is on that connection"; while this could work, I decided that it was unnecessarily complicated.
Instead the solution I opted for was to create a tun device; this is a layer 3 endpoint, which IP packets can be routed into easily and then read by a process which opens the device node in question. Since I wanted to inspect packets from userland, this provided for a simple solution; I can drop packets, rewrite packets (e.g., to change their destination IPs), and I can inject them back into the system by writing them to a tun device.
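To make the "inspect, drop, or rewrite" step a little more concrete, here is a minimal sketch in Python (not the code I actually use). It assumes the tun device hands over bare IPv4 packets with no extra header, which I believe is the FreeBSD default; the device path /dev/tun0, the function names, and the on_packet callback are all made up for illustration.

```python
import os
import socket
import struct

IMDS_ADDR = "169.254.169.254"


def ip_checksum(header: bytes) -> int:
    """Ones'-complement checksum over an IPv4 header (always an even length)."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def destination(packet: bytes) -> str:
    """Return the destination address of a raw IPv4 packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return socket.inet_ntoa(packet[16:20])


def rewrite_destination(packet: bytes, new_dst: str) -> bytes:
    """Return a copy of the packet with a new destination address.

    Only the IPv4 header checksum is recomputed here. TCP and UDP
    checksums also cover the addresses, so a real rewriter would have
    to fix those up as well.
    """
    ihl = (packet[0] & 0x0F) * 4
    header = bytearray(packet[:ihl])
    header[16:20] = socket.inet_aton(new_dst)
    header[10:12] = b"\x00\x00"  # zero the checksum field before summing
    struct.pack_into("!H", header, 10, ip_checksum(bytes(header)))
    return bytes(header) + packet[ihl:]


def handle_packets(on_packet, path="/dev/tun0"):
    """Read packets from a tun device forever (hypothetical device path).

    on_packet(packet) returns bytes to write back to the device,
    or None to drop the packet.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        while True:
            packet = os.read(fd, 65535)
            reply = on_packet(packet)
            if reply is not None:
                os.write(fd, reply)
    finally:
        os.close(fd)
```

A callback passed to handle_packets could call destination() to check for IMDS_ADDR and then either return None (drop) or return the result of rewrite_destination().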
Getting packets out
The harder problem turned out to be one I hadn't even considered: Once I have outgoing packets to 169.254.169.254 being redirected to me, how do I allow some packets (say, EC2 Instance Metadata Service requests which come from root) to get out rather than simply coming right back to me?
My first thought was to have outgoing requests made from inside the aforementioned vnet jail. Indeed, if Amazon took the sensible approach of separating "control" and "data" network interfaces, this would have worked very well: I could have placed the "control" interface into the vnet jail where it could be easily segregated from potentially vulnerable code. Alas, Amazon exposes the Instance Metadata Service via the same network interface as regular IP traffic uses; and making requests to the service from a fictitious IP address (e.g. 1.1.1.1) doesn't work since EC2 (by default) filters outgoing packets using their source IP addresses. By default, you only get one IP address, and you can't make requests from any others.
My second attempt was to take packets via a tun interface and forward them out the external interface by using the SO_DONTROUTE socket option — effectively bypassing the routing rule which was redirecting packets which had a destination of 169.254.169.254. Unfortunately, FreeBSD's implementation of raw sockets doesn't respect SO_DONTROUTE; it determines the appropriate output interface for packets based on their destination IP addresses, regardless of any instructions to the contrary.
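For reference, the attempt described above looks roughly like the following sketch. The function name and its parameters are invented for illustration; the socket options themselves are the standard ones. Creating a raw socket needs root, and, as noted, on FreeBSD the SO_DONTROUTE setting did not change where raw-socket packets were sent.

```python
import socket


def open_dontroute_socket(kind=socket.SOCK_RAW, proto=socket.IPPROTO_RAW):
    """Open an IPv4 socket and ask the kernel to bypass the routing table.

    With the default arguments this is a raw socket (root required)
    on which the caller supplies complete IP headers.
    """
    s = socket.socket(socket.AF_INET, kind, proto)
    # Request that outgoing packets skip normal route lookup.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, 1)
    if kind == socket.SOCK_RAW:
        # We will write full IP packets, header included.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)
    return s
```

The option is accepted without complaint, which is part of why this took a while to diagnose: nothing fails, the packets simply follow the routing table anyway.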
My third thought was to send the packets from inside a vnet jail, in order to bypass the routing rule I had set up on the host, and then bridge a tap interface from the jail with the EC2 instance's external interface. This was able to send packets out of the interface with the right source IP addresses... but I ran into another problem, namely that EC2 also filters ethernet segments based on their source MAC addresses.
Attempt number four: Do the same thing, but set the MAC address on the jailed tap interface to match the external interface, in order to have segments pass EC2's filtering. This stumped me for a long time, but I eventually figured out why it wasn't working: If you bridge two interfaces, FreeBSD forwards ethernet segments between them not based on which interface the segment entered the bridge from, but rather based on the segment's source MAC address — so you can't "forge" outgoing segments via a bridge.
My fifth thought took me back to using raw sockets: Rather than trying to use SO_DONTROUTE with IP packets, could I use a raw socket to write ethernet segments? This was a short-lived hope: While some operating systems support SOCK_RAW with PF_LINK, FreeBSD does not; you can't use raw sockets to send ethernet segments.
Finally on my sixth attempt I figured out the right way to do this. While BPF is best known as a mechanism for reading packets from the network, it can also be used to write packets — in effect, BPF provides the "write raw ethernet segments to a network interface" support which FreeBSD doesn't implement with SOCK_RAW.
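Writing via BPF can be sketched as follows. This is an illustration under stated assumptions rather than my actual code: the BIOCSETIF value is what I believe _IOW('B', 108, struct ifreq) works out to on FreeBSD with a 32-byte struct ifreq, and should be checked against <net/bpf.h>; the function names are made up, and the caller has to supply the destination MAC (for example, that of the default gateway).

```python
import fcntl
import os
import struct

ETHERTYPE_IPV4 = 0x0800
# Assumed FreeBSD ioctl number for BIOCSETIF; verify against <net/bpf.h>.
BIOCSETIF = 0x8020426C


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes(int(part, 16) for part in text.split(":"))
    if len(raw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return raw


def build_frame(dst_mac: str, src_mac: str, ip_packet: bytes) -> bytes:
    """Wrap a raw IPv4 packet in an ethernet header."""
    return (parse_mac(dst_mac) + parse_mac(src_mac)
            + struct.pack("!H", ETHERTYPE_IPV4) + ip_packet)


def open_bpf(ifname: str, path: str = "/dev/bpf") -> int:
    """Open a BPF device and attach it to the named interface."""
    fd = os.open(path, os.O_RDWR)
    # struct ifreq: 16-byte interface name followed by 16 bytes of union.
    ifreq = struct.pack("16s16x", ifname.encode())
    fcntl.ioctl(fd, BIOCSETIF, ifreq)
    return fd


def send_packet(fd: int, dst_mac: str, src_mac: str, ip_packet: bytes) -> None:
    """Write one frame to the interface the BPF descriptor is attached to.

    Unless the "header complete" flag (BIOCSHDRCMPLT) is set, the kernel
    fills in the source MAC itself, so src_mac is only a placeholder here.
    """
    os.write(fd, build_frame(dst_mac, src_mac, ip_packet))
```

Since the frame goes straight to the interface, the routing table is never consulted, which is exactly the property the earlier attempts were missing.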
What next?
I now have code which can intercept outgoing connections to the EC2 Instance Metadata Service, forward some of the packets into a vnet jail, and forward other packets out the external interface. The rest is straightforward: Write a daemon which handles the intercepted connections, and decide which packets go in which direction.
With luck, I'll have something ready to demonstrate at AWS re:Invent. If you're interested, look for me there!