Introduction
Networks are fun to work with, but often they are also a source of trouble. Network troubleshooting can be difficult, and reproducing the bad behavior that is happening in the field can be painful as well.
Luckily, there are some tools that come to our aid: network namespaces, virtual machines, tc, and netfilter. Simple network setups can be reproduced with network namespaces and veth devices, while more complex setups require interconnecting virtual machines with a software bridge and using standard networking tools, like iptables or tc, to simulate the bad behavior. If you have an issue with the ICMP replies generated because an SSH server is down, running iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable in the correct namespace or VM can do the trick.
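As a reference, a minimal two-namespace setup joined by a veth pair, where a rule like that could be applied, might look like the following sketch (the namespace names, device names, and addresses are arbitrary):
# ip netns add ns1
# ip netns add ns2
# ip link add veth1 type veth peer name veth2
# ip link set veth1 netns ns1
# ip link set veth2 netns ns2
# ip -n ns1 addr add 10.0.0.1/24 dev veth1
# ip -n ns2 addr add 10.0.0.2/24 dev veth2
# ip -n ns1 link set veth1 up
# ip -n ns2 link set veth2 up
# ip netns exec ns2 iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable
From ns1, an SSH connection to 10.0.0.2 will then be rejected with the ICMP host-unreachable reply mentioned above.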
This article describes using eBPF (extended BPF), an extended version of the Berkeley Packet Filter, to troubleshoot complex network issues. eBPF is a fairly new technology and the project is still in an early stage, with documentation and the SDK not yet ready. But that should improve, especially with XDP (eXpress Data Path) being shipped in Red Hat Enterprise Linux 8, which you can download and run now.
While eBPF is not a silver bullet, I think it is a very powerful tool for network debugging and it deserves attention. I am sure it will play a really important role in the future of networks.
The problem
I was debugging an Open vSwitch (OVS) network issue affecting a very complex installation: some TCP packets were scrambled and delivered out of order, and the throughput between VMs was dropping from a sustained 6 Gb/s to an oscillating 2–4 Gb/s. After some analysis, it turned out that the first TCP packet of every connection with the PSH flag set was sent out of order: only the first one, and only one per connection.
I tried to replicate the setup with two VMs, and after many man pages and internet searches, I discovered that neither iptables nor nftables can mangle TCP flags, while tc can, but only by overwriting all the flags, which breaks new connections and TCP in general.
I probably could have dealt with it using some combination of iptables marks, conntrack, and tc, but then I thought: this could be a job for eBPF.
What is eBPF?
eBPF is an extended version of the Berkeley Packet Filter. It adds many improvements to BPF; most notably, it allows writing to memory instead of just reading it, so programs can edit packets in addition to filtering them.
eBPF is often referred to simply as BPF, while classic BPF is referred to as cBPF, so the term BPF can mean either, depending on the context: here, I'm always referring to the extended version.
Under the hood, eBPF uses a very simple bytecode VM that can execute small portions of bytecode and edit some in-memory buffers. eBPF comes with some limitations, to prevent it from being used maliciously:
- Loops are forbidden, so the program is guaranteed to finish in a bounded time.
- It can't access memory other than the stack and a scratch buffer.
- Only kernel functions in a whitelist can be called.
The program can be loaded into the kernel in many ways, enabling a plethora of debugging and tracing use cases. In this case, we are interested in how eBPF works with the networking subsystem. There are two ways to use an eBPF program:
- Attached via XDP to the very early RX path of a physical or virtual NIC
- Attached via tc to a qdisc, just like a normal action, in ingress or egress
To create an eBPF program to attach, it is enough to write some C code and convert it into bytecode. Below is a simple example using XDP:
SEC("prog") int xdp_main(struct xdp_md *ctx) { void *data_end = (void *)(uintptr_t)ctx->data_end; void *data = (void *)(uintptr_t)ctx->data; struct ethhdr *eth = data; struct iphdr *iph = (struct iphdr *)(eth + 1); struct icmphdr *icmph = (struct icmphdr *)(iph + 1); /* sanity check needed by the eBPF verifier */ if (icmph + 1 > data_end) return XDP_PASS; /* matched a pong packet */ if (eth->h_proto != ntohs(ETH_P_IP) || iph->protocol != IPPROTO_ICMP || icmph->type != ICMP_ECHOREPLY) return XDP_PASS; if (iph->ttl) { /* save the old TTL to recalculate the checksum */ uint16_t *ttlproto = (uint16_t *)&iph->ttl; uint16_t old_ttlproto = *ttlproto; /* set the TTL to a pseudorandom number 1 < x < TTL */ iph->ttl = bpf_get_prandom_u32() % iph->ttl + 1; /* recalculate the checksum; otherwise, the IP stack will drop it */ csum_replace2(&iph->check, old_ttlproto, *ttlproto); } return XDP_PASS; } char _license[] SEC("license") = "GPL";
The snippet above, stripped of include statements, helpers, and all the unnecessary code, is an XDP program that changes the TTL of received ICMP echo replies, namely pongs, to a random value. The main function receives a struct xdp_md, which contains two pointers to the packet start and end.
To compile our code into eBPF bytecode, a compiler with support for it is needed. Clang supports it and produces eBPF bytecode when bpf is specified as the target at compile time:
$ clang -O2 -target bpf -c xdp_manglepong.c -o xdp_manglepong.o
The command above produces a file that looks like a regular object file, but if inspected, you'll see that the reported machine type is Linux BPF rather than the native one of the OS:
$ readelf -h xdp_manglepong.o
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Linux BPF   <--- HERE
  [...]
Once wrapped in a regular object file, the eBPF program is ready to be loaded and attached to the device via XDP. This can be done using ip, from the iproute2 suite, using the following syntax:
# ip -force link set dev wlan0 xdp object xdp_manglepong.o verbose
This command specifies the target interface wlan0 and, with the -force option, overwrites any eBPF code already loaded. After loading the eBPF bytecode, this is the system behavior:
$ ping -c10 192.168.85.1
PING 192.168.85.1 (192.168.85.1) 56(84) bytes of data.
64 bytes from 192.168.85.1: icmp_seq=1 ttl=41 time=0.929 ms
64 bytes from 192.168.85.1: icmp_seq=2 ttl=7 time=0.954 ms
64 bytes from 192.168.85.1: icmp_seq=3 ttl=17 time=0.944 ms
64 bytes from 192.168.85.1: icmp_seq=4 ttl=64 time=0.948 ms
64 bytes from 192.168.85.1: icmp_seq=5 ttl=9 time=0.803 ms
64 bytes from 192.168.85.1: icmp_seq=6 ttl=22 time=0.780 ms
64 bytes from 192.168.85.1: icmp_seq=7 ttl=32 time=0.847 ms
64 bytes from 192.168.85.1: icmp_seq=8 ttl=50 time=0.750 ms
64 bytes from 192.168.85.1: icmp_seq=9 ttl=24 time=0.744 ms
64 bytes from 192.168.85.1: icmp_seq=10 ttl=42 time=0.791 ms

--- 192.168.85.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 125ms
rtt min/avg/max/mdev = 0.744/0.849/0.954/0.082 ms
Every received packet goes through the eBPF program, which can apply some transformation and then decides whether to drop the packet or let it pass.
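To check that the program is attached, and to detach it once the test is over, ip link can be used again. The following is a quick sketch, assuming the same wlan0 interface as above; ip link show should report the id of the attached XDP program in its output, and the xdp off option detaches it:
$ ip link show dev wlan0
# ip link set dev wlan0 xdp off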
How eBPF can help
Going back to the original network issue, I needed to mangle some TCP flags, only one per connection, and neither iptables nor tc allows doing that. Writing C code for this scenario is very easy: set up two VMs linked by an OVS bridge and simply attach the eBPF program to the virtual device of one of the two VMs.
This looks like a nice solution, but you must take into account that XDP only supports handling of received packets, and attaching eBPF in the rx path of the receiving VM will have no effect on the switch.
To properly address this, eBPF has to be loaded using tc and attached in the egress path within the VM, as tc can load and attach eBPF programs to a qdisc just like any other action. In order to mangle packets leaving the host, an egress qdisc is needed to attach eBPF to.
There are small differences between the XDP and tc APIs when loading an eBPF program: the default section name differs, the argument of the main function has a different structure type, and the return values are different, but this is not a big issue. Below is a snippet of a program that does the TCP mangling when attached as a tc action:
#define RATIO 10

SEC("action")
int bpf_main(struct __sk_buff *skb)
{
    void *data = (void *)(uintptr_t)skb->data;
    void *data_end = (void *)(uintptr_t)skb->data_end;

    struct ethhdr *eth = data;
    struct iphdr *iph = (struct iphdr *)(eth + 1);
    struct tcphdr *tcphdr = (struct tcphdr *)(iph + 1);

    /* sanity check needed by the eBPF verifier */
    if ((void *)(tcphdr + 1) > data_end)
        return TC_ACT_OK;

    /* skip non-TCP packets */
    if (eth->h_proto != __constant_htons(ETH_P_IP) || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    /* incompatible flags, or PSH already set */
    if (tcphdr->syn || tcphdr->fin || tcphdr->rst || tcphdr->psh)
        return TC_ACT_OK;

    if (bpf_get_prandom_u32() % RATIO == 0)
        tcphdr->psh = 1;

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
Compilation into bytecode is done just as in the XDP example:
$ clang -O2 -target bpf -c tcp_psh.c -o tcp_psh.o
But the loading is different:
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress matchall action bpf object-file tcp_psh.o
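To double-check that the filter and its eBPF action are actually in place, tc can list them. A quick sketch, using the same eth0 interface as above:
# tc filter show dev eth0 egress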
At this point, eBPF is loaded in the right place and packets leaving the VM are mangled. By checking the packets received on the second VM, you can see the following:
# tcpdump -tnni eth0 -Q in
[1579537.890082] device eth0 entered promiscuous mode
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 809667041:809681521, ack 3046223642, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 14480:43440, ack 1, length 28960
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 43440:101360, ack 1, length 57920
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [P.], seq 101360:131072, ack 1, length 29712
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 131072:145552, ack 1, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 145552:174512, ack 1, length 28960
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 174512:210712, ack 1, length 36200
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 210712:232432, ack 1, length 21720
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 232432:246912, ack 1, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [P.], seq 246912:262144, ack 1, length 15232
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 262144:276624, ack 1, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 276624:305584, ack 1, length 28960
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 305584:363504, ack 1, length 57920
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [P.], seq 363504:393216, ack 1, length 29712
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 393216:407696, ack 1, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 407696:436656, ack 1, length 28960
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 436656:494576, ack 1, length 57920
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [P.], seq 494576:524288, ack 1, length 29712
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 524288:538768, ack 1, length 14480
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 538768:567728, ack 1, length 28960
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 567728:625648, ack 1, length 57920
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [.], seq 625648:627096, ack 1, length 1448
IP 192.168.123.1.39252 > 192.168.123.2.5201: Flags [P.], seq 627096:655360, ack 1, length 28264
tcpdump confirms that the new eBPF code is working: roughly 1 of every 10 TCP packets has the PSH flag set. With just 20 lines of C code, we selectively mangled the TCP packets leaving a VM, replicating an error that happened in the field, all without recompiling any driver and without even rebooting! This greatly simplified validation of the Open vSwitch fix, in a way that would have been impossible with other tools.
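When testing is over, the eBPF program can be detached by removing the filter and the clsact qdisc. A brief cleanup sketch, again assuming the eth0 interface used above:
# tc filter del dev eth0 egress
# tc qdisc del dev eth0 clsact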
Conclusion
eBPF is a fairly new technology, and the community has strong opinions about its adoption. It's also worth noting that eBPF-based projects like bpfilter are becoming more popular, and as a consequence, various hardware vendors are starting to implement eBPF support directly in their NICs.
While eBPF is not a silver bullet and should not be abused, I think it is a very powerful tool for network debugging and it deserves attention. I am sure it will play a really important role in the future of networks.
Download Red Hat Enterprise Linux 8 and try eBPF.