One word that is thrown around a lot about eBPF is powerful. But what does this mean in practice? Well, it means that eBPF makes many things possible that were previously either completely impossible or at least very cumbersome. In this article, we will do a deep dive into a practical example of this power, showing off eBPF’s ability to poke deep within the Linux kernel internals to answer questions about the state of the running kernel.
The subject of our investigation is netfilter, the firewalling subsystem in the Linux kernel, which is the technology that implements the actual packet filtering underneath both the older iptables firewall interface and the newer nftables. Even with higher-level utilities, such as firewalld (used in Red Hat Enterprise Linux and Fedora), the kernel netfilter subsystem still performs the underlying filtering.
Which rule caused the drop
We will start with a concrete question: Which of the rules in my firewall ruleset caused the drop of a particular packet?
The goal of this article is to demonstrate how to answer this question. We can use this same approach to answer a myriad of questions about the kernel and to hook into pretty much any arbitrary point inside the kernel.
Let’s dive in by first looking at what a netfilter ruleset actually looks like.
A netfilter ruleset example
For the purpose of this investigation, we will use a simple nftables ruleset with just a single input chain that processes all incoming traffic, similar to the simple nftables example ruleset. This ruleset is much simpler than something produced by firewalld, but it makes it easier to determine what’s going on.
We can ask the nft binary to dump the ruleset, including the handles for each rule (-a), which we can use to identify them later.
$ sudo nft -a list ruleset
table inet filter { # handle 6
    chain input { # handle 1
        type filter hook input priority filter; policy drop;
        ct state { established, related } accept # handle 5
        ct state invalid drop # handle 6
        tcp dport 10000 drop # handle 7
        iifname "lo" accept # handle 8
        iifname "veth0" accept # handle 9
        iifname "eth0" accept # handle 10
        ip protocol icmp accept # handle 11
        ip6 nexthdr ipv6-icmp icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, echo-request, echo-reply, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert } accept # handle 13
    }
    chain forward { # handle 2
        type filter hook forward priority filter; policy drop;
    }
    chain output { # handle 3
        type filter hook output priority filter; policy accept;
    }
}

Looking at the input chain, each line (after the initial line that starts with type filter) is a rule that will execute in sequence. If a rule matches the packet and contains a verdict (accept or drop), processing of further rules will stop, rendering that verdict as the final one for the packet. We want to be able to hook into the processing, and whenever a drop verdict is reached, look at the packet data (to know which packet it was and to identify the flow it belongs to). We also need to know which rule was the cause of the drop verdict, so we can troubleshoot our ruleset.
As you can see from the previous output, each rule has a handle that identifies it within its chain. So whenever a particular packet is dropped, we can find the handle of the rule that dropped it. We'll use the (redundant) rule with handle 7 as a test rule just for this experiment.
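For reference, a test rule like the one with handle 7 can be appended to the input chain with a single nft command; the table and chain names here match this example ruleset, so adjust them for your own setup:

$ sudo nft add rule inet filter input tcp dport 10000 drop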
Finding a place to hook into the kernel
To find a good place for our hook, we need to dive into the kernel source code and figure out how the internals work. Once we have figured that out, we can turn the knowledge into an eBPF probe that will look at the currently running kernel.
The main kernel function that executes a netfilter chain is nft_do_chain(), which lives in net/netfilter/nf_tables_core.c and looks like this:
unsigned int
nft_do_chain(struct nft_pktinfo *pkt, void *priv)
{
    const struct nft_chain *chain = priv, *basechain = chain;
    const struct nft_rule_dp *rule, *last_rule;
    const struct net *net = nft_net(pkt);
    const struct nft_expr *expr, *last;
    struct nft_regs regs = {};
    unsigned int stackptr = 0;
    struct nft_jumpstack jumpstack[NFT_JUMP_STACK_SIZE];
    bool genbit = READ_ONCE(net->nft.gencursor);
    struct nft_rule_blob *blob;
    struct nft_traceinfo info;

    info.trace = false;
    if (static_branch_unlikely(&nft_trace_enabled))
        nft_trace_init(&info, pkt, &regs.verdict, basechain);
do_chain:
    if (genbit)
        blob = rcu_dereference(chain->blob_gen_1);
    else
        blob = rcu_dereference(chain->blob_gen_0);

    rule = (struct nft_rule_dp *)blob->data;
    last_rule = (void *)blob->data + blob->size;
next_rule:
    regs.verdict.code = NFT_CONTINUE;
    for (; rule < last_rule; rule = nft_rule_next(rule)) {
        nft_rule_dp_for_each_expr(expr, last, rule) {
            if (expr->ops == &nft_cmp_fast_ops)
                nft_cmp_fast_eval(expr, &regs);
            else if (expr->ops == &nft_cmp16_fast_ops)
                nft_cmp16_fast_eval(expr, &regs);
            else if (expr->ops == &nft_bitwise_fast_ops)
                nft_bitwise_fast_eval(expr, &regs);
            else if (expr->ops != &nft_payload_fast_ops ||
                     !nft_payload_fast_eval(expr, &regs, pkt))
                expr_call_ops_eval(expr, &regs, pkt);

            if (regs.verdict.code != NFT_CONTINUE)
                break;
        }

        switch (regs.verdict.code) {
        case NFT_BREAK:
            regs.verdict.code = NFT_CONTINUE;
            nft_trace_copy_nftrace(pkt, &info);
            continue;
        case NFT_CONTINUE:
            nft_trace_packet(pkt, &info, chain, rule,
                             NFT_TRACETYPE_RULE);
            continue;
        }
        break;
    }

    nft_trace_verdict(&info, chain, rule, &regs);

    switch (regs.verdict.code & NF_VERDICT_MASK) {
    case NF_ACCEPT:
    case NF_DROP:
    case NF_QUEUE:
    case NF_STOLEN:
        return regs.verdict.code;
    }

    switch (regs.verdict.code) {
    case NFT_JUMP:
        if (WARN_ON_ONCE(stackptr >= NFT_JUMP_STACK_SIZE))
            return NF_DROP;
        jumpstack[stackptr].chain = chain;
        jumpstack[stackptr].rule = nft_rule_next(rule);
        jumpstack[stackptr].last_rule = last_rule;
        stackptr++;
        fallthrough;
    case NFT_GOTO:
        chain = regs.verdict.chain;
        goto do_chain;
    case NFT_CONTINUE:
    case NFT_RETURN:
        break;
    default:
        WARN_ON_ONCE(1);
    }

    if (stackptr > 0) {
        stackptr--;
        chain = jumpstack[stackptr].chain;
        rule = jumpstack[stackptr].rule;
        last_rule = jumpstack[stackptr].last_rule;
        goto next_rule;
    }

    nft_trace_packet(pkt, &info, basechain, NULL, NFT_TRACETYPE_POLICY);

    if (static_branch_unlikely(&nft_counters_enabled))
        nft_update_chain_stats(basechain, pkt);

    return nft_base_chain(basechain)->policy;
}

The main rule execution happens in the two nested for-loops around the middle of the function, right after the next_rule label. Netfilter is based on a virtual machine for rule execution, and each rule is translated into expressions, which correspond roughly to the words in a rule definition. The details of this are not really important for the problem at hand, so we'll skip them. It's enough to simply note that the outer loop goes through all the rules in the chain, and the inner one executes each expression in the rule, breaking out of the loop if the expression reaches a verdict.
Looking a bit further down in the code, the call to nft_trace_verdict() jumps out as really the obvious place to hook into. And indeed, the netfilter tracing functionality does allow us to see exactly what we want, but it’s not turned on by default and has some overhead associated with it, so we’ll skip that for this example.
Since we’re exploring eBPF functionality, let's see if we can replicate the functionality with eBPF instead.
Turning our hook point into a kprobe
The eBPF tracing infrastructure makes it possible to attach probes to almost any function in the running kernel (with some exceptions for functions explicitly marked as untraceable because they run in a critical context). To see all the probe points available in the running kernel, use bpftrace -l. But be warned: it's a long list.
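bpftrace -l also accepts a glob pattern, so we can narrow the list down to just the functions we care about. For example, a quick way to check that nft_do_chain is actually available as a kprobe (assuming the nf_tables module is loaded) is:

$ sudo bpftrace -l 'kprobe:nft_do_chain*'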
By default, a kprobe attaches at the start of a function and can access the function arguments. However, it is also possible to supply an offset to a kprobe, which is an instruction offset inside the function itself, meaning we can attach to an arbitrary instruction in the machine code of the compiled function. Refer to the kprobes documentation for details on how this works. This is exactly what we need; we just have to figure out which instruction to attach to.
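In bpftrace syntax, such a probe is written as function+offset. The function name and offset in the following snippet are placeholders just to illustrate the form; the real values for our case are worked out in the rest of this section:

kprobe:some_function+0x10
{
    // Hypothetical probe body: count how often this instruction is hit.
    @hits = count();
}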
First, we need to find a good place to hook in. Branches work well because they usually translate into obvious instructions. Looking at the C source, we already identified that the call site for nft_trace_verdict() would be a good place, but that is somewhat obscured by the static_call infrastructure. However, right after it, there is this branch, which turns out to be a better candidate:
switch (regs.verdict.code & NF_VERDICT_MASK) {
case NF_ACCEPT:
case NF_DROP:
case NF_QUEUE:
case NF_STOLEN:
    return regs.verdict.code;
}

To find the attachment point offset that we can pass to bpftrace, we'll need to look at the disassembled code of the compiled function. We'll use objdump and its support for reading debuginfo and emitting the source code of the function as part of the output, using the -S and --line-numbers options. These options only work if the kernel debuginfo is installed, which can be done by running sudo dnf debuginfo-install kernel. The following output is snipped to the relevant bits. We've done this experiment on Arm64.
$ objdump --disassemble=nft_do_chain -rw -S --line-numbers nf_tables.ko
Disassembly of section .text:
0000000000000134 <nft_do_chain>: <-- Base
nft_do_chain():
< snip >
nft_trace_verdict(&info, chain, rule, ®s);
switch (regs.verdict.code & NF_VERDICT_MASK) {
2dc: 721e145f tst w2, #0xfc <-- Offset
2e0: 54000fa0 b.eq 4d4 <nft_do_chain+0x3a0> // b.none

At the start of the output, we see the base of the function (marked with <-- Base), and further down we find the instruction we're looking for (marked with <-- Offset).
Now we just need to figure out how we are going to access the values we need: the struct nft_rule_dp pointer that leads to the rule, and the pkt pointer that allows us to look at the packet data itself.
The latter is actually fairly straightforward: the pkt pointer is passed in as the first argument to the function, which on Arm64 means it arrives in the x0 register. Near the start of the function, there's an instruction that moves the contents of x0 to x21, and x21 is not overwritten later:
158: aa0003f5 mov x21, x0
We can also find the struct nft_rule_dp pointer from the code that loads it:
rule = (struct nft_rule_dp *)blob->data;
1e0: aa0603fb mov x27, x6

Putting it all together
Now we know where to attach our kprobe and where to find the information we are looking for. We have the offset into the function (computed as the difference between the Base and Offset addresses marked in the objdump output above), the verdict code in the w2 register, and two pointers stored in registers x21 and x27, which point to the data structures that will allow us to extract the data we need.
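The offset arithmetic is simple enough to do in the shell, using the two addresses from the objdump output (0x134 for the base, 0x2dc for the instruction we want):

$ printf '0x%x\n' $((0x2dc - 0x134))
0x1a8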
Thanks to the eBPF tracing machinery, looking at these pointers is safe even if they turn out to be invalid: a bad pointer dereference won't crash the kernel; we'll just get an invalid value back. So if we get reasonable values, we know that we are poking at the right bit of memory.
So given all this, how do we turn this into an eBPF program that we can load into the kernel?
Well, the easiest way to go about this is to use the bpftrace utility, which is packaged and supported in Fedora and RHEL. With it, we can write a small script in its special-purpose scripting language, which bpftrace will dynamically compile into an eBPF program and load into the kernel for us. Bpftrace even lets us cast our pointers directly to their struct types and dereference them to get at the values we need, using the embedded type information from the running kernel.
To try it, we use the following bpftrace script:
#!/usr/bin/bpftrace

kprobe:nft_do_chain+0x1a8 {
    if ((reg("r2") & 0xff) == 0) {
        $pkt = (struct nft_pktinfo *)reg("r21");
        $skb = (struct sk_buff *)$pkt->skb;
        $rule = (struct nft_rule_dp *)reg("r27");
        printf("Packet with len %d dropped by rule with handle %d\n",
               $skb->len,
               $rule->handle);
    }
}

This first looks at the verdict code, which we previously determined is in the r2 register (the same as the w2 register we saw earlier; bpftrace just uses a different naming convention for registers). NF_DROP has a value of 0, and we're only interested in drops, so we only print something if this was a drop verdict. That's the if statement in the bpftrace script.
When we do get a drop verdict, we simply print out a message on the console with the packet length, which we read from the sk_buff, and the handle of the rule that gave the verdict, which we read from the nft_rule_dp struct. If we wanted to look at the packet data itself, we'd have to do a bit more work, but for this short example, the length will suffice.
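As a rough sketch of what that extra work could look like (this is not part of the script above; it assumes the dropped packet is IPv4 and that the network header offset has already been set on the skb at this point), the drop branch could pull the endpoint addresses out of the sk_buff:

    // Hypothetical addition to the drop branch in nfdrop.bt
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    printf("  dropped flow: %s -> %s\n",
           ntop($iph->saddr), ntop($iph->daddr));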
Saving the script above as nfdrop.bt and running it, we get output like this (while producing test traffic from another machine that we know will get dropped):
$ sudo ./nfdrop.bt
Attaching 1 probe...
Packet with len 60 dropped by rule with handle 7
Packet with len 60 dropped by rule with handle 7
Packet with len 60 dropped by rule with handle 7
Packet with len 60 dropped by rule with handle 7
Packet with len 60 dropped by rule with handle 7
^C

Success! Handle 7 is the extra rule we added to the ruleset to test with, and this output appears whenever we produce traffic with a destination port of 10000, as the rule specifies. Dumping the traffic with tcpdump confirms that the packet size is 60 bytes, so we're getting sane values.
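For completeness, the test traffic and the tcpdump check can look something like the following; the hostname and interface name are placeholders for this particular setup:

# On another machine: try to connect to the filtered port (this hangs until the timeout, since the SYN is dropped)
$ nc -w 2 testhost.example.com 10000

# On the test machine: confirm the incoming packets and their size on the wire
$ sudo tcpdump -ni eth0 'tcp and dst port 10000'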
Kernel version limitations
While the resulting script is quite small, it is also entirely specific to the kernel binary that we are currently running. Any change in the code, or even a change in how the compiler generates the code for that function, can invalidate it, since that may change the function offsets and/or the register usage.
This means that it is challenging to write a reusable script that works across different kernel versions. But as long as the machines run the same kernel RPM version, the script will work across all of them. New versions are not guaranteed to invalidate the offsets, but if the kernel version changes, the script should be checked to make sure that it still works.
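One way to re-check a script like ours after a kernel update is to repeat the objdump step against the new nf_tables.ko and confirm that the comparison instruction we attach to still sits at the same offset; for example, something along these lines (compare the printed address against the offset used in the script):

$ objdump --disassemble=nft_do_chain -rw nf_tables.ko | grep 'tst.*w2, #0xfc'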
Takeaways
With this investigation, we have seen an example of the power of eBPF. We can attach our own code in the middle of an arbitrary kernel function, deep inside the kernel internals, and extract information from it. This is an incredibly powerful technique for obtaining almost any kind of information from a running kernel. Remember, we did all this without changing anything in the kernel code, or even rebooting the machine.
In this case, getting the information we wanted involved quite a bit of manual work to decipher the function code and find the attach point. However, many functions are less complex than the one we were looking at here, in which case finding the right attach point can be easier. In many cases, just attaching at the start or the end of a function (or both) can supply the information needed.
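As an illustration of that simpler pattern (a minimal sketch, not something we needed for this article), the same nft_do_chain() function could be probed at its entry and exit to correlate each packet with the verdict the whole chain returned, with no offsets or register spelunking at all, at the cost of not knowing which rule produced the verdict:

#!/usr/bin/bpftrace

// On entry, remember the nft_pktinfo pointer for this kernel thread.
kprobe:nft_do_chain {
    @pkt[tid] = arg0;
}

// On return, print the skb length together with the verdict nft_do_chain returned.
kretprobe:nft_do_chain /@pkt[tid]/ {
    $pkt = (struct nft_pktinfo *)@pkt[tid];
    printf("len %d -> verdict %d\n", $pkt->skb->len, retval);
    delete(@pkt[tid]);
}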
Indeed, many such cases have been turned into reusable utilities shipped as part of the libbpf-tools collection (also packaged in RHEL and Fedora). You can use these utilities for anything from listing TCP flow information, to showing kernel scheduler metrics, to debugging internal kernel locking, all in a portable way. For things not covered by this, the bpftrace utility we used in this example has excellent documentation for everything from simple one-liners to extensive tracing scripts.
We hope this article has been illuminating and equipped you to dig into eBPF tracing the next time you need to figure out what’s going on inside the Linux kernel.