Diving into XDP

In the first part of this series on XDP, I introduced XDP and discussed the simplest possible example. Let's now try to do something less trivial, exploring some more-advanced eBPF features—maps—and some common pitfalls.

XDP is available in Red Hat Enterprise Linux 8, which you can download and run now.

[Not] Reinventing the wheel

We will start by adding packet parsing to our sample. To simplify the task, we reuse the kernel definitions for common networking protocols by adding the following to the include section of our XDP program:

#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/if_vlan.h>
#include <linux/ip.h>

We now need to access the packet contents via the XDP context. Let's take a look at its definition:

struct xdp_md {
    __u32 data;
    __u32 data_end;
    __u32 data_meta;
    /* Below access go through struct xdp_rxq_info */
    __u32 ingress_ifindex; /* rxq->dev->ifindex */
    __u32 rx_queue_index; /* rxq->queue_index */
};

The packet contents lie between ctx->data and ctx->data_end, so we can add the parsing code and put the parsed addresses to use. In this case, we drop any packet with a zero IPv4 destination address:

/* Parse IPv4 packet to get SRC, DST IP and protocol */
static inline int parse_ipv4(void *data, __u64 nh_off, void *data_end, __be32 *src, __be32 *dest)
{
    struct iphdr *iph = data + nh_off;

    *src = iph->saddr;
    *dest = iph->daddr;
    return iph->protocol;
}

SEC("prog")
int xdp_drop(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    __be32 dest_ip, src_ip;
    __u16 h_proto;
    __u64 nh_off;
    int ipproto;

    nh_off = sizeof(*eth);

    /* parse vlan */
    h_proto = eth->h_proto;
    if (h_proto == __constant_htons(ETH_P_8021Q) ||
        h_proto == __constant_htons(ETH_P_8021AD)) {
        struct vlan_hdr *vhdr;

        vhdr = data + nh_off;
        nh_off += sizeof(struct vlan_hdr);
        h_proto = vhdr->h_vlan_encapsulated_proto;
    }
    if (h_proto != __constant_htons(ETH_P_IP))
        goto pass;

    ipproto = parse_ipv4(data, nh_off, data_end, &src_ip, &dest_ip);
    if (!dest_ip)
        return XDP_DROP;

pass:
    return XDP_PASS;
}

Tripping on the verifier

The above code should compile just fine, but if we try to load it with iproute2, we get an unpleasant surprise:

Prog section 'prog' rejected: Permission denied (13)!
- Type: 6
- Instructions: 19 (0 over limit)
- License:

Verifier analysis:

0: (61) r1 = *(u32 *)(r1 +0)
1: (71) r2 = *(u8 *)(r1 +13)
invalid access to packet, off=13 size=1, R1(id=0,off=0,r=0)
R1 offset is outside of the packet

Error fetching program/map!

It fails the verifier check! The verifier's error message can be somewhat misleading, since we are accessing only the first few bytes of the packet. We know that each Ethernet frame must be at least 64 bytes long, so we know we are accessing valid offsets inside the packet.

The verifier, instead, relies only on explicit checks: before accessing or manipulating any offset inside the packet, we must add a conditional check that the offset is inside the packet body. In our example, before accessing each header, we must ensure that the end of the header does not extend past the packet end, by adding a patch like this:

@@ -17,6 +17,9 @@ static inline int parse_ipv4(void *data, __u64 nh_off, void *data_end,
  {
      struct iphdr *iph = data + nh_off;

+     if (iph + 1 > data_end)
+         return 0;
+
      *src = iph->saddr;
      *dest = iph->daddr;
      return iph->protocol;
@@ -34,6 +37,8 @@ int xdp_drop(struct xdp_md *ctx)
      int ipproto;

      nh_off = sizeof(*eth);
+     if (data + nh_off > data_end)
+         goto pass;

      /* parse vlan */
      h_proto = eth->h_proto;
@@ -43,6 +48,8 @@ int xdp_drop(struct xdp_md *ctx)

      vhdr = data + nh_off;
      nh_off += sizeof(struct vlan_hdr);
+     if (data + nh_off > data_end)
+         goto pass;
      h_proto = vhdr->h_vlan_encapsulated_proto;
      }
      if (h_proto != __constant_htons(ETH_P_IP))

The verifier should be happy now!
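
The same "check before every access" discipline can be tried out in plain user-space C, without the verifier. The following is only a sketch under stated assumptions: parse_dest_ip() and the *_SKETCH constants are hypothetical names that are not part of the XDP sample, the header offsets are hard-coded instead of taken from the kernel headers, and the IPv4 header is assumed to carry no options.

```c
#include <arpa/inet.h>   /* htons */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for this sketch only. */
#define ETH_HLEN_SKETCH      14      /* dst MAC + src MAC + EtherType */
#define VLAN_HLEN_SKETCH     4       /* TCI + encapsulated EtherType */
#define IPV4_HLEN_SKETCH     20      /* IPv4 header without options */
#define ETHERTYPE_IPV4_SK    0x0800
#define ETHERTYPE_8021Q_SK   0x8100
#define ETHERTYPE_8021AD_SK  0x88A8

/* Stores the IPv4 destination address (network byte order) in *dest and
 * returns 0, or returns -1 if a header would extend past data_end or the
 * frame does not carry IPv4. Every read is preceded by a length check. */
static int parse_dest_ip(const uint8_t *data, const uint8_t *data_end,
                         uint32_t *dest)
{
    size_t len = (size_t)(data_end - data);
    size_t nh_off = ETH_HLEN_SKETCH;
    uint16_t h_proto;

    if (nh_off > len)
        return -1;
    memcpy(&h_proto, data + 12, sizeof(h_proto));   /* EtherType field */

    if (h_proto == htons(ETHERTYPE_8021Q_SK) ||
        h_proto == htons(ETHERTYPE_8021AD_SK)) {
        if (nh_off + VLAN_HLEN_SKETCH > len)
            return -1;
        /* the encapsulated EtherType follows the 2-byte TCI */
        memcpy(&h_proto, data + nh_off + 2, sizeof(h_proto));
        nh_off += VLAN_HLEN_SKETCH;
    }
    if (h_proto != htons(ETHERTYPE_IPV4_SK))
        return -1;

    if (nh_off + IPV4_HLEN_SKETCH > len)
        return -1;
    memcpy(dest, data + nh_off + 16, sizeof(*dest)); /* daddr at offset 16 */
    return 0;
}
```

Feeding this function a truncated buffer returns -1 instead of reading past the end, which is the property the verifier insists on proving for the real XDP program.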

Custom XDP loader

We already talked about maps in part 1; let's see how we can use them in practice. We want to enhance our XDP program to allow the user to configure the addresses to be dropped at runtime and also to be able to read the related stats.

As a first step, we need to replace the iproute2 tool with a custom loader program, as the tool does not allow map manipulation. The code fragment used to load the XDP program should be something like this:

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <errno.h>
#include <error.h>
#include <net/if.h>
#include <string.h>

// [ ... ]
    struct bpf_prog_load_attr prog_load_attr = {
        .prog_type = BPF_PROG_TYPE_XDP,
        .file = "xdp_drop_kern.o",
    };
// [ ... ]
    if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
        error(1, errno, "can't load %s", prog_load_attr.file);

    ifindex = if_nametoindex(dev_name);
    if (!ifindex)
        error(1, errno, "unknown interface %s\n", dev_name);
    if (bpf_set_link_xdp_fd(ifindex, prog_fd, 0) < 0)
        error(1, errno, "can't attach to interface %s:%d: "
              "%d:%s\n", dev_name, ifindex, errno,
              strerror(errno));
// [ ... ]
    // cleaning-up
    bpf_set_link_xdp_fd(ifindex, -1, 0);

We are using the libbpf helper library, bundled with the Linux kernel sources. bpf_prog_load_xattr() loads the eBPF program specified by the prog_load_attr argument. It parses all the ELF sections of the specified object, extracts the related info, and places it into the obj state data. Each program found (text section) is then loaded into the kernel and exposed via a newly allocated file descriptor (prog_fd).

Such a file descriptor is later used to attach the loaded program to the selected device, via the bpf_set_link_xdp_fd() function. The last argument allows the user to specify several flags, such as a flag for replacing the existing XDP program, if any, or a flag for using the driver-level XDP hook. By default:

  • It will try to use the driver-level hook and then fall back to the common one.
  • If an XDP program is already installed on the specified device, it will fail.

Finally, the last helper, to be invoked at program termination, detaches the XDP program from the NIC and frees all the associated kernel resources.

Interacting with user space

Let's now move to the juicy part: maps! Every data structure shared between the user space and the eBPF program is called a "map," but there are actually several different types: hashmap, array, queue, and so on. Usually, there are two different variants: simple and per-CPU. With the per-CPU variant, each entry is replicated for all the locally available CPUs; inside the kernel, each CPU will access only its private copy. The per-CPU variant avoids any kind of contention-related issue and it's the preferred one when the eBPF program must modify the data entries on a per-packet basis.

The map data will be accessed by both user space and the eBPF program. It's convenient to add the data type definition in a header file included by both sides. In this example, we use a map to specify the source addresses to be filtered and count the number of bytes and packets dropped for each specified address. To add such a map to our program, we need something like this:

    // in xdp_drop_common.h
    struct stats_entry {
        __u64 packets;
        __u64 bytes;
    };

    // in xdp_drop_kern.c
    #include "xdp_drop_common.h"
    // [ ... ]
    /* addresses to drop, with per-address stats */
    struct bpf_map_def SEC("maps") drop_map = {
        .type = BPF_MAP_TYPE_PERCPU_HASH,
        .key_size = sizeof(__be32),
        .value_size = sizeof(struct stats_entry),
        .max_entries = 100,
    };

    // in xdp_drop_user.c
    struct bpf_map *map;
    int map_fd;
    // [ ... ]
    map = bpf_object__find_map_by_name(obj, "drop_map");
    if (!map)
        error(1, errno, "can't load drop_map");
    map_fd = bpf_map__fd(map);
    if (map_fd < 0)
        error(1, errno, "can't get drop_map fd");

Note that our map is a per-CPU hash table. Its definition contains only the key and value sizes, since that is all the kernel needs to allocate the map, perform lookups, and update entries. The map definition also contains the maximum number of entries allowed in the map. Hashmaps are initially empty, and inserting entries beyond that limit will fail; arrays have a fixed size equal to the specified limit. User space accesses the map via its file descriptor. Using the libbpf helpers may look a little overcomplicated here, but it really helps when the eBPF program exposes multiple maps.

We are now ready to add the user space/eBPF interaction:

    // in xdp_drop_kern.c
    struct stats_entry *stats;
    // [ ... ]
    stats = bpf_map_lookup_elem(&drop_map, &src_ip);
    if (!stats)
        goto pass;

    stats->packets++;
    stats->bytes += ctx->data_end - ctx->data;
    return XDP_DROP;

    // in xdp_drop_user.c
    // [ ... ]
    memset(&entry, 0, sizeof(entry));
    if (bpf_map_update_elem(map_fd, &saddr, &entry, BPF_ANY))
        error(1, errno, "can't add address %s\n", argv[i]);
    // [ ... ]
    if (bpf_map_lookup_elem(map_fd, &ipv4_addr, &entry))
        error(1, errno, "no stats for rule %x\n",
              ipv4_addr);
    printf("addr %x drop %llu:%llu\n", ipv4_addr,
           (unsigned long long)entry.packets,
           (unsigned long long)entry.bytes);

Now the eBPF program drops the packet only if the source IP address is found in the drop_map hash table, and it updates the related stats. The user-space program fills such a map with zeroed stats and (periodically) looks up such entries, printing out the stats reported by the eBPF program.

For brevity, the boilerplate user-space code that obtains the list of source addresses and terminates gracefully is omitted; once that is included, we are ready to build and run.
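
As an illustration of the omitted address-handling boilerplate, here is a minimal sketch that turns a dotted-quad string (for example, a command-line argument) into a key usable with the map. The helper name parse_ipv4_key() is hypothetical and is not claimed to match the full example's source; the only library call is the standard inet_pton().

```c
#include <arpa/inet.h>    /* inet_pton, struct in_addr */
#include <stdint.h>
#include <sys/socket.h>   /* AF_INET */

/* Converts a dotted-quad string into a network-byte-order key.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address. */
static int parse_ipv4_key(const char *text, uint32_t *key)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return -1;
    *key = addr.s_addr;   /* network byte order, same as iph->saddr */
    return 0;
}
```

Keeping the key in network byte order matters: the eBPF side looks up iph->saddr as found in the packet, so a host-order key would never match on little-endian machines.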

Some map caveats

The results obtained with the current code could be disappointing, ranging from random crashes of the user-space program to the eBPF filter being apparently ineffective. If the user-space program terminates abnormally, it will leave the XDP program attached to the network device, and later executions will fail at startup. In such a case, the user needs to manually detach the XDP program with iproute2:

ip link set dev <NIC> xdp off

In some lucky cases, the current code could work almost flawlessly, just failing to detach the XDP program at shutdown time.

While some of you may have already guessed where the problem is, we are going to use an XDP/eBPF debugging facility to dump the program status, by adding the following to xdp_drop_kern.c:

@@ -33,6 +33,13 @@ static inline int parse_ipv4(void *data, __u64 nh_off, void *data_end,
      return iph->protocol;
  }

+     #define bpf_printk(fmt, ...) \
+     ({ \
+         char ____fmt[] = fmt; \
+         bpf_trace_printk(____fmt, sizeof(____fmt), \
+         ##__VA_ARGS__); \
+     })
+
      SEC("prog")
      int xdp_drop(struct xdp_md *ctx)
      {
@@ -45,6 +52,8 @@ int xdp_drop(struct xdp_md *ctx)
      __u64 nh_off;
      int ipproto;

+     bpf_printk("xdp_drop\n");
+
      nh_off = sizeof(*eth);
      if (data + nh_off > data_end)
          goto pass;
@@ -72,6 +81,8 @@ int xdp_drop(struct xdp_md *ctx)
      if (!stats)
          goto pass;

+     bpf_printk("xdp_drop pkts %lld:%lld\n", stats->packets, stats->bytes);
+
      stats->packets++;
      stats->bytes += ctx->data_end - ctx->data;
      return XDP_DROP;

Then we can run the example again and observe the messages emitted in:

/sys/kernel/debug/tracing/trace

The eBPF helper is invoked correctly for each ingress packet.

With some luck, you will observe that the stats associated with the map entry created by user space look corrupted: for example, they contain seemingly random values even for the first packet received after the entry was created.

We are using a per-CPU map: when setting the entry, the kernel reads <number of possible CPUs> values from the specified data address and copies each of them into the corresponding per-CPU value inside the kernel map. In our case, that means reading past the single stats variable into unrelated data on the user-space program stack.

Moreover, when the user-space process tries to read an entry from the map, the kernel copies the same amount of data to the specified address, this time overwriting other data on the stack and causing the random behavior mentioned above.
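
To make the size mismatch concrete, the sketch below computes how many bytes the kernel transfers for one element of a per-CPU map. The helper name percpu_value_bytes() is hypothetical; the padding of each per-CPU value to a multiple of 8 bytes reflects how the kernel lays out per-CPU map values for user space.

```c
#include <stddef.h>

/* Bytes copied to or from user space for one per-CPU map element:
 * each CPU's value is padded to a multiple of 8 bytes, and there is
 * one such value per possible CPU. */
static size_t percpu_value_bytes(size_t value_size,
                                 unsigned int nr_possible_cpus)
{
    size_t padded = (value_size + 7) & ~(size_t)7;

    return padded * nr_possible_cpus;
}
```

With a 16-byte struct stats_entry and 8 possible CPUs, the kernel transfers 128 bytes, while the buffer passed by our user-space code holds only 16.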

The solution is simply allocating enough storage for the map entry with something like this:

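    /* Assumes configured CPUs == possible CPUs; the kernel sizes per-CPU
     * values by the possible-CPU count. */
    int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
    struct stats_entry *entry;
// [ ... ]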
    entry = calloc(nr_cpus, sizeof(struct stats_entry));
    if (!entry)
        error(1, 0, "can't allocate entry\n");

And, when reading the map, walk and aggregate all the values:

    struct stats_entry all = { 0, 0 };

    if (bpf_map_lookup_elem(map_fd, &ipv4_addr, entry))
        error(1, errno, "no stats for address %x\n",
              ipv4_addr);

    for (j = 0; j < nr_cpus; j++) {
        all.packets += entry[j].packets;
        all.bytes += entry[j].bytes;
    }
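
The aggregation step can also be factored into a small function that is easy to test on its own. This is a sketch: sum_percpu() is a hypothetical helper, and struct stats_entry is redeclared here with fixed-width types only so that the block compiles standalone.

```c
#include <stdint.h>

/* Mirrors the struct in xdp_drop_common.h, using fixed-width types. */
struct stats_entry {
    uint64_t packets;
    uint64_t bytes;
};

/* Sums the per-CPU copies of one map element into a single total. */
static struct stats_entry sum_percpu(const struct stats_entry *values,
                                     int nr_cpus)
{
    struct stats_entry all = { 0, 0 };

    for (int j = 0; j < nr_cpus; j++) {
        all.packets += values[j].packets;
        all.bytes += values[j].bytes;
    }
    return all;
}
```

Since each CPU only ever touches its own copy, the per-CPU values are independent counters, and their sum is the total for the address.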

Now our IP filter application is ready!

The road ahead

In this article, we covered some of the functionality offered by XDP/eBPF, but there is much more. For example, there are many more eBPF helpers ready to be used for various tasks: updating the packet checksum after some modification, packet forwarding, and so on.

A good starting point is this header inside the Linux kernel sources, which contains the official documentation for the implemented helpers:

include/uapi/linux/bpf.h

Moreover, the samples/bpf/ directory, also in the kernel sources, contains several more complex XDP examples. Some background knowledge is required before diving into them, though.

The full source for the example discussed above can be found at:

https://github.com/altoor/xdp_walkthrough_examples

Happy hacking!

Last updated: March 24, 2023