High-speed network packet processing presents a challenging performance problem on servers. Modern network interface cards (NICs) can process packets at a much higher rate than a single host CPU can keep up with. To scale processing across multiple CPUs, the Linux kernel relies on a NIC hardware feature named Receive Side Scaling (RSS), which uses a flow hash to spread incoming traffic across the RX IRQ lines, each handled by a different CPU. Unfortunately, there are a number of situations where hardware RSS fails; for instance, when the received traffic is not supported by the NIC RSS engine. In those cases the card delivers all packets to the same RX IRQ line and thus the same CPU.
Previously, if hardware features did not match the deployment use case, there was no good way to fix it. But eXpress Data Path (XDP) offers a high-performance, programmable hook that makes routing to multiple CPUs possible, so the Linux kernel is no longer limited by the hardware. This article shows how to handle this situation in software, with a strong focus on how to solve the issue using XDP and a CPUMAP redirect.
Faster software receive steering with XDP
The Linux kernel already has some software implementations of RSS called Receive Packet Steering (RPS) and Receive Flow Steering (RFS), but unfortunately, they do not perform well enough to be a replacement for hardware RSS. A faster and more scalable software solution uses XDP to redirect raw frames into a CPUMAP.
XDP is a kernel layer invoked before the normal network stack. This means XDP runs before allocation of the socket buffer (SKB), the kernel object that keeps track of the network packet. XDP generally avoids any per-packet memory allocations.
What is XDP?
XDP runs an eBPF program at the earliest possible point in the driver's receive path, once the DMA RX ring has been synced for the CPU. The eBPF program parses the received frame and returns an action, or verdict, acted on by the networking stack. Possible verdicts are:
- XDP_DROP: Drop the frame, which at the driver level means recycle without alloc.
- XDP_PASS: Let the frame pass to normal network stack handling.
- XDP_TX: Bounce the frame out of the same interface.
- XDP_REDIRECT: The advanced action this article focuses on.
Figure 1 shows the XDP architecture and how XDP interacts with the Linux networking stack.
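To make the verdict model concrete, here is a minimal sketch assuming a libbpf-style build; the program name is illustrative and not taken from this article. It drops everything that is not IPv4 and passes IPv4 frames on to the normal stack:

```c
/* Minimal XDP verdict example (illustrative sketch, not the article's code). */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_verdict_example(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	/* Bounds check required by the verifier before touching frame data. */
	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	/* Drop anything that is not IPv4; let IPv4 continue to the stack. */
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```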
Redirecting into a CPUMAP
BPF maps are generic key-value stores and can have different data types. The maps are used both as an interface between a user-space application and an eBPF program running in the Linux kernel, and as a way to pass information to kernel helpers. As of this writing, there are 28 map types.
For our use case of software RSS with XDP, the CPUMAP type (BPF_MAP_TYPE_CPUMAP) is just what we need. In a CPUMAP, the map key is the CPU number (indexed from zero) and the map value is the configuration for that per-CPU map entry. Each CPUMAP entry has a dedicated kernel thread bound to the given CPU, representing the remote CPU's execution unit. The following pseudo-code illustrates the allocation of a CPUMAP entry and the related kernel thread:
```c
static int cpu_map_kthread_run(void *data)
{
	/* do some work */
}

int cpu_map_entry_alloc(int cpu, ...)
{
	...
	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, ...);
	kthread_bind(rcpu->kthread, cpu);
	wake_up_process(rcpu->kthread);
	...
}
```
We promised a faster solution with XDP, which is possible only due to the careful design and bulking details happening internally in CPUMAP. These internals are described at the end of the article in the Appendix section.
Moving raw frames to remote CPUs
The packet is received on the CPU (the receive CPU) to which the IRQ of the NIC RX queue is steered. This CPU is the one that initially sees the packet, and this is where the XDP program is executed. Because the objective is to scale the CPU usage across multiple CPUs, the eBPF program should use as few cycles as possible on this initial CPU—just enough to determine which remote CPU to send the packet to, and then to use the redirect eBPF helper with a CPUMAP to move the packet to a remote CPU for continued processing.
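A minimal sketch of such a receive-CPU program is shown below, assuming a libbpf-style build. The map name (cpu_map), the program name, and the simple source-address spreading are illustrative assumptions rather than the article's code; the map value layout is the struct bpf_cpumap_val covered later in this article:

```c
/* Illustrative sketch: spread packets across CPUs with a CPUMAP redirect. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NR_CPUS_MAX 64

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_cpumap_val));
	__uint(max_entries, NR_CPUS_MAX);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_cpu_redirect(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	__u32 cpu_idx;

	if (data + sizeof(*eth) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = data + sizeof(*eth);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Cheap flow spreading: pick a CPU from the source address. A real
	 * deployment would use a proper flow hash and redirect only to CPU
	 * indexes that were actually configured in the map. */
	cpu_idx = bpf_ntohl(iph->saddr) % NR_CPUS_MAX;

	return bpf_redirect_map(&cpu_map, cpu_idx, 0);
}

char _license[] SEC("license") = "GPL";
```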
The remote CPUMAP kthread receives raw XDP frame (xdp_frame) objects. Thus, the SKB object is allocated by the remote CPU, and the SKB is passed into the networking stack from there. The following example extends the kthread pseudo-code to clarify SKB allocation and how SKBs are forwarded to the Linux networking stack:
```c
static int cpu_map_kthread_run(void *data)
{
	while (!kthread_should_stop()) {
		...
		skb = cpu_map_build_skb();
		/* forward to the network stack */
		netif_receive_skb_core(skb);
		...
	}
}
```
Pitfall: Missing SKB information on the remote CPU
When an SKB is created based on the xdp_frame object, certain optional SKB fields are not populated. This is because these fields come from the NIC hardware RX descriptor, and on the remote CPU this RX descriptor is no longer available. The two pieces of hardware "partial-offload" information that commonly go missing are the HW RX checksum information (skb->ip_summed + skb->csum) and the HW RX hash. Less commonly used values that also go missing are the VLAN, RX timestamp, and the mark value.
The missing RX checksum causes a slowdown when transmitting the SKB, because the checksum has to be recalculated. When the network stack needs to access the hash value (see the skb_get_hash() function), it triggers a software recalculation of the hash.
New CPUMAP feature: Running XDP on the remote CPU
Starting with Linux kernel version 5.9 (and soon in Red Hat Enterprise Linux 8), the CPUMAP can run a new (second) XDP program on the remote CPU. This helps scalability because the receive CPU should spend as few cycles as possible per packet. The remote CPU to which the packet is directed can afford to spend more cycles, such as to look deeper into packet headers. The following pseudo-code shows what is executed when the eBPF program associated with the CPUMAP entry runs:
```c
static int cpu_map_bpf_prog_run_xdp(void *data)
{
	...
	act = bpf_prog_run_xdp();
	switch (act) {
	case XDP_DROP:
		...
	case XDP_PASS:
		...
	case XDP_TX:
		...
	case XDP_REDIRECT:
		...
	}
	...
}

static int cpu_map_kthread_run(void *data)
{
	while (!kthread_should_stop()) {
		...
		cpu_map_bpf_prog_run_xdp();
		...
		skb = cpu_map_build_skb();
		/* forward to the network stack */
		netif_receive_skb_core(skb);
		...
	}
}
```
This second XDP program, which runs on each remote CPU, is attached by inserting the eBPF program (file descriptor) at the map-entry level. This was achieved by extending the map value, now defined as UAPI via struct bpf_cpumap_val:
```c
/* CPUMAP map-value layout
 *
 * The struct data-layout of map-value is a configuration interface.
 * New members can only be added to the end of this structure.
 */
struct bpf_cpumap_val {
	__u32 qsize;	/* queue size to remote target CPU */
	union {
		int   fd;	/* prog fd on map write */
		__u32 id;	/* prog id on map read */
	} bpf_prog;
};
```
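From user space, a CPUMAP entry can be configured with standard libbpf calls by writing a struct bpf_cpumap_val that carries both the queue size and the file descriptor of the second program. The sketch below is illustrative; the map and program names (cpu_map, xdp_cpumap_prog) and the queue size are assumptions, not the article's code:

```c
/* Illustrative user-space sketch: configure one CPUMAP entry with a queue
 * size and a second XDP program, using standard libbpf calls. */
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int configure_cpumap_entry(struct bpf_object *obj, __u32 cpu)
{
	struct bpf_cpumap_val val = { .qsize = 2048 };
	struct bpf_program *prog;
	struct bpf_map *cpumap;

	cpumap = bpf_object__find_map_by_name(obj, "cpu_map");
	prog = bpf_object__find_program_by_name(obj, "xdp_cpumap_prog");
	if (!cpumap || !prog)
		return -1;

	/* On map write, bpf_prog.fd selects the program run on this CPU. */
	val.bpf_prog.fd = bpf_program__fd(prog);

	return bpf_map_update_elem(bpf_map__fd(cpumap), &cpu, &val, 0);
}
```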
Practical use case: An issue on low-end hardware
Some multicore devices on the market do not support RSS. All the interrupts generated by the NICs on such devices are managed by a single CPU (typically CPU0).
However, using XDP and CPUMAPs, it is possible to implement a software approximation of RSS for these devices. By loading an XDP program on the NIC to redirect packets to CPUMAP entries, you can balance the traffic on all available CPUs, executing just a few instructions on the core connected to the NIC IRQ line. The eBPF program running on the CPUMAP entries will implement the logic to redirect the traffic to a remote interface or forward it to the networking stack. Figure 2 shows the system architecture of this solution on the EspressoBin (mvneta). Most of the code is executed on the CPUMAP entry associated with CPU1.
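The program attached to the CPUMAP entries could look roughly like the following sketch, which forwards IPv4 traffic out of another interface through a DEVMAP and passes everything else to the networking stack. The map names, the fixed DEVMAP index, and the xdp/cpumap section name (which, with a recent libbpf, marks the program as attachable to a CPUMAP entry) are illustrative assumptions, not the EspressoBin code:

```c
/* Illustrative second-level program run from a CPUMAP entry: forward IPv4
 * out of the device stored in tx_port[0], hand everything else to the stack. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 8);
} tx_port SEC(".maps");

SEC("xdp/cpumap")
int xdp_cpumap_prog(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	if (data + sizeof(*eth) > data_end)
		return XDP_PASS;

	/* Deeper, more expensive packet inspection can run here, on the
	 * remote CPU, without slowing down the receive CPU. */
	if (eth->h_proto == bpf_htons(ETH_P_IP))
		return bpf_redirect_map(&tx_port, 0, 0);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```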
Future development
Currently, CPUMAP doesn't call into the generic receive offload (GRO) system, which would boost TCP throughput by creating an SKB that points to several TCP data segments. To close this gap with the regular SKB path, we need to extend CPUMAP (and XDP in general) with support for jumbo frames and leverage the GRO code path available in the networking stack. No worries, we are already working on it!
Acknowledgments
I would like to thank Jesper Dangaard Brouer and Toke Høiland-Jørgensen for their detailed contributions and feedback on this article.
Appendix
This section explains complexities that will interest some readers but are not necessary to understand the basic concepts discussed in the article.
Problems with older software receive steering
The Linux kernel already has software features called Receive Packet Steering (RPS) and Receive Flow Steering (RFS), which are logically a software implementation of RSS. These features are hard to configure and have limited scalability and performance. The performance issue arises because RPS and RFS happen too late in the kernel's receive path, most importantly after the allocation of the SKB. Transferring and queuing these SKB objects to a remote CPU is also a cross-CPU scalability bottleneck that involves interprocessor communication (IPC) calls and moving cache lines between CPUs.
Pitfall: Incorrect RSS with Q-in-Q VLANs
When the NIC hardware parser doesn't recognize a protocol, it cannot calculate a proper RX hash and thus cannot do proper RSS across the available RX queues in the hardware (which are bound to IRQ lines).
This particularly applies to new protocols and encapsulations developed after the NIC hardware was released; the problem was very visible when VXLAN was first introduced. Some NICs can be extended to support new protocols through a firmware upgrade.
Moreover, you would expect NICs to work well with the old, common VLAN (IEEE 802.1Q) protocol standard. They do, except that multiple or stacked VLANs seem to break on many common NICs. The standard for these multiple VLANs is IEEE 802.1ad, informally known as Q-in-Q (incorporated into 802.1Q in 2011). Practical Q-in-Q RSS issues have been seen with the ixgbe and i40e NIC drivers.
What makes XDP_REDIRECT special?
The XDP_REDIRECT verdict is different from the others in that it can queue XDP frame (xdp_frame) objects into a BPF map. All the other verdicts need to take immediate action, because the xdp_buff data structure that keeps track of the packet data is not allocated anywhere; it is simply a variable in the function call itself.
For the sake of performance, it is essential to avoid per-packet allocations when converting the xdp_buff into an xdp_frame, which is what allows the object to be queued for XDP_REDIRECT. To avoid any memory allocation, the xdp_frame object is placed in the top headroom of the data packet itself. A CPU prefetch operation runs before the XDP eBPF program, to avoid the overhead of writing into this cache line.
Before returning an XDP_REDIRECT verdict, the XDP eBPF program calls one of the following BPF helpers to describe the redirect destination to which the frame should be sent:
- bpf_redirect(ifindex, flags)
- bpf_redirect_map(bpf_map, index_key, flags)
The first helper simply chooses the Linux network device destination using the ifindex as a key. The second helper is the big leap that allows users to extend XDP redirect: it can redirect into a BPF map at a specific index_key. This flexibility can be used for CPU steering.
The ability to bulk is important for performance. The map redirect is responsible for creating the bulk effect, because drivers are required to call an xdp_flush operation when the NAPI poll budget ends. The design allows the individual map-type implementation to control the level of bulking. The next section explains how bulking is used to mitigate the overhead of cross-CPU communication.
Efficient transfer between CPUs
The CPUMAP entry represents a multi-producer, single-consumer (MPSC) queue, implemented via a ptr_ring in the kernel. The single consumer is the CPUMAP kthread, which can access the ptr_ring queue without taking any lock. It also tries to bulk dequeue eight xdp_frame objects at a time, as their pointers fit in one cache line.
The multiple producers are RX IRQ line CPUs queuing up packets simultaneously for the same remote CPU. To avoid lock contention on the shared queue, each producer CPU has a small eight-object store that is used to bulk-enqueue frames into the cross-CPU queue. This careful queue usage means that each cache line transfers eight frames across the CPUs.
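The following simplified sketch illustrates the producer-side bulking idea. It is illustrative pseudo-C, not the actual kernel implementation; the structure and function names are assumptions:

```c
/* Simplified sketch of producer-side bulking (illustrative, not the real
 * kernel code): each producer CPU fills a small local queue of eight frames
 * and flushes it into the shared ptr_ring in a single locked round trip. */
#include <linux/ptr_ring.h>
#include <linux/spinlock.h>

#define CPU_MAP_BULK_SIZE 8	/* eight 8-byte pointers: one cache line */

struct xdp_bulk_queue {
	void *q[CPU_MAP_BULK_SIZE];
	unsigned int count;
};

/* Flush the local bulk queue into the remote CPU's ptr_ring. */
static void bq_flush(struct xdp_bulk_queue *bq, struct ptr_ring *ring)
{
	unsigned int i;

	spin_lock(&ring->producer_lock);
	for (i = 0; i < bq->count; i++)
		__ptr_ring_produce(ring, bq->q[i]); /* drop handling omitted */
	spin_unlock(&ring->producer_lock);
	bq->count = 0;
}

/* Called on the receive CPU for every frame redirected to this CPUMAP entry;
 * a final flush also runs when the NAPI poll budget ends. */
static void bq_enqueue(struct xdp_bulk_queue *bq, struct ptr_ring *ring,
		       void *xdpf)
{
	if (bq->count == CPU_MAP_BULK_SIZE)
		bq_flush(bq, ring);
	bq->q[bq->count++] = xdpf;
}
```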