Compared to SystemTap's default backend, one of stapbpf's most distinguishing features is the absence of a kernel module runtime. The BPF machinery inside the kernel instead mostly handles its runtime. Therefore it would be very helpful if BPF provided us with a way for states to be maintained across multiple invocations of BPF programs and for userspace programs to be able to communicate with BPF programs. This is accomplished by BPF maps. In this blog post, I will introduce BPF maps and explain their role in stapbpf's implementation.
What are BPF maps?
BPF maps are essentially generic data structures consisting of key/value pairs. They are created from userspace using the BPF system call, which returns a file descriptor for the map. The key size and value size are specified by the user, allowing for the storage of key/value pairs with arbitrary types. Once a map is created, elements can be accessed from userspace using the BPF system call. Maps are automatically deallocated once the user process that created the map terminates (although it is possible to force the map to persist longer than this process). Stapbpf uses the following function to create new BPF maps.
int bpf_create_map(enum bpf_map_type map_type, unsigned key_size, unsigned value_size, unsigned max_entries) { union bpf_attr attr; memset(&attr, 0, sizeof(union bpf_attr)); attr.map_type = map_type; attr.key_size = key_size; attr.value_size = value_size; attr.max_entries = max_entries; return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); }
map_type
represents the type of BPF map that will be instantiated. Currently, there are about 15 different map types. I will focus on BPF_MAP_TYPE_HASH
and BPF_MAP_TYPE_ARRAY
, which are the two map types most frequently used in stapbpf. While both types of maps consist of key/value pairs, BPF array map keys must be 32-bit array indices. Compared to BPF HashMaps, BPF array maps have the advantage of faster lookup times. This speed comes with a trade-off, however, as updates to array map elements are non-atomic while updates to HashMap elements are atomic.
From the kernel-side, BPF maps are accessible via BPF helper functions. These functions are callable from within BPF programs and they provide an interface to the maps that closely mirrors the userspace interface. In fact, the map lookup and update operations issued through the BPF system call and the BPF helper functions end up calling the same kernel-side functions. These functions, in turn, call specialized functions based on the type of map being acted on (ex. array_map_lookup_elem
, htab_map_lookup_elem
). Though the following example is a userspace function used in stapbpf, the call to add/update map elements from within kernel-side BPF programs behaves similarly.
int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags) { union bpf_attr attr; memset(&attr, 0, sizeof(union bpf_attr)); attr.map_fd = fd; attr.key = ptr_to_u64(key); attr.value = ptr_to_u64(value); attr.flags = flags; return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); }
One of BPF's primary advantages over other performance analysis tools is its emphasis on safety. This emphasis also applies to the kernel's handling of BPF maps. For example, the kernel keeps track of the number of references to each BPF map and frees a map only when this count reaches 0. Also, the kernel's BPF verifier will reject any BPF program that does not perform the appropriate error checking when calling BPF helper functions that, for example, return pointers to map elements or tries to index the map with an incorrectly sized key.
How are BPF maps used in stapbpf?
BPF maps have a fundamental role in stapbpf's implementation. First, they allow for states to be maintained across multiple probe handlers and across multiple calls to individual probe handlers. This offers us a relatively simple way to implement global variables in stapbpf. Global variables are stored in a HashMap made accessible to any BPF program that references them. The HashMap type is used in order to ensure that any concurrent reads and writes to the map are handled safely.
Maps are also used to implement local variables. Each probe is assigned its own map in which its local variables are stored and isolated from other probes. Arrays are just as simple to implement in stapbpf as most array operations correspond quite naturally to operations on BPF maps. One exception to this is the iteration over the elements of an array. Currently, BPF does not provide a convenient way to iterate over arrays as BPF currently rejects any BPF program that contains loops.
BPF maps play another key role in stapbpf as a communication channel between kernel space BPF programs and stapbpf's userspace runtime. One use of such a channel is to communicate to stapbpf that the exit function has been called from a probe handler. This function updates a value in a map that is periodically checked by stapbpf. Once it sees that this value has been set to true, stapbpf begins to shut down.
For the following stap script, I will describe how BPF maps are used during the script's execution.
#! /usr/bin/stap global total probe kernel.function("vfs_write").return { if ($return > 0) total += $return } probe timer.s(10) { printf("%d bytes written in 10 seconds\n", total) exit() }
This simple script records and prints the total number of bytes written by the vfs_write
function during a ten second period. Stapbpf translates it into two BPF programs: one program for the vfs_write
return probe and another for the timer probe. Two maps are used during the execution of this script. Map 0 stores stapbpf's exit status while map 1 stores the script's global variables (in this case map 1 contains only total
). Both maps are of type HashMap in order to ensure safe concurrent accesses.
Now I will focus on the expression total += $return
and walk through the BPF instructions it is translated into. This would normally include the instructions for reading the current value of $return
, but for simplicity, I'm not going to include this (BPF maps are not involved, instead this value is read using a data structure supplied by BPF at runtime).
0: [read the current value of $return and store it in r6] 1: r1 = addr_of_map1 2: *(u32 *)(r10 - 4) = 0 3: r2 = r10 4: r2 += -4 5: call bpf_map_lookup_elem 6: if r0 == 0x0 goto exit_insn 7: r0 = *(u64 *)(r0 + 0) 8: r0 += r6 9: r1 = addr_of_map1 10: *(u32 *)(r10 - 12) = 0 11: r2 = r10 12: r2 += -12 13: *(u64 *)(r10 - 8) = r0 14: r3 = r10 15: r3 += -8 16: r4 = 0 17: call bpf_map_update_elem
We start with $return
already stored in BPF's register 6, so the next step is to read the current value of total
. To do this the BPF program needs to call bpf_map_lookup_elem
. As arguments, this BPF helper function takes the address of a BPF map and the address of the key whose value we wish to lookup. BPF's ABI dictates that we store these arguments in r1
and r2
prior to calling bpf_map_lookup_elem
. Storing the address of map 1 takes only a single instruction since prior to the loading of this BPF program, the address of map 1 was included in instruction 1 as an immediate value. The key associated with total
happens to be 0, so instructions 2-4 consist of storing 0 on the BPF stack and storing its address in r2
. r10
is used here since it always contains the address of the current stack frame. Instruction 5 consists of a call to bpf_map_lookup_elem
, which stores the address of total
in r0
. However, if an error occurred, 0 will be stored instead. The kernel's BPF verifier requires that we check whether 0 was returned. Instruction 6 performs this check and jumps to the exit instruction if such an error is detected. Now that this address is known to be valid, we store total
in r0
and add $return
to it (instructions 7 and 8).
The next step is to update map 1 with the new value of total. This requires a call to bpf_map_update_elem
. It takes four arguments-addresses of a map, key, value, and "flags" (flags lets us specify whether we are adding a new element or simply updating an existing one). Instructions 9 to 16 perform the steps required to store these arguments in r1
to r4
. The process is very similar to what is described in the previous paragraph. Once complete, bpf_map_update_elem
is called and the total is finally set to the correct value. This process is repeated each time the vfs_write
return probe fires. Once the timer probe fires and finishes printing, exit
sets the exit status to true using bpf_map_update_elem
. From this point on, these two probes will see that the exit status is true using bpf_map_lookup_elem
and simply perform early returns while stapbpf carries out various shutdown procedures.
Future work and BPF alternatives
As stapbpf is in an early stage of development there is still much work to be done improving how BPF maps are utilized. For example, while SystemTap's default backend supports collections such as aggregates, arrays of strings and multi-dimensional arrays, stapbpf currently supports only one-dimensional arrays of scalars. And while stapbpf supports atomic updates of variables and array elements with BPF HashMaps, the default backend supports atomicity at the preferable probe handler level. As a user of stapbpf, BPF maps themselves remain hidden implementation details. If a user wishes to interact more directly with BPF maps, other BPF frontends such as BCC (BPF Compiler Collection) support the development of custom BPF tools using, for example, C and Python.
Whether you are new to Linux or have experience, downloading this cheat sheet can assist you when encountering tasks you haven’t done lately.
Last updated: April 20, 2018