Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Measuring BPF performance: Tips, tricks, and best practices

June 22, 2022
Dmitrii Dolgov Giles Hutton
Related topics:
Linux
Related products:
Red Hat Enterprise Linux

    Measuring stuff is harder than you may think. Even in math, there is such a thing as non-measurable sets and Banach-Tarski paradox, which shows that a three-dimensional ball of radius 1 can be dissected into 5 parts, which can be magically reassembled to form two balls of radius 1. Now imagine what kind of mess we've got in software engineering.

    This may sound somewhat philosophical, so let's get down to earth and talk about BPF programs. BPF, which originally stood for Berkeley Packet Filter, is a general-purpose execution engine that can be used for a variety of purposes, including networking, observability, and security. What we'll be discussing in this article was initially called extended BPF (eBPF) to differentiate from the classic BPF, but now this technology is often referred to as just BPF, and that's what we'll call it here.

    Having full visibility throughout the systems we build is a well established best practice. Usually, one knows which metrics to collect, and how and what to profile or instrument to understand why the system exhibits certain levels of performance. But all of this becomes more challenging as soon as the BPF layer is included. Despite years of development, it is still a black box in many ways.

    But do not despair: this article will show, with some amount of creativity, that one can reveal what is going on under the hood. The article will begin with a userspace service that uses BPF programs attached to various syscalls to inspect system activity. It does the job, but what we really want is to understand the overhead of the BPF programs themselves. How do you look inside? That's the problem that we will ultimately tackle.

    Anatomy of a system call

    There is a tradition of capturing explanations of system calls (syscalls) under the title "anatomy of a system call," which we diligently follow. Since we acquire most of the information from syscalls, the first step is understanding how they work.

    Each architecture application binary interface (ABI) has requirements influencing how system calls work. Let’s limit our discussion to x86_64. The standard ABI for how x86_64 is to put the system call number into the RAX register and the other parameters into specific registers, then issue the SYSCALL instruction, goes something like this:

    SYSCALL invokes an OS system call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR after saving the address of the instruction following syscall into RCX. The WRMSR instruction ensures that the IA32_LSTAR MSR always contains a canonical address.

    At system startup, the kernel provides a custom entry_SYSCALL_64 function for handling syscalls and writes the address of this handler function to the MSR_LSTAR register. The final step is to call the function pointer from the system call table using the number stored in RAX.

    Now, there could be a set of hooks specified, to be invoked before the execution of the actual syscall. It could be a custom probe registered for the corresponding static tracepoint or dynamic kprobe. We need to use them to instrument our application. Here is where things get tricky.

    Who traces the tracer?

    The main problem with measuring performance is measuring the right thing. In the case of measuring the performance of a program responsible for tracing another, this can be quite challenging. Consider sys_enter and sys_exit tracepoints that fire before or after a syscall is executed. We have BPF programs performing some activity when triggering those tracepoints. The exercise is to measure time spent in those programs.

    There is some machinery in the kernel itself to collect statistics about BPF program runtime that give us cumulative time and number of runs for a particular program:

    
    $ sysctl -w kernel.bpf_stats_enabled=1
    $ bpftool prog
    [...] 379: raw_tracepoint [...] run_time_ns 35875602162 run_cnt 160512637
    
    

    Such aggregated numbers are good for sanity checks; but usually, they are not enough.

    Another naive approach is to follow the same strategy as instrumenting normal userspace applications. Trace the same points, looking at any difference in performance (i.e., overall syscall latency) between the tracing program running and not running. All things considered, this approach could work, but only if you can guarantee that the performance-measuring probe is executed at the right time: before the start of the tracing program, and after it finishes (see Figure 1).

    Two BPF programs are attached to the sys_enter/sys_exit tracing points of the syscall
    Figure 1: Two BPF programs are attached to the sys_enter/sys_exit tracing points of the syscall.

    But alas, this will never happen because BPF programs are designed to be independent. So it barely makes sense to talk about their ordering.

    Given the difficulty of getting the execution timing right, can we measure from a different point? Our first thought was to measure inside the probes themselves instead. That way, we could ignore all the cruft of the kernel running the syscall and focus on our execution. If we can measure that correctly, then that should give us the overhead. Simple, right? In theory, yes. But in practice, it was a bit more complicated.

    The main difficulty was getting the information out of the BPF probe or the kernel module. Initially, we tried a quick and dirty approach of writing a message to the trace pipe and then consuming and processing those events to get the data.

    
     //somewhere inside your BPF prog
    bpf_trace_printk("Start timestamp: %lld", ts);
    
    
    
    $ cat /sys/kernel/debug/tracing/trace_pipe
    $ bpftool prog tracelog
    
    

    This gave us promising initial results. However, it was not practical because it introduced significant measurement overhead.

    Another point to measure could be a kernel function on the way to the syscall. We wondered whether there was kernel code around the syscall that we can attach to instead, guaranteeing the desirable probe execution order. By examining stacktraces captured via ftrace, we found that there is a small function that does just that. On amd64, it is called do_syscall_64. It is slightly higher in the call stack than sys_enter and sys_exit, so we can measure latency from the start of do_syscall_64 to the end of the function, which will include any overhead from our probes. Is this ideal, or not?

    Tale of two probes

    One minor inconvenience with do_syscall_64 is that on kernels newer than 5.6, it can’t be used for this measurement because of a notrace marking. In this sense, we end up at the mercy of kernel internals. But there are more annoying problems along the way.

    When we first fired up the measurements using kprobe and kretprobe to attach to the start/finish, we noticed that while the kprobe fired consistently, there was only a small number of kretprobe executions. Obviously, this didn’t make any sense because the calls were definitely complete, and there was no problem in userspace indicating that the syscalls were running correctly. Luckily, the kernel documentation provided some answers.

    A kprobe implements a kretprobe on the function entry. That probe stores the return address of the function and overwrites it with a trampoline address. When the function returns, the trampoline calls into the user-provided probe (i.e., our performance measuring BPF program). So why don’t we see all the returns?

    While the probed function is executing, an object of type kretprobe_instance stores its return address. Before calling register_kretprobe(), the user sets the maxactive field of the kretprobe struct to specify how many instances of simultaneous probing for the specified function can be probed simultaneously. register_kretprobe() pre-allocates the indicated number of kretprobe_instance objects.

    The problem is that there is a limit to the number of simultaneously stored return addresses. If you're running on a very busy codepath (like on every system call), then you're going to fill up this list very quickly and inevitably drop many events. The first experiments utilized bpftrace, which uses the perf API and does not support specifying maxactive. However, configuration is possible in a custom bcc script. In our case, we had to set it to a large value to capture all the events that occurred.

    If you are curious, there is an alternative approach. The functions kfunc and kretfunc in bpftrace use the fentry/fexit mechanism and BPF trampolines to do the same process, and they are not affected by kprobe machinery with maxactive.

    When you find a good solution, look for a better one

    This solution is good, but not perfect. It is a reasonable way to collect the measurements. It covers not only BPF programs but also the underlying syscall. This brings a lot of variance and annoying noise into the captured data. Is there a way to make the measurements narrower?

    The answer: BPF Type Format (BTF).

    Number of abbreviations is going down
    Figure 2: Shows the number of abbreviations going down.
    Figure 2: With the introduction of BTF, there is one fewer three-letter abbreviation available.

    When BTF information is available for the BPF program, the kernel JIT compiler emits instructions for providing fentry/fexit (seems like jumps 0xE9 ). We could use those to attach two more monitoring BPF programs to the start/finish of our target BPF program and collect various events in between. In this way, one measures only a particular BPF program without including anything else. In fact, bpftool prog profile and new perf stat -b work in the same way.

    Profiling BPF programs

    So far, this discussion has revolved around getting counters between two points. But we can profile our BPF programs to understand what is happening at the instruction level.

    We can profile a BPF program as part of any other kernel activity, but there seems to be no way to profile only one program. However, there is one possible approach to acquiring the tag of the BPF program: record samples of kernel activity, then annotate only those parts.

    
    $ bpftool prog id 5
    5: tracepoint name <tp_name> tag 19c24c00f06f9dce gpl
    $ perf record ...
    $ perf annotate -l -P bpf_prog19c24c00f06f9dce
    [...]
    // bpf_prog_19c24c00f06f9dce[48]
    13.56 :   48:callq  0xffffffffc5dfe8cc
     0.00 :   4d:mov -0x8(%rbp),%edi
     0.00 :  50:and  $0x2,%rdi
     0.00 :  54:test %rdi,%rdi
    [...]
    
    

    The resulting annotated instructions match exactly to the JITed dump of the BPF program from bpftool.

    Summary

    Hopefully, the information described here helps to shed some light on the darkest corners of your system. There is still much more to learn and develop, but a journey of a thousand miles begins with a first step.

    Recent Posts

    • Debugging image mode with Red Hat OpenShift 4.20: A practical guide

    • EvalHub: Because "looks good to me" isn't a benchmark

    • SQL Server HA on RHEL: Meet Pacemaker HA Agent v2 (tech preview)

    • Deploy with confidence: Continuous integration and continuous delivery for agentic AI

    • Every layer counts: Defense in depth for AI agents with Red Hat AI

    What’s up next?

    Advanced Linux Commands tile card - updated

    The Advanced Linux Commands cheat sheet presents a collection of Linux commands and executables for developers who are using the Linux operating system in advanced programming scenarios. Commands are organized by category and each one is presented with syntax, an explanation of what it is used for, and an example.

    Get the cheat sheet
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.