stalld’s BPF Backend: Breaking Free from debugfs

An efficient approach to real-time task starvation detection

May 28, 2026
Clark Williams, Wander Lairson Costa
Related topics: Application development and delivery, Linux
Related products: Red Hat Enterprise Linux

    For years, stalld has been a critical tool for maintaining stability in real-time and CPU-isolated Linux environments. Its core function is to detect and mitigate task starvation, particularly for kernel threads that get stuck waiting for a chance to run on a CPU core dominated by a high-priority, user-space application. The original mechanism for this detection was straightforward but crude: stalld would periodically read and parse the text output of /sys/kernel/debug/sched/debug.

    While functional, this approach was always a compromise. Relying on debugfs meant treating its output as a de facto API, even though it was explicitly designed for debugging and was subject to change without notice. Each significant kernel update brought the risk of format changes that would break our parser, leading to a brittle and reactive development cycle. Furthermore, the polling-based model was inefficient, imposing unnecessary overhead and creating windows where short-lived starvation events could be missed entirely.

    The technical evolution of the Linux kernel, specifically the maturation of BPF and BPF CO-RE (compile once - run everywhere), presented a clear path forward. We could move from a polling, text-parsing model to a lightweight, event-driven one. This post details the design and implementation of the new BPF-based queue_track backend for stalld. This new backend frees stalld from the fragile dependency on debugfs, making it safe to run on Secure Boot and locked-down systems.

    What is stalld?

    To understand the motivation for the BPF backend, you need to understand the problem stalld solves: Task starvation on systems with isolated CPUs. In high-performance computing (HPC) and network function virtualization (NFV) workloads, particularly those using frameworks like DPDK, it's common practice to dedicate one or more CPU cores to a single, high-priority application. This is done by setting the CPU affinity of the task and its threads, effectively creating a private CPU for the application.

    This isolation is excellent for application performance, because it minimizes context switches and cache pollution. However, it creates a subtle but severe problem. The Linux kernel still needs to perform housekeeping tasks on all CPUs, including the isolated ones. Kernel threads (kworkers), migration threads, and other essential services must occasionally run on these cores. When a user-space application is spinning at 100% on an isolated core, these kernel tasks can get "stuck" in the CPU's runqueue, waiting indefinitely for the user-space task to yield. This is task starvation, and it can lead to system instability, soft lockups, and unpredictable behavior.

    The solution stalld offers is to monitor the runqueues of each CPU. When it detects a task that has been waiting for an excessive amount of time, it intervenes. The intervention is a temporary, powerful priority boost. stalld changes the scheduling policy of the starved task to SCHED_DEADLINE, the highest-priority scheduling policy a task can be assigned in Linux. This guarantees the starved task gets to run almost immediately, perform its duties, and then yield the CPU back to the main application. Once the task completes its work and goes to sleep, stalld restores its original scheduling policy.
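
As a rough illustration of the boost, the sketch below fills in SCHED_DEADLINE attributes and applies them with the sched_setattr system call. The struct name dl_attr, the helper names, and the runtime and period values are assumptions made for this example, not stalld's actual code. Restoring the original policy afterward is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define POLICY_DEADLINE 6 /* numeric value of SCHED_DEADLINE in the Linux UAPI */

/* Mirrors the first-version (48-byte) layout of the kernel's struct sched_attr.
 * Named dl_attr here so it cannot clash with a C library that declares sched_attr. */
struct dl_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* CPU time granted per period, in ns */
    uint64_t sched_deadline;  /* relative deadline, in ns */
    uint64_t sched_period;    /* period length, in ns */
};

/* Prepares a deadline reservation of runtime_ns out of every period_ns. */
void fill_boost_attr(struct dl_attr *a, uint64_t runtime_ns, uint64_t period_ns)
{
    assert(a != NULL);
    assert(runtime_ns > 0 && runtime_ns <= period_ns);
    memset(a, 0, sizeof(*a));
    a->size = sizeof(*a);
    a->sched_policy = POLICY_DEADLINE;
    a->sched_runtime = runtime_ns;
    a->sched_deadline = period_ns;   /* deadline equal to the period */
    a->sched_period = period_ns;
}

/* Returns 0 on success, -1 with errno set otherwise (requires CAP_SYS_NICE). */
int boost_task(pid_t pid, uint64_t runtime_ns, uint64_t period_ns)
{
    struct dl_attr a;

    fill_boost_attr(&a, runtime_ns, period_ns);
    return (int)syscall(SYS_sched_setattr, pid, &a, 0U);
}
```

Under these assumptions, boost_task(pid, 20000, 1000000000) would request 20 microseconds of CPU time in every one-second period for the starved task.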

    To make this decision, stalld must collect key information about each task in a CPU's runqueue. At its core, this data is simple:

    struct task_info {
        pid_t pid;
        int tgid;
        int prio;
        time_t since;  // When task entered runqueue
        long ctxsw;
    };

    The pid identifies the task, and since records when it entered the runqueue, which tells stalld how long it has been waiting. The challenge, and the subject of this post, has always been how to collect this data efficiently and reliably.
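
To make the role of these fields concrete, here is a minimal sketch of a starvation test built on task_info. It assumes that ctxsw holds the task's context-switch count at the time it was queued and that the threshold is expressed in seconds. The function name is_starving and its parameters are hypothetical, not taken from stalld.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

struct task_info {
    pid_t pid;
    int tgid;
    int prio;
    time_t since;  /* when the task entered the runqueue */
    long ctxsw;    /* assumed: context-switch count seen at that time */
};

/* True when the task has not run (same context-switch count) and has
 * been queued for at least threshold_sec seconds. */
bool is_starving(const struct task_info *t, long ctxsw_now,
                 time_t now, time_t threshold_sec)
{
    assert(t != NULL);
    if (ctxsw_now != t->ctxsw)
        return false;                      /* the task made progress */
    return now - t->since >= threshold_sec;
}
```

Comparing the context-switch count guards against flagging a task that ran and was queued again between two checks.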

    The old way: The sched_debug backend

    The original stalld implementation, now known as the sched_debug backend, relied on the only interface available at the time that exposed the contents of CPU runqueues: The scheduler's debug file. This file, located at /proc/sched_debug on older kernels or /sys/kernel/debug/sched/debug on newer ones, provides a human-readable snapshot of the scheduler's state for the entire system.

    The operational loop was simple:

    1. Open and read the entire contents of sched/debug into a buffer.
    2. Iterate through the buffer line-by-line.
    3. Use a complex string-parsing state machine to identify CPU sections and the tasks listed within their runqueues.
    4. For each task, extract the relevant fields (PID, priority, and so on).
    5. If a task's wait time exceeded a configured threshold, then flag it for boosting.
    6. Sleep for a polling interval, and repeat.
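
The parsing steps above can be sketched roughly as follows. The input format used here (a cpu#N header followed by "task PID prio P wait S" lines) is a made-up simplification, not the real sched/debug layout, and count_starving is a hypothetical helper. The point is only to show how tightly such code is tied to the exact text it expects.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counts tasks listed under `cpu` whose wait time is at least `threshold`.
 * Assumed, simplified input format; not the real sched/debug layout. */
int count_starving(const char *buf, int cpu, long threshold)
{
    int current_cpu = -1;
    int found = 0;

    assert(buf != NULL);
    while (*buf) {
        const char *eol = strchr(buf, '\n');
        size_t full = eol ? (size_t)(eol - buf) : strlen(buf);
        size_t len = full;
        char line[128];
        int id, pid, prio;
        long wait;

        /* Copy one line into a bounded, NUL-terminated scratch buffer. */
        if (len >= sizeof(line))
            len = sizeof(line) - 1;
        memcpy(line, buf, len);
        line[len] = '\0';

        if (sscanf(line, "cpu#%d", &id) == 1)
            current_cpu = id;              /* entered a new CPU section */
        else if (current_cpu == cpu &&
                 sscanf(line, " task %d prio %d wait %ld",
                        &pid, &prio, &wait) == 3 &&
                 wait >= threshold)
            found++;                       /* task waited too long */

        buf += full + (eol ? 1 : 0);       /* advance past the newline */
    }
    return found;
}
```

Any change to a header or column in the text would make the sscanf patterns stop matching without raising an error, which is the fragility discussed next.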

    This approach had several significant drawbacks. The most immediate one was the lack of good tooling to handle kernel evolution. Kernel internals change between versions — this is true whether you're parsing text from debugfs or reading kernel data structures directly. The difference lies in how you manage those changes. With debugfs, we had no choice but to implement multiple parsers and runtime format detection. Over the years, the sched/debug format changed multiple times: we needed one parser for kernel 3.x, another for 4.18, and yet another for 6.12. The code accumulated version-specific if/else branches, string patterns, and detection logic. Each new kernel version risked breaking stalld until we manually patched it.

    The second major issue was performance. The polling-based approach is inherently inefficient. At every interval, stalld would wake up, read a potentially large text file from the kernel, and spend CPU cycles parsing thousands of lines of text, most of which described tasks that were not starving. This overhead scales linearly with the number of tasks on the system, making it less suitable for machines with high process counts.

    Finally, the snapshot-in-time nature of polling is a fundamental limitation. If a kernel thread was enqueued and subsequently dequeued between two stalld polls, its starvation event would be completely missed. For short but critical housekeeping tasks, this visibility gap was a known risk that we had to accept. The old way worked, but it was far from ideal.

    The new way: The BPF queue_track backend

    The new queue_track backend replaces the fragile, polling-based sched_debug parser with a modern, event-driven architecture built on BPF. This represents a fundamental shift in how stalld gathers data.

    Why BPF?

    BPF allows us to run sandboxed, custom programs within the Linux kernel itself. Instead of asking the kernel for a giant text dump and parsing it in user space, we can now place small, highly efficient BPF programs at strategic points in the kernel's scheduler code. These programs extract only the data we need, precisely when we need it.

    The advantages of this approach are numerous:

    • Event-driven: We attach BPF programs to scheduler tracepoints. These programs run automatically when scheduler events occur (like a task waking up or a context switch happening). This removes polling from data collection, so runqueue changes are recorded as they happen instead of being sampled between intervals.
    • Structured data: The BPF program writes the data it collects into a shared memory structure called a BPF map. The stalld user-space daemon simply reads from this structured map. The communication between kernel and user space is clean, binary, and efficient.
    • Low overhead: The BPF programs are JIT-compiled into native machine code and are extremely fast. They perform minimal work — just a few data reads and a write to a map. The overhead on the scheduler is negligible, and the CPU usage of the stalld daemon is dramatically reduced.
    • Portability with BPF CO-RE: Kernel data structures change between versions — this is the same fundamental challenge we faced with debugfs, but BPF CO-RE provides far better tooling to handle it. CO-RE uses BPF type format (BTF) information, which is now included with most modern kernels, to understand the layout of kernel structures at BPF program load time. The BPF loader automatically patches field offsets and handles structural differences. This allows a single, compiled BPF program to run correctly across a wide range of kernel versions. The source code stays clean — no version-specific parsers, no runtime detection logic, no #ifdef maze.

    Tracepoint architecture

    To accurately track a task's time in a runqueue, we need to monitor its entire lifecycle from the scheduler's perspective. We achieved this by attaching BPF programs to a few key tracepoints:

    • sched_wakeup / sched_wakeup_new: These tracepoints are triggered when a task transitions into a runnable state. This is our primary entry point. When a task wakes up, stalld records its PID and the current timestamp in our BPF map, marking the beginning of its potential wait time.
    • sched_switch: This is the most important tracepoint. It fires on every context switch. When a task is switched out (preempted), stalld checks whether it's still runnable. If so, it's going back into the runqueue, and we start timing it. When a task is switched in (begins executing), we know it's no longer waiting, so we remove it from our tracking map.
    • sched_process_exit: When a task exits, stalld must remove any associated tracking data from our maps to prevent stale entries. This tracepoint allows for clean-up.
    • sched_migrate_task: Tasks can be moved between CPU runqueues. This tracepoint allows our BPF program to update its per-CPU data structures to reflect the task's new location, ensuring that our accounting remains accurate.
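
The lifecycle described by these tracepoints can be modeled in plain user-space C. The sketch below is not BPF code and not stalld's implementation: the fixed-size table, the struct, and the handler names are all assumptions chosen to show which event starts, stops, or moves a task's wait timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

struct tracked {
    bool used;
    int pid;
    int cpu;
    uint64_t since_ns;   /* when the task became runnable */
};

static struct tracked table[MAX_TRACKED];

static struct tracked *lookup(int pid)
{
    for (size_t i = 0; i < MAX_TRACKED; i++)
        if (table[i].used && table[i].pid == pid)
            return &table[i];
    return NULL;
}

static void track(int pid, int cpu, uint64_t now_ns)
{
    if (lookup(pid))
        return;                    /* already waiting: keep first timestamp */
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (!table[i].used) {
            table[i] = (struct tracked){ true, pid, cpu, now_ns };
            return;
        }
    }
    /* Table full: this toy model silently drops the task. */
}

static void untrack(int pid)
{
    struct tracked *t = lookup(pid);
    if (t)
        t->used = false;
}

/* sched_wakeup / sched_wakeup_new: the task becomes runnable. */
void on_wakeup(int pid, int cpu, uint64_t now_ns) { track(pid, cpu, now_ns); }

/* sched_switch: next starts running; prev waits again if still runnable. */
void on_switch(int prev_pid, bool prev_runnable, int next_pid,
               int cpu, uint64_t now_ns)
{
    untrack(next_pid);
    if (prev_runnable)
        track(prev_pid, cpu, now_ns);
}

/* sched_process_exit: drop any stale entry. */
void on_process_exit(int pid) { untrack(pid); }

/* sched_migrate_task: the waiting task now belongs to another runqueue. */
void on_migrate(int pid, int dest_cpu)
{
    struct tracked *t = lookup(pid);
    if (t)
        t->cpu = dest_cpu;
}

/* Time spent waiting so far, or 0 if the task is not tracked. */
uint64_t waiting_ns(int pid, uint64_t now_ns)
{
    struct tracked *t = lookup(pid);
    if (!t)
        return 0;
    assert(now_ns >= t->since_ns);
    return now_ns - t->since_ns;
}

/* CPU whose runqueue holds the task, or -1 if the task is not tracked. */
int tracked_cpu(int pid)
{
    struct tracked *t = lookup(pid);
    return t ? t->cpu : -1;
}
```

In this model a task's timer runs from the moment it becomes runnable until it is switched in, which is the interval a starvation detector needs to measure.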

    Key code examples

    The queue_track backend is implemented as a module that conforms to a generic stalld backend interface. This design allows users to switch between the new and old backends easily.

    The interface is defined as follows:

    struct stalld_backend {
      int (*init)(void);
      int (*get)(char *buffer, int size);
      int (*get_cpu)(char *buffer, int size, int cpu);
      int (*parse)(struct cpu_info *cpu_info, char *buffer, size_t buffer_size);
      int (*has_starving_task)(struct cpu_info *cpu);
      void (*destroy)(void);
    };
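
As an illustration of how such an interface makes backends interchangeable, the sketch below selects an implementation by name from a small table. The registry, the stub callbacks, and select_backend are hypothetical and exist only for this example. The article does not show how stalld itself performs the selection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cpu_info;   /* opaque here; the real definition lives in stalld */

struct stalld_backend {
  int (*init)(void);
  int (*get)(char *buffer, int size);
  int (*get_cpu)(char *buffer, int size, int cpu);
  int (*parse)(struct cpu_info *cpu_info, char *buffer, size_t buffer_size);
  int (*has_starving_task)(struct cpu_info *cpu);
  void (*destroy)(void);
};

/* Hypothetical stubs standing in for real backend callbacks. */
static int stub_init(void) { return 0; }
static void stub_destroy(void) { }

static const struct stalld_backend demo_queue_track = {
  .init = stub_init, .destroy = stub_destroy,
};
static const struct stalld_backend demo_sched_debug = {
  .init = stub_init, .destroy = stub_destroy,
};

/* Hypothetical name-to-backend registry. */
struct backend_entry {
  const char *name;
  const struct stalld_backend *ops;
};

static const struct backend_entry registry[] = {
  { "queue_track", &demo_queue_track },
  { "sched_debug", &demo_sched_debug },
};

/* Returns the backend registered under `name`, or NULL if unknown. */
const struct stalld_backend *select_backend(const char *name)
{
  assert(name != NULL);
  for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
    if (strcmp(registry[i].name, name) == 0)
      return registry[i].ops;
  return NULL;
}
```

Because callers only ever go through the function pointers, the rest of the daemon does not need to know which backend produced the data.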

    The power of BPF CO-RE is best demonstrated by how we handle changes in core kernel structures like task_struct. For instance, the member pointing to the task's CPU changed its location inside a substructure across kernel versions. With CO-RE, we can write portable code to access it without any compile-time branching:

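    // Layout in which cpu (and the state field) are direct members of task_struct.
    struct task_struct___legacy {
      int cpu;
      unsigned int state;
    };

    // Not shown in the original listing; assumed to mirror the field being read.
    // CO-RE ignores the ___legacy suffix when matching against kernel types.
    struct thread_info___legacy {
      unsigned int cpu;
    };

    static inline int task_cpu(const struct task_struct *p)
    {
      // lp views the task_struct itself; lt views its embedded thread_info
      const struct task_struct___legacy *lp = (const void *) p;
      const struct thread_info___legacy *lt = (const void *) &p->thread_info;

      return bpf_core_field_exists(lp->cpu)
        ? BPF_CORE_READ(lp, cpu)    // cpu is a direct member of task_struct
        : BPF_CORE_READ(lt, cpu);   // cpu lives in task_struct.thread_info
    }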

    At load time, the BPF loader checks which field exists in the target kernel's BTF and patches the code accordingly, so only the matching accessor is ever executed.

    The heart of the BPF side is the map that shares data with user space. We use an array map, where each entry in the array corresponds to a CPU core and holds the data for tasks waiting on that core.

    struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1024); // Max CPUs supported
      __type(key, u32);
      __type(value, struct stalld_cpu_data);
    } stalld_per_cpu_data SEC(".maps");

    The user-space stalld process only needs to read the stalld_cpu_data struct for the CPU it is interested in. Although the queue_track backend collects data on scheduler events, the user-space side of stalld still polls the map periodically, which keeps its main loop compatible with the sched_debug backend. All the complex task-tracking logic is handled efficiently inside the kernel by our BPF programs.

    Practical usage

    The queue_track backend is now the default on all supported architectures. To verify which backend is active, run stalld in verbose mode (this shows queue_track on supported systems):

    stalld -v

    For debugging or comparison purposes, you can manually force stalld to use a specific backend with the -b flag. For example, to force the BPF backend:

    stalld -b queue_track

    To force the legacy debugfs backend:

    stalld -b sched_debug

    The BPF backend depends on the kernel being built with BTF support, which is necessary for BPF CO-RE. Most modern distributions enable this by default. You can check whether your system has BTF info with this command:

    ls /sys/kernel/btf/vmlinux
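
The same check can be made from C, for example in a wrapper that decides which backend name to pass to the -b flag. This is a minimal sketch: btf_available and preferred_backend are hypothetical helpers, and stalld's own selection logic may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 1 when the BTF blob at `path` is readable, 0 otherwise. */
int btf_available(const char *path)
{
    assert(path != NULL);
    return access(path, R_OK) == 0;
}

/* Chooses a backend name based on whether BTF information is present. */
const char *preferred_backend(const char *btf_path)
{
    return btf_available(btf_path) ? "queue_track" : "sched_debug";
}
```

Calling preferred_backend("/sys/kernel/btf/vmlinux") yields the name to try first. A readable file only shows that BTF is present, not that BPF programs are permitted to load.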

    If the file /sys/kernel/btf/vmlinux exists, then the queue_track backend should work. The current support matrix looks like this:

    Architectures with BPF support (using the queue_track backend by default):

    • x86_64
    • aarch64
    • s390x
    • riscv64
    • ppc64le

    Architectures without BPF support (using the sched_debug backend):

    • i686

    The i686 architecture is typically limited by the availability of BPF toolchain support (like libbpf) in its standard build environments, not a fundamental incompatibility with BPF itself.

    Performance comparison

    The theoretical benefits of the BPF approach are borne out in practice. The queue_track backend is superior to sched_debug across every meaningful metric:

    Metric               sched_debug (Old)      queue_track (New)
    Data Source          Text file (debugfs)    Binary BPF map
    Collection Model     Polling                Event-driven
    Parsing Overhead     O(n) string parsing    O(1) map lookup
    Missed Events        Possible               None
    debugfs Dependency   Yes                    No

    The most significant gains come from the shift to an event-driven model and the elimination of text parsing from stalld's main loop. The daemon simply reads structured data from the BPF map when it needs to check for starving tasks. The removal of the debugfs dependency also makes stalld more robust and deployable in environments where debugfs might be unavailable.

    Conclusion

    The new queue_track backend marks a significant milestone for the stalld project. By leveraging BPF and BPF CO-RE, we have replaced a fragile, inefficient mechanism with a solution that is purpose-built for high-performance system monitoring. The benefits are clear: Near-zero overhead, complete visibility into scheduler events, and cross-kernel portability that was previously unimaginable.

    This work eliminates the long-standing technical debt of parsing debugfs and provides a solid foundation for future enhancements. BPF CO-RE, in particular, has proven to be the key technology for writing portable BPF applications, allowing us to support a wide range of kernel versions without resorting to a maze of conditional compilation flags.

    The new backend is the default on most major architectures. We encourage users to try it out and report any issues. This evolution ensures that stalld remains a reliable and efficient tool for protecting real-time and high-performance workloads on Linux for years to come.
