For years, stalld has been a critical tool for maintaining stability in real-time and CPU-isolated Linux environments. Its core function is to detect and mitigate task starvation, particularly for kernel threads that get stuck waiting for a chance to run on a CPU core dominated by a high-priority, user-space application. The original mechanism for this detection was straightforward but crude: stalld would periodically read and parse the text output of /sys/kernel/debug/sched/debug.
While functional, this approach was always a compromise. Relying on debugfs meant treating its output as a de facto API, even though it was explicitly designed for debugging and was subject to change without notice. Each significant kernel update brought the risk of format changes that would break our parser, leading to a brittle and reactive development cycle. Furthermore, the polling-based model was inefficient, imposing unnecessary overhead and creating windows where short-lived starvation events could be missed entirely.
The technical evolution of the Linux kernel, specifically the maturation of BPF and BPF CO-RE (compile once - run everywhere), presented a clear path forward. We could move from a polling, text-parsing model to a lightweight, event-driven one. This post details the design and implementation of the new BPF-based queue_track backend for stalld. This new backend frees stalld from the fragile dependency on debugfs, making it safe to run on Secure Boot and locked-down systems.
What is stalld?
To understand the motivation for the BPF backend, you need to understand the problem stalld solves: Task starvation on systems with isolated CPUs. In high-performance computing (HPC) and network function virtualization (NFV) workloads, particularly those using frameworks like DPDK, it's common practice to dedicate one or more CPU cores to a single, high-priority application. This is done by setting the CPU affinity of the task and its threads, effectively creating a private CPU for the application.
This isolation is excellent for application performance, because it minimizes context switches and cache pollution. However, it creates a subtle but severe problem. The Linux kernel still needs to perform housekeeping tasks on all CPUs, including the isolated ones. Kernel threads (kworkers), migration threads, and other essential services must occasionally run on these cores. When a user-space application is spinning at 100% on an isolated core, these kernel tasks can get "stuck" in the CPU's runqueue, waiting indefinitely for the user-space task to yield. This is task starvation, and it can lead to system instability, soft lockups, and unpredictable behavior.
The solution stalld offers is to monitor the runqueues of each CPU. When it detects a task that has been waiting for an excessive amount of time, it intervenes. The intervention is a temporary, powerful priority boost. stalld changes the scheduling policy of the starved task to SCHED_DEADLINE, the highest-priority scheduling policy a task can be assigned in Linux. This guarantees the starved task gets to run almost immediately, perform its duties, and then yield the CPU back to the main application. Once the task completes its work and goes to sleep, stalld restores its original scheduling policy.
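To make the boost concrete, here is a minimal sketch of how a task could be switched to SCHED_DEADLINE through the sched_setattr system call. It is not stalld's actual code: the struct boost_attr mirror of the kernel's sched_attr layout, the helper names fill_deadline_attr and boost_task, and the runtime and period values mentioned afterwards are assumptions chosen for illustration.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Local mirror of the first (48-byte) version of the kernel's struct
 * sched_attr. Named differently to avoid clashing with libc headers. */
struct boost_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* nanoseconds of CPU time per period */
    uint64_t sched_deadline;  /* nanoseconds, relative deadline */
    uint64_t sched_period;    /* nanoseconds */
};

/* Hypothetical helper: describe a deadline reservation whose deadline
 * equals its period. */
void fill_deadline_attr(struct boost_attr *attr,
                        uint64_t runtime_ns, uint64_t period_ns)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->sched_policy = SCHED_DEADLINE;
    attr->sched_runtime = runtime_ns;
    attr->sched_deadline = period_ns;
    attr->sched_period = period_ns;
}

/* Hypothetical helper: apply the reservation to one task.
 * Returns 0 on success, -1 with errno set on failure. */
int boost_task(pid_t pid, uint64_t runtime_ns, uint64_t period_ns)
{
    struct boost_attr attr;

    fill_deadline_attr(&attr, runtime_ns, period_ns);
    return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

Under these assumptions, boost_task(pid, 20000, 1000000000) would request 20 microseconds of runtime in every one-second period. The call needs CAP_SYS_NICE (typically root), and a real daemon would also save the task's previous policy so that it can be restored afterwards.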
To make this decision, stalld must collect key information about each task in a CPU's runqueue. At its core, this data is simple:
struct task_info {
    pid_t pid;
    int tgid;
    int prio;
    time_t since; // When task entered runqueue
    long ctxsw;
};
The pid identifies the task, and since records when it entered the runqueue, which tells stalld how long it has been waiting. The challenge, and the subject of this post, has always been how to collect this data efficiently and reliably.
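As a rough illustration of how the since field turns into a starvation decision, the sketch below computes a task's wait time and compares it with a threshold. The helper names wait_seconds and is_starving are hypothetical, and the sketch assumes since is expressed in seconds; stalld's real check may take more fields (such as ctxsw) into account.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

struct task_info {
    pid_t  pid;
    int    tgid;
    int    prio;
    time_t since;   /* when the task entered the runqueue */
    long   ctxsw;
};

/* Hypothetical helper: seconds the task has been waiting at time `now`.
 * Clamps to 0 if the clock reads earlier than the enqueue time. */
long wait_seconds(const struct task_info *t, time_t now)
{
    return now > t->since ? (long)(now - t->since) : 0;
}

/* Hypothetical helper: true once the wait reaches the threshold. */
bool is_starving(const struct task_info *t, time_t now, long threshold_s)
{
    return wait_seconds(t, now) >= threshold_s;
}
```

With since set to 100 and the current time at 160, a 60-second threshold would flag the task, while a 61-second threshold would not.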
The old way: The sched_debug backend
The original stalld implementation, now known as the sched_debug backend, relied on the only interface available at the time that exposed the contents of CPU runqueues: The scheduler's debug file. This file, located at /proc/sched_debug or, more commonly, /sys/kernel/debug/sched/debug, provides a human-readable snapshot of the scheduler's state for the entire system.
The operational loop was simple:
- Open and read the entire contents of sched/debug into a buffer.
- Iterate through the buffer line-by-line.
- Use a complex string-parsing state machine to identify CPU sections and the tasks listed within their runqueues.
- For each task, extract the relevant fields (PID, priority, and so on).
- If a task's wait time exceeded a configured threshold, then flag it for boosting.
- Sleep for a polling interval, and repeat.
This approach had several significant drawbacks. The most immediate one was the lack of good tooling to handle kernel evolution. Kernel internals change between versions — this is true whether you're parsing text from debugfs or reading kernel data structures directly. The difference lies in how you manage those changes. With debugfs, we had no choice but to implement multiple parsers and runtime format detection. Over the years, the sched/debug format changed multiple times: we needed one parser for kernel 3.x, another for 4.18, and yet another for 6.12. The code accumulated version-specific if/else branches, string patterns, and detection logic. Each new kernel version risked breaking stalld until we manually patched it.
The second major issue was performance. The polling-based approach is inherently inefficient. At every interval, stalld would wake up, read a potentially large text file from the kernel, and spend CPU cycles parsing thousands of lines of text, most of which described tasks that were not starving. This overhead scales linearly with the number of tasks on the system, making it less suitable for machines with high process counts.
Finally, the snapshot-in-time nature of polling is a fundamental limitation. If a kernel thread was enqueued and subsequently dequeued between two stalld polls, its starvation event would be completely missed. For short but critical housekeeping tasks, this visibility gap was a known risk that we had to accept. The old way worked, but it was far from ideal.
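The visibility gap can be expressed as simple arithmetic. The sketch below assumes an idealized poller that samples at exact multiples of a fixed interval; the function name seen_by_poll and the millisecond units are illustrative assumptions, not part of stalld.

```c
#include <stdbool.h>

/* Returns true if some poll instant k * interval_ms (k >= 0) falls inside
 * the half-open window [enqueued_ms, dequeued_ms) during which the task sat
 * in the runqueue. Assumes non-negative times and interval_ms > 0. */
bool seen_by_poll(long enqueued_ms, long dequeued_ms, long interval_ms)
{
    /* First poll instant at or after the enqueue time. */
    long first_poll =
        ((enqueued_ms + interval_ms - 1) / interval_ms) * interval_ms;

    return first_poll < dequeued_ms;
}
```

With a 1000 ms interval, a task that waits from 1200 ms to 1900 ms is never observed, because the next sample happens at 2000 ms. The same task would be seen if it were still queued at 2100 ms.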
The new way: The BPF queue_track backend
The new queue_track backend replaces the fragile, polling-based sched_debug parser with a modern, event-driven architecture built on BPF. This represents a fundamental shift in how stalld gathers data.
Why BPF?
BPF allows us to run sandboxed, custom programs within the Linux kernel itself. Instead of asking the kernel for a giant text dump and parsing it in user space, we can now place small, highly efficient BPF programs at strategic points in the kernel's scheduler code. These programs extract only the data we need, precisely when we need it.
The advantages of this approach are numerous:
- Event-driven: We attach BPF programs to scheduler tracepoints. These programs run automatically when scheduler events occur (like a task waking up or a context switch happening). The kernel side no longer depends on periodic snapshots, so no enqueue or dequeue event goes unrecorded.
- Structured data: The BPF program writes the data it collects into a shared memory structure called a BPF map. The stalld user-space daemon simply reads from this structured map. The communication between kernel and user space is clean, binary, and efficient.
- Low overhead: The BPF programs are JIT-compiled into native machine code and are extremely fast. They perform minimal work: just a few data reads and a write to a map. The overhead on the scheduler is negligible, and the CPU usage of the stalld daemon is dramatically reduced.
- Portability with BPF CO-RE: Kernel data structures change between versions. This is the same fundamental challenge we faced with debugfs, but BPF CO-RE provides far better tooling to handle it. CO-RE uses BPF type format (BTF) information, which is now included with most modern kernels, to understand the layout of kernel structures at BPF program load time. The BPF loader automatically patches field offsets and handles structural differences. This allows a single, compiled BPF program to run correctly across a wide range of kernel versions. The source code stays clean: no version-specific parsers, no runtime detection logic, no #ifdef maze.
Tracepoint architecture
To accurately track a task's time in a runqueue, we need to monitor its entire lifecycle from the scheduler's perspective. We achieved this by attaching BPF programs to a few key tracepoints:
- sched_wakeup / sched_wakeup_new: These tracepoints are triggered when a task transitions into a runnable state. This is our primary entry point. When a task wakes up, stalld records its PID and the current timestamp in our BPF map, marking the beginning of its potential wait time.
- sched_switch: This is the most important tracepoint. It fires on every context switch. When a task is switched out (preempted), stalld checks whether it's still runnable. If so, it's going back into the runqueue, and we start timing it. When a task is switched in (begins executing), we know it's no longer waiting, so we remove it from our tracking map.
- sched_process_exit: When a task exits, stalld must remove any associated tracking data from our maps to prevent stale entries. This tracepoint allows for clean-up.
- sched_migrate_task: Tasks can be moved between CPU runqueues. This tracepoint allows our BPF program to update its per-CPU data structures to reflect the task's new location, ensuring that our accounting remains accurate.
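The lifecycle described by these tracepoints can be modeled in ordinary user-space C. The sketch below is a simulation of the bookkeeping only, not the BPF programs themselves: the table layout, the on_* handler names, and their parameters are all hypothetical, and a real BPF implementation would use maps and tracepoint context structures instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

struct tracked_task {
    int      pid;       /* 0 marks a free slot (pid 0 is never tracked) */
    int      cpu;
    uint64_t since_ns;  /* when the task became runnable */
};

static struct tracked_task table[MAX_TRACKED];

static struct tracked_task *find_slot(int pid)
{
    for (int i = 0; i < MAX_TRACKED; i++)
        if (table[i].pid == pid)
            return &table[i];
    return NULL;
}

/* Start timing a task unless it is already being timed. */
static void track(int pid, int cpu, uint64_t now_ns)
{
    if (pid == 0 || find_slot(pid))
        return;
    struct tracked_task *t = find_slot(0);  /* first free slot */
    if (!t)
        return;                             /* table full: drop the event */
    t->pid = pid;
    t->cpu = cpu;
    t->since_ns = now_ns;
}

static void untrack(int pid)
{
    struct tracked_task *t = pid ? find_slot(pid) : NULL;
    if (t)
        t->pid = 0;
}

/* sched_wakeup / sched_wakeup_new: the task just became runnable. */
void on_sched_wakeup(int pid, int cpu, uint64_t now_ns)
{
    track(pid, cpu, now_ns);
}

/* sched_switch: next starts running; prev waits again if still runnable. */
void on_sched_switch(int prev_pid, bool prev_runnable, int next_pid,
                     int cpu, uint64_t now_ns)
{
    untrack(next_pid);
    if (prev_runnable)
        track(prev_pid, cpu, now_ns);
}

/* sched_process_exit: drop any stale entry. */
void on_sched_process_exit(int pid)
{
    untrack(pid);
}

/* sched_migrate_task: the waiting task now belongs to another CPU. */
void on_sched_migrate_task(int pid, int dest_cpu)
{
    struct tracked_task *t = pid ? find_slot(pid) : NULL;
    if (t)
        t->cpu = dest_cpu;
}

/* Nanoseconds the task has been waiting, or 0 if it is not tracked. */
uint64_t waiting_ns(int pid, uint64_t now_ns)
{
    struct tracked_task *t = pid ? find_slot(pid) : NULL;
    return (t && now_ns > t->since_ns) ? now_ns - t->since_ns : 0;
}
```

In this model a task is timed only while it is runnable but not running, which is exactly the interval a starvation detector cares about.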
Key code examples
The queue_track backend is implemented as a module that conforms to a generic stalld backend interface. This design allows users to switch between the new and old backends easily.
The interface is defined as follows:
struct stalld_backend {
    int (*init)(void);
    int (*get)(char *buffer, int size);
    int (*get_cpu)(char *buffer, int size, int cpu);
    int (*parse)(struct cpu_info *cpu_info, char *buffer, size_t buffer_size);
    int (*has_starving_task)(struct cpu_info *cpu);
    void (*destroy)(void);
};
The power of BPF CO-RE is best demonstrated by how we handle changes in core kernel structures like task_struct. For instance, the member pointing to the task's CPU changed its location inside a substructure across kernel versions. With CO-RE, we can write portable code to access it without any compile-time branching:
struct task_struct___legacy {
    int cpu;
    unsigned int state;
};
static inline int task_cpu(const struct task_struct *p)
{
    // lp is a CO-RE "flavor" view of task_struct: libbpf ignores the
    // ___legacy suffix when matching the type against the running kernel
    const struct task_struct___legacy *lp = (const void *) p;
    return bpf_core_field_exists(lp->cpu)
        ? BPF_CORE_READ(lp, cpu)              // Older kernels: cpu lives in task_struct
        : BPF_CORE_READ(p, thread_info.cpu);  // Newer kernels: cpu lives in thread_info
}
At load time, libbpf's CO-RE relocation logic uses the target kernel's BTF to check which field exists in its task_struct and patches the code to use the correct accessor.
The heart of the BPF side is the map that shares data with user space. We use an array map, where each entry in the array corresponds to a CPU core and holds the data for tasks waiting on that core.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1024); // Max CPUs supported
    __type(key, u32);
    __type(value, struct stalld_cpu_data);
} stalld_per_cpu_data SEC(".maps");
The user-space stalld process simply needs to read the stalld_cpu_data struct for the CPU it is interested in. Although the queue_track backend is event-based, the user-space part of stalld still polls the map, for compatibility with the sched_debug backend. All complex logic of tracking tasks is handled efficiently inside the kernel by our BPF programs.
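To show what the user-space side of such a check might look like, here is a small sketch that scans one CPU's entry for the longest-waiting task above a threshold. The layout of stalld_cpu_data is not shown in this post, so the struct cpu_data_sketch below, its capacity, and the function find_starving_pid are all hypothetical stand-ins. In a real program the struct would be filled by a call such as libbpf's bpf_map_lookup_elem(map_fd, &cpu, &value) before scanning.

```c
#include <stdint.h>

#define MAX_QUEUE_TASKS 8          /* hypothetical per-CPU capacity */

struct queued_task {               /* hypothetical entry layout */
    int      pid;                  /* 0 = empty slot */
    uint64_t since_ns;             /* when the task entered the runqueue */
};

struct cpu_data_sketch {           /* stand-in for struct stalld_cpu_data */
    struct queued_task tasks[MAX_QUEUE_TASKS];
};

/* Returns the pid of the task that has waited longest among those whose
 * wait is at least threshold_ns, or 0 if no task qualifies. */
int find_starving_pid(const struct cpu_data_sketch *d,
                      uint64_t now_ns, uint64_t threshold_ns)
{
    int worst_pid = 0;
    uint64_t worst_wait = 0;

    for (int i = 0; i < MAX_QUEUE_TASKS; i++) {
        const struct queued_task *t = &d->tasks[i];

        if (t->pid == 0 || t->since_ns > now_ns)
            continue;              /* empty slot or timestamp in the future */

        uint64_t waited = now_ns - t->since_ns;
        if (waited >= threshold_ns &&
            (worst_pid == 0 || waited > worst_wait)) {
            worst_pid = t->pid;
            worst_wait = waited;
        }
    }
    return worst_pid;
}
```

Because the data arrives as fixed-size binary records, the scan is a plain loop over an array with no string handling at all, which is the practical difference from the sched_debug parser.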
Practical usage
The queue_track backend is now the default on every architecture with BPF support. You can verify which backend is active by running stalld in verbose mode (this shows queue_track on supported systems):
stalld -v
For debugging or comparison purposes, you can manually force stalld to use a specific backend with the -b flag. For example, to force the BPF backend:
stalld -b queue_track
To force the legacy debugfs backend:
stalld -b sched_debug
The BPF backend depends on the kernel being built with BTF support, which is necessary for BPF CO-RE. Most modern distributions enable this by default. You can check whether your system has BTF info with this command:
ls /sys/kernel/btf/vmlinux
If the file /sys/kernel/btf/vmlinux exists, then the queue_track backend should work. The current support matrix looks like this:
Architectures with BPF support (using the queue_track backend by default):
- x86_64
- aarch64
- s390x
- riscv64
- ppc64le
Architectures without BPF support (using the sched_debug backend):
- i686
The i686 architecture is typically limited by the availability of BPF toolchain support (like libbpf) in its standard build environments, not a fundamental incompatibility with BPF itself.
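The BTF check above can also be done programmatically. The sketch below is a hypothetical helper, not stalld's own selection logic: it assumes a wrapper that wants to pick a value for the -b flag based on whether a BTF file is readable, and the fallback to sched_debug is an assumption made for illustration.

```c
#include <unistd.h>

/* Hypothetical helper: suggest a backend name for the -b flag depending on
 * whether the given BTF file (normally /sys/kernel/btf/vmlinux) is readable. */
const char *choose_backend(const char *btf_path)
{
    return access(btf_path, R_OK) == 0 ? "queue_track" : "sched_debug";
}
```

Passing the path as a parameter keeps the helper easy to exercise on machines without BTF. In normal use it would be called as choose_backend("/sys/kernel/btf/vmlinux").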
Performance comparison
The theoretical benefits of the BPF approach are borne out in practice. The queue_track backend is superior to sched_debug across every meaningful metric:
| Metric | sched_debug (Old) | queue_track (New) |
|---|---|---|
| Data Source | Text file (debugfs) | Binary BPF map |
| Collection Model | Polling | Event-driven |
| Parsing Overhead | O(n) string parsing | O(1) map lookup |
| Missed Events | Possible | None |
| debugfs Dependency | Yes | No |
The most significant gains come from the shift to an event-driven model and the elimination of text-parsing work in stalld's main loop. The daemon simply reads structured data from the BPF map when it needs to check for starving tasks. The removal of the debugfs dependency also makes stalld more robust and deployable in environments where debugfs might be unavailable.
Conclusion
The new queue_track backend marks a significant milestone for the stalld project. By leveraging BPF and BPF CO-RE, we have replaced a fragile, inefficient mechanism with a solution that is purpose-built for high-performance system monitoring. The benefits are clear: Near-zero overhead, complete visibility into scheduler events, and cross-kernel portability that was previously unimaginable.
This work eliminates the long-standing technical debt of parsing debugfs and provides a solid foundation for future enhancements. BPF CO-RE, in particular, has proven to be the key technology for writing portable BPF applications, allowing us to support a wide range of kernel versions without resorting to a maze of conditional compilation flags.
The new backend is the default on most major architectures. We encourage users to try it out and report any issues. This evolution ensures that stalld remains a reliable and efficient tool for protecting real-time and high-performance workloads on Linux for years to come.