It's been about a year since our last update on debuginfod, an HTTP file server that serves debugging resources to debugger-like tools. Since then, we've been busy integrating clients across a range of developer tools and improving the server's available metrics. This article covers the features and improvements we've added to debuginfod in that time.
Note: For an introduction to debuginfod and how to use it, check out our first article introducing debuginfod and the follow-up explaining how to set up your own debuginfod services.
New debuginfod clients
Debuginfod is part of the elfutils project, so tools that already use elfutils to find or analyze debugging resources automatically inherit debuginfod support. Systemtap, Libabigail, and dwgrep all gain debuginfod support this way. In Systemtap, for example, debuginfod offers new ways to specify which processes to probe. Previously, if you wanted to explore a running user process, you had to provide either a process identifier (PID) or the executable path. With debuginfod, Systemtap can also probe processes by build-id, so it is possible to investigate specific versions of a binary independently of the location of the corresponding executable file.
Debuginfod includes a client library (libdebuginfod) that lets other tools easily query debuginfod servers for source files, executables, and, of course, debuginfo, which is generally DWARF (debugging with attributed record format) data. Since last year, a variety of developer tools have integrated debuginfod clients. As of version 2.34, Binutils includes debuginfod support in its components that use separate debuginfo (readelf and objdump). Starting in version 9.03, the Annobin project can fetch separate debuginfo files through debuginfod, and Dyninst support is planned for version 10.3.
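If you want to try the client side without writing any code, elfutils also ships a small command-line wrapper around libdebuginfod called debuginfod-find. The following sketch assumes DEBUGINFOD_URLS already points at a server (see below) and uses a placeholder build-id; each command prints the local path of the downloaded, cached file:

# Fetch the separate debuginfo for a binary installed on this system.
$ debuginfod-find debuginfo /usr/bin/python

# The same queries also work with a raw build-id (placeholder hex shown here).
$ debuginfod-find executable 0123456789abcdef0123456789abcdef01234567

# Fetch one source file referenced by a binary's debuginfo.
$ debuginfod-find source /usr/bin/python /usr/src/debug/python3-3.8.6-1.fc32.x86_64/Programs/python.c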
GDB 10.1 was recently released with debuginfod support, making it easy to download any missing debuginfo or source files on the fly as you debug your programs, whether the files belong to the executable being debugged or to any shared libraries it uses. GDB also takes advantage of improvements to the libdebuginfod API, including programmable progress updates, as shown in the following example (the output is abridged for clarity):
$ gdb /usr/bin/python
Reading symbols from /usr/bin/python...
Downloading separate debug info for /usr/bin/python
(gdb) list
Downloading source file /usr/src/debug/python3-3.8.6-1.fc32.x86_64/Programs/python.c...
8    wmain (int argc, wchar_t **argv)
9    {
10       return Py_Main(argc, argv)
11   }
(gdb) break main
Breakpoint 1 at 0x1140: file /usr/src/debug/python3-3.8.6-1.fc32.x86_64/Programs/python.c, line 16.
(gdb) run
Starting program: /usr/bin/python
Downloading separate debug info for /lib64/ld-linux-x86-64.so.2...
Downloading separate debug info for /lib64/libc.so.6...
Downloading separate debug info for /lib64/libpthread.so.0...
[...]
Configuring debuginfod to supply all of these tools with debugging resources is as simple as setting an environment variable (DEBUGINFOD_URLS) to the URLs of one or more debuginfod servers. If you don't want to set up your own server, we also provide servers that include debugging resources for many common Fedora, CentOS, Ubuntu, Debian, and openSUSE packages. For more information, explore the elfutils debuginfod page.
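As a minimal sketch, pointing your tools at the public elfutils server (substitute your own server's URL as needed) looks like this:

# Space-separated list of servers to query, in order.
$ export DEBUGINFOD_URLS="https://debuginfod.elfutils.org/"

# Any debuginfod-aware tool started from this shell (gdb, readelf, objdump,
# stap, and so on) will now fetch missing debuginfo and sources on demand.
$ gdb /usr/bin/python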
New debuginfod server metrics
Operating a debuginfod server for other people is both a pleasure and a chore. Once you have users, they will expect the service to stay up. While debuginfod is a simple server, it still needs monitoring and management. With that in mind, debuginfod comes with the usual logging-to-stderr flags, which are tailor-made for container or systemd operation (add another -v for more detail). Additionally, debuginfod offers a web API for sharing a variety of metrics about its internal operations. These metrics are exported in the Prometheus format, which is an industry standard, human-readable, and supported by numerous consumer and processing tools. The metrics are designed to let you see what the server's various threads are doing, how they're progressing with their workloads, and what types of errors they've encountered. When archived in a time-series database and lightly analyzed, the metrics can help you derive all sorts of neat quantities to guide resource allocation.
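You can spot-check this endpoint before wiring up any monitoring stack; the sketch below assumes debuginfod is listening on its default HTTP port, 8002, on the local host:

# Raw Prometheus-format metrics, one sample per line.
$ curl -s http://localhost:8002/metrics | head

# Narrow the output to a single metric family, such as the HTTP response counter discussed below.
$ curl -s http://localhost:8002/metrics | grep '^http_responses_total'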
Configuring Prometheus for debuginfod
To configure a Prometheus server to scrape debuginfod metrics, add a clause for HTTP or HTTPS to the prometheus.yml configuration file, as shown here:
scrape_configs:
  - job_name: 'debuginfod'
    scheme: http
    static_configs:
      - targets: ['localhost:8002']
  - job_name: 'debuginfod-https'
    scheme: https
    static_configs:
      - targets: ['debuginfod.elfutils.org']   # adjust
Adjust the global scrape_interval if you like; debuginfod can handle /metrics queries quickly. Let it run for a while, then let's take a tour of the metrics.
Visualizing debuginfod metrics
When debuginfod is directed to scan a large directory of archives or files for the first time, it uses a pool of threads (the -c option) to decompress and parse them. This activity can be I/O and CPU intensive, and ideally both! How can we tell? Look at the scanned_bytes_total metric, which tabulates the total size of the input files debuginfod has processed. Converted to a rate, it is close to the read throughput of the source filesystem.
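As a sketch, you can compute that rate directly against the Prometheus HTTP API (assuming Prometheus runs on its default port, 9090, on the same host):

# Average scan throughput, in bytes per second, over the last five minutes.
$ curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode 'query=rate(scanned_bytes_total[5m])'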
Note: The following screenshots were generated from built-in Prometheus graphs, but you could use another visualizer like Grafana.
Measuring total bytes scanned
The graph in Figure 1 represents an intensive scan job where a remote NFS server fed debuginfod at a steady 50 MB/s for some time, then a less impressive 10 MB/s later on. We believe Monday's arrival was the likely cause of this drop in scanning performance: developers returned from the weekend, and debuginfod had to share NFS capacity.
As you can see, the initial scan goes on and on. Developers keep developing, but the NFS server runs slower and slower. To analyze that, we can look at the thread_work_pending metric.
Measuring thread activity
The thread_work_pending metric jumps whenever a periodic traversal pass starts (the -t option and SIGUSR1) and winds back down to zero as the scanner threads work through the queue. The graph in Figure 2 covers the five-day period during which a multi-terabyte Red Hat Enterprise Linux 8 RPM dataset was scanned. The gentle slopes correspond to a few packages with a unique combination of enormous RPM sizes and many builds (Kernel, RT-Kernel, Ceph, LibreOffice). The sharp upticks and downticks correspond to concurrent re-traversals that were immediately dismissed because the indexed data was still fresh. When the line touches zero, the scanning is done; after that, only brief pulses should show.
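A quick way to check this outside of a graph is to read the counter straight from the metrics endpoint (again assuming the default port 8002):

# Outstanding scan/traversal work items; this should settle at zero
# once the initial scan is complete.
$ curl -s http://localhost:8002/metrics | grep '^thread_work_pending'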
Even before all the scanning is finished, the server is ready to answer queries. This is what it's all about, after all: letting developers enjoy that sweet nectar of debuginfo. But how many are using it, and at what cost? Let's check the http_responses_total metric, which counts and classifies web API requests.
Measuring HTTP responses
The graph in Figure 3 shows a small peak of errors (unknown build-ids), a large number of successes (content extracted from .rpm archives), and a very small number of other successes (served from the fdcache). This was the workload from a bulk, distro-wide debuginfod scan that could not take advantage of any serious caching or prefetching.
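To watch this traffic outside of a screenshot, a rate query along these lines works (the exact label names depend on your server's /metrics output, so treat the grouping as an assumption):

# Overall per-second request rate across all result classes.
$ curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode 'query=sum(rate(http_responses_total[5m]))'
# Add a "by (...)" clause with the labels your server reports to separate
# errors from .rpm extractions and fdcache hits.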
Let's take a look at the cost, too. If you measure cost in bytes of network data, pull up the http_responses_transfer_bytes pair of metrics; if you measure it in CPU time, pull up the http_responses_duration_milliseconds pair. With a little bit of PromQL, you can compute the average data transfer or processing time.
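For instance, assuming each pair follows the usual Prometheus _sum/_count convention, the averages could be sketched like this:

# Average milliseconds of service time per response, over the last five minutes.
$ curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode 'query=rate(http_responses_duration_milliseconds_sum[5m]) / rate(http_responses_duration_milliseconds_count[5m])'

# Average bytes transferred per response, over the same window.
$ curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode 'query=rate(http_responses_transfer_bytes_sum[5m]) / rate(http_responses_transfer_bytes_count[5m])'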
Measuring processing time, groom statistics and error counts
The graph in Figure 4 shows the duration variant for the same time frame as Figure 3. It reveals how the inability to cache or prefetch results sometimes required tens of seconds of service time, probably for the same large archives that took so long to scan. Configuring aggressive caching could help create more typical access patterns; see the metrics that mention fdcache.
Now that your server is up, it will also periodically groom its index (the -g option and SIGUSR2). As part of each groom cycle, another set of metrics is updated to provide an overview of the entire index. The last few numbers below give an idea of the storage requirements of a fairly large installation: 6.58TB of RPMs, indexed in 76.6GB of index data:
groom{statistic="archive d/e"} 11837375
groom{statistic="archive sdef"} 152188513
groom{statistic="archive sref"} 2636847754
groom{statistic="buildids"} 11477232
groom{statistic="file d/e"} 0
groom{statistic="file s"} 0
groom{statistic="filenames"} 163330844
groom{statistic="files scanned (#)"} 579264
groom{statistic="files scanned (mb)"} 6583193
groom{statistic="index db size (mb)"} 76662
The error_count metrics track errors from the various subsystems of debuginfod. Below, you can see how the errors are categorized by subsystem and type. We hope that increases in these metrics can be used to signal gradual degradation or outright failure, so we recommend attaching alerts to them:
error_count{libc="Connection refused"} 3
error_count{libc="No such file or directory"} 1
error_count{libc="Permission denied"} 33
error_count{libarchive="cannot extract file"} 1
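As a sketch of what such an alert could key on, the expression below returns results only when an error counter has grown in the last hour; in practice you would place the same expression in a Prometheus alerting rule rather than running it by hand:

# Non-empty output means some debuginfod subsystem reported new errors recently.
$ curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode 'query=increase(error_count[1h]) > 0'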
Finally, you can point Grafana at the Prometheus server that scrapes debuginfod to prepare informative and stylish dashboards, such as the one shown in Figure 5.
Conclusion
This article was an overview of the new client support and metrics available from debuginfod. We didn't cover all of the available metrics, so feel free to check them out for yourself. If you can think of more useful metrics for debuginfod, please get in touch with our developers at elfutils-devel@sourceware.org.