Debuginfod project update: New clients and metrics

It’s been about a year since our last update about debuginfod, an HTTP file server that serves debugging resources to debugger-like tools. Since then, we’ve been busy integrating clients across a range of developer tools and improving the server’s available metrics. This article covers the features and improvements we’ve added to debuginfod since our last update.

Note: For an introduction to debuginfod and how to use it, check out our first article introducing debuginfod and the follow-up explaining how to set up your own debuginfod services.

New debuginfod clients

Debuginfod is part of the elfutils project. Tools that already use elfutils to find or analyze debugging resources automatically inherit debuginfod support; Systemtap, Libabigail, and dwgrep all gain debuginfod this way. In Systemtap, for example, debuginfod offers new ways to specify which processes to probe. Previously, if you wanted to explore a running user process, you had to provide either a process identifier (PID) or the executable path. With debuginfod, Systemtap can also probe processes by build-id, so you can investigate a specific version of a binary regardless of where the corresponding executable file lives.

Debuginfod includes a client library (libdebuginfod) that lets other tools easily query debuginfod servers for source files, executables, and debuginfo itself, generally in the DWARF (Debugging With Attributed Record Formats) format. Since last year, a variety of developer tools have integrated debuginfod clients. As of version 2.34, Binutils includes debuginfod support in the components that use separate debuginfo (readelf and objdump). Starting in version 9.03, the Annobin project can fetch separate debuginfo files through debuginfod, and Dyninst plans to add debuginfod support in version 10.3.
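Under the hood, these clients talk to debuginfod servers over a simple HTTP API keyed by build-id. The sketch below illustrates the shape of the buildid endpoints that libdebuginfod requests; the server URL is the public elfutils instance, but the build-id shown is a made-up example, not a real hash.

```python
# Sketch of the debuginfod web API that libdebuginfod clients use.
# The build-id below is a hypothetical example value.
from urllib.parse import quote

def debuginfo_url(server, build_id):
    """URL that returns the DWARF debuginfo for a build-id."""
    return f"{server}/buildid/{build_id}/debuginfo"

def executable_url(server, build_id):
    """URL that returns the executable itself for a build-id."""
    return f"{server}/buildid/{build_id}/executable"

def source_url(server, build_id, source_path):
    """URL that returns one source file; the absolute path is
    percent-encoded, keeping the path separators."""
    return f"{server}/buildid/{build_id}/source{quote(source_path)}"

server = "https://debuginfod.elfutils.org"
bid = "0123abcd" * 5  # hypothetical 40-character hex build-id
print(debuginfo_url(server, bid))
```

A debugger that knows a binary's build-id can fetch everything it needs with plain HTTP GET requests against these URLs.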

GDB 10.1 was recently released with debuginfod support, making it easy to download any missing debuginfo or source files on-the-fly as you debug your programs, whether the files are for the executable being debugged or any shared libraries used by the executable. GDB also uses improvements to the libdebuginfod API, including programmable progress updates, as shown in the following example (note that this output is abridged for clarity):

$ gdb /usr/bin/python
Reading symbols from /usr/bin/python...
Downloading separate debug info for /usr/bin/python
(gdb) list
Downloading source file /usr/src/debug/python3-3.8.6-1.fc32.x86_64/Programs/python.c...
8 wmain (int argc, wchar_t **argv)
9 {
10 return Py_Main(argc, argv);
11 }
(gdb) break main
Breakpoint 1 at 0x1140: file /usr/src/debug/python3-3.8.6-1.fc32.x86_64/Programs/python.c, line 16.
(gdb) run
Starting program: /usr/bin/python
Downloading separate debug info for /lib64/ld-linux-x86-64.so.2...
Downloading separate debug info for /lib64/libc.so.6...
Downloading separate debug info for /lib64/libpthread.so.0...
[...]

Configuring debuginfod to supply all of these tools with debugging resources is as simple as setting an environment variable (DEBUGINFOD_URLS) with the URLs of debuginfod servers. In case you don’t want to set up your own server, we also provide servers that include debugging resources for many common Fedora, CentOS, Ubuntu, Debian, and OpenSUSE packages.  For more information, explore the elfutils debuginfod page.
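If you are launching tools from a script rather than a shell, the same configuration is just an environment variable assignment. A minimal sketch from Python (the internal server URL in the comment is a hypothetical example):

```python
# Point debuginfod-aware tools (gdb, readelf, objdump, stap, ...) at
# one or more servers via DEBUGINFOD_URLS; multiple URLs are
# whitespace-separated.
import os

os.environ["DEBUGINFOD_URLS"] = " ".join([
    "https://debuginfod.elfutils.org/",   # public elfutils server
    # "http://my-internal-server:8002/",  # hypothetical private instance
])
print(os.environ["DEBUGINFOD_URLS"])
```

Any debuginfod-enabled tool spawned from this process inherits the setting and starts fetching missing debugging resources automatically.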

New debuginfod server metrics

Operating a debuginfod server for other people is a pleasure and a chore. Once you have users, they will expect the service to stay up. While debuginfod is a simple server, it still needs monitoring and management. With that in mind, debuginfod comes with the usual logging-to-stderr flags, which are tailor-made for container or systemd operation. (Add another -v for more information.) Additionally, debuginfod offers a web API that shares a variety of metrics about its internal operations. These metrics are exported in the Prometheus exposition format, which is an industry standard, is human-readable, and comes with numerous consumer and processing tools. The metrics are designed to let you see what the server's various threads are doing, how they're progressing with their workloads, and what types of errors they've encountered. Archived in a time-series database and lightly analyzed, the metrics can yield all sorts of useful quantities to guide resource allocation.

Configuring Prometheus for debuginfod

To configure a Prometheus server to scrape debuginfod metrics, add a clause for HTTP or HTTPS to the prometheus.yml configuration file, as shown here:

     scrape_configs:
       - job_name: 'debuginfod'
         scheme: http
         static_configs:
         - targets: ['localhost:8002']
       - job_name: 'debuginfod-https'
         scheme: https
         static_configs:
         - targets: ['debuginfod.elfutils.org'] # adjust

Adjust the global scrape_interval if you like; debuginfod answers /metrics queries quickly. Let the server run for a while, then take a tour of the metrics.
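The /metrics endpoint serves the plain-text Prometheus exposition format, so it is easy to inspect by hand or from a script. Here is a minimal parsing sketch; the sample scrape text and its label values are hypothetical:

```python
# Minimal parse of a Prometheus text-format scrape, like the one
# debuginfod serves at /metrics.  Sample values are hypothetical.
sample = """\
# HELP scanned_bytes_total total number of input bytes scanned
# TYPE scanned_bytes_total counter
scanned_bytes_total 123456789
http_responses_total{result="buildid"} 42
"""

def parse_metrics(text):
    """Map each metric line (name plus labels) to its numeric value."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

m = parse_metrics(sample)
print(m["scanned_bytes_total"])
```

In practice Prometheus does this parsing for you on every scrape; the sketch just shows how little machinery the format requires.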

Visualizing debuginfod metrics

When debuginfod is directed to scan a large directory of archives or files for the first time, it uses a pool of threads (the -c option) to decompress and parse them. This activity can be I/O intensive, CPU intensive, and ideally both! How can we tell? Look at the scanned_bytes_total metric, which tabulates the total size of the input files debuginfod has processed. Converted to a rate, it approximates the read throughput of the source filesystem.
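Converting a monotonic counter like scanned_bytes_total into a throughput figure is exactly what Prometheus's rate() function does. In miniature, with hypothetical sample values:

```python
# What rate(scanned_bytes_total[...]) computes, in miniature: the
# per-second increase between two counter samples.  Numbers are
# hypothetical.
def counter_rate(v0, v1, seconds):
    """Per-second rate between two samples of a monotonic counter."""
    return (v1 - v0) / seconds

# e.g. 3 GB scanned over a 60-second window is about 50 MB/s
mb_per_s = counter_rate(0, 3_000_000_000, 60) / 1_000_000
print(mb_per_s)  # 50.0
```

(Real rate() also handles counter resets when the server restarts, which this sketch omits.)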

Note: The following screenshots were generated from built-in Prometheus graphs, but you could use another visualizer like Grafana.

Measuring total bytes scanned

The graph in Figure 1 represents an intensive scan job where a remote NFS server fed debuginfod at a steady 50 MB/s for some time, then at a less impressive 10 MB/s later on. Monday's arrival was the likely cause of this drop in scanning performance: developers returned from the weekend, and debuginfod had to share NFS capacity.


Figure 1: Results from debuginfod’s scanned_bytes_total metric displayed in a Prometheus graph.

As you can see, the initial scan goes on and on. Developers keep developing, but the NFS server runs slower and slower. To analyze that, we can look at the thread_work_pending metric.

Measuring thread activity

The thread_work_pending metric jumps whenever a periodic traversal pass is started (the -t option and SIGUSR1) and winds back down to zero as those scanner threads do their work. The graph in Figure 2 represents the five-day period where a multi-terabyte Red Hat Enterprise Linux 8 RPM dataset was scanned. The gentle slope-periods corresponded to a few packages with a unique combination of enormous RPM sizes and many builds (Kernel, RT-Kernel, Ceph, LibreOffice). Sharp upticks and downticks corresponded to concurrent re-traversals that were immediately dismissed because the indexed data was still fresh. As the line touches zero, the scanning is done. After that, only brief pulses should show.


Figure 2: Results from debuginfod’s thread_work_pending metric displayed in a Prometheus graph.

Even before all the scanning is finished, the server is ready to answer queries. This is what it’s all about, after all—letting developers enjoy that sweet nectar of debuginfo. But how many are using it, and at what cost? Let’s check the http_responses_total metric, which counts and classifies web API requests.

Measuring HTTP responses

The graph in Figure 3 shows a small peak of errors (unknown build-ids), a large number of successes (extracting content .rpm), and a very small number of other successes (using the fdcache). This was the workload from a bulk, distro-wide debuginfod scan that could not take advantage of any serious caching or prefetching.


Figure 3: Results from debuginfod’s http_responses_total metric displayed in a Prometheus graph.

Let’s take a look at the cost, too. If you measure cost in bytes of network data, pull up the http_responses_transfer_bytes pair of metrics; if you measure it in CPU time, pull up the http_responses_duration_milliseconds pair. With a little bit of PromQL, you can compute the average data transfer size or processing time.
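Each of these pairs follows the usual cumulative-sum-plus-count pattern, so the average is just one division, the same arithmetic a PromQL sum(...) / sum(...) expression performs. A sketch with hypothetical values:

```python
# Average service time from a (sum, count) metric pair such as the
# http_responses_duration_milliseconds pair -- the same arithmetic
# as dividing the two series in PromQL.  Sample values hypothetical.
def average(total_sum, total_count):
    """Mean per-event value of a cumulative sum/count metric pair."""
    return total_sum / total_count if total_count else 0.0

duration_ms_sum = 180_000.0   # cumulative milliseconds of service time
duration_ms_count = 1_200.0   # number of responses served
print(average(duration_ms_sum, duration_ms_count))  # 150.0 ms per response
```

Apply the same division to the transfer_bytes pair to get average response size instead of average service time.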

Measuring processing time, groom statistics, and error counts

The graph in Figure 4 shows the duration variant for the same time frame in Figure 3. It reveals how the inability to cache or prefetch the results sometimes required tens of seconds of service time, probably from the same large archives that took so long to scan. Configuring aggressive caching could help to create more typical access patterns. See the metrics that mention fdcache.


Figure 4: Measuring processing time with debuginfod metrics in Prometheus.

Now that your server is up, it will also periodically groom its index (the -g option and SIGUSR2). As part of each groom cycle, another set of metrics is updated to provide an overview of the entire index. The last few numbers give an idea of the storage requirements of a fairly large installation: 6.58TB of RPMs, indexed in 76.6GB of index data:

        groom{statistic="archive d/e"} 11837375
        groom{statistic="archive sdef"} 152188513
        groom{statistic="archive sref"} 2636847754
        groom{statistic="buildids"} 11477232
        groom{statistic="file d/e"} 0
        groom{statistic="file s"} 0
        groom{statistic="filenames"} 163330844
        groom{statistic="files scanned (#)"} 579264
        groom{statistic="files scanned (mb)"} 6583193
        groom{statistic="index db size (mb)"} 76662

The error_count metrics track errors from various subsystems of debuginfod.

Here, you can see how the errors are categorized by subsystem and type. An increase in these metrics can signal gradual degradation or outright failure, so we recommend attaching alerts to them.

        error_count{libc="Connection refused"}  3
        error_count{libc="No such file or directory"}   1
        error_count{libc="Permission denied"}   33
        error_count{libarchive="cannot extract file"}   1

Finally, you can use Grafana to scrape the debuginfod Prometheus server to prepare informative and stylish dashboards, such as the one shown in Figure 5.


Figure 5: Debuginfod metrics displayed on a Grafana dashboard.

Conclusion

This article offered an overview of the new client support and metrics available from debuginfod. We didn’t cover all of the available metrics, so feel free to explore them for yourself. If you can think of more useful metrics for debuginfod, please get in touch with our developers at elfutils-devel@sourceware.org.
