I work at Red Hat on the GNU Compiler Collection (GCC). In GCC 10, I added the new -fanalyzer option, a static analysis pass for identifying various problems at compile-time, rather than at runtime. The initial implementation was aimed at early adopters, who found a few bugs, including a security vulnerability: CVE-2020-1967. Bernd Edlinger, who discovered the issue, had to wade through many false positives accompanying the real issue. Other users also managed to get the analyzer to crash on their code.
I've been rewriting the analyzer to address these issues in the next major release, GCC 11. In this article, I describe the steps I'm taking to reduce the number of false positives and make this static analysis tool more robust.
Tracking program states
I've been attempting to fix bugs in -fanalyzer as they are reported via GCC's Bugzilla instance. The analyzer's state-tracking component in GCC 10 had many crasher bugs. The more bugs I fixed, the more bugs turned up, with no apparent slowdown in the rate of discovery. This suggested to me that I needed to rewrite the component.
I made at least two big mistakes in how I tracked program states in the original -fanalyzer implementation. These were in how I tracked symbolic values and regions. The GCC 10 implementation attempted to assign unique IDs to these symbolic entities and canonicalize them so that different states could be compared (equivalent entities ought to have the same ID between different states). Unfortunately, there was always one more canonicalization issue.
In the new implementation, I've made these entities singletons. As a result, a unique object now represents the (symbolic) initial value of a particular parameter at a function call at the entry to the analysis. The change to singletons got rid of large amounts of fiddly canonicalization code, using simple pointers instead. The implementation is simpler, faster, and I've been able to fix all of the crasher bugs. (I'm not quite sure what benefit I saw in the original approach, but hindsight is 20/20, I guess.)
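The singleton idea can be sketched in a few lines of C. This is only an illustration, not the analyzer's actual code (GCC itself is written in C++), and the names `sym_value` and `intern_initial_value` are hypothetical. Each (function, parameter) pair is interned once, so two program states can compare symbolic values by pointer equality instead of canonicalizing IDs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical symbolic value: "the initial value of parameter
   PARAM_IDX of function FN_NAME". */
struct sym_value {
    const char *fn_name;
    int param_idx;
    struct sym_value *next;   /* chain of all interned values */
};

/* Toy intern table: a linked list searched linearly. */
static struct sym_value *interned_values;

/* Return the unique object for (fn_name, param_idx), creating it on
   first use. Returns NULL if allocation fails. */
const struct sym_value *intern_initial_value(const char *fn_name,
                                             int param_idx)
{
    for (struct sym_value *v = interned_values; v; v = v->next)
        if (v->param_idx == param_idx && strcmp(v->fn_name, fn_name) == 0)
            return v;

    struct sym_value *v = malloc(sizeof *v);
    if (!v)
        return NULL;
    v->fn_name = fn_name;     /* sketch: assumes the string outlives the table */
    v->param_idx = param_idx;
    v->next = interned_values;
    interned_values = v;
    return v;
}
```

With this scheme, asking twice for the same parameter yields the same pointer, so "are these two values equivalent?" becomes a single pointer comparison.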
The second big change is in what the symbolic values and regions represent. Previously, I represented a mapping to symbolic values, where the keys were symbolic access paths of memory regions. In the new implementation, I've represented the state as mappings of clusters of bit-offsets within memory. These are sometimes concrete (for example, at a specific bit-offset) and sometimes symbolic (such as an array offset where the index is symbolic). This approach does a much better job of handling unions, pointer aliasing, and so forth. Additionally, lots of fiddly bugs "fixed themselves" when I switched to the new implementation, which reassured me that I was on the right track.
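As a rough picture of what "bindings keyed by bit-offsets" might look like, here is a minimal C sketch. All of the type and function names are hypothetical, values are reduced to integer IDs, and overlapping ranges are not handled, so it should be read as a toy model rather than the analyzer's data structure. A key is either a concrete bit range or a symbolic offset, and a cluster groups the bindings for one base region.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical binding key: a concrete bit range or a symbolic offset. */
enum key_kind { KEY_CONCRETE, KEY_SYMBOLIC };

struct binding_key {
    enum key_kind kind;
    unsigned start_bit;   /* used only for KEY_CONCRETE */
    unsigned size_bits;   /* used only for KEY_CONCRETE */
    int symbol_id;        /* used only for KEY_SYMBOLIC */
};

struct binding {
    struct binding_key key;
    int value_id;         /* stand-in for a pointer to a symbolic value */
};

/* A cluster holds every binding within one base region, such as one
   variable or one heap allocation. */
#define MAX_BINDINGS 16
struct cluster {
    struct binding bindings[MAX_BINDINGS];
    size_t count;
};

static bool keys_equal(const struct binding_key *a,
                       const struct binding_key *b)
{
    if (a->kind != b->kind)
        return false;
    if (a->kind == KEY_CONCRETE)
        return a->start_bit == b->start_bit && a->size_bits == b->size_bits;
    return a->symbol_id == b->symbol_id;
}

/* Bind KEY to VALUE_ID, overwriting any binding with the same key.
   Returns false if the cluster is full. */
bool cluster_bind(struct cluster *c, struct binding_key key, int value_id)
{
    for (size_t i = 0; i < c->count; i++) {
        if (keys_equal(&c->bindings[i].key, &key)) {
            c->bindings[i].value_id = value_id;
            return true;
        }
    }
    if (c->count == MAX_BINDINGS)
        return false;
    c->bindings[c->count].key = key;
    c->bindings[c->count].value_id = value_id;
    c->count++;
    return true;
}

/* Return the value bound to KEY, or -1 if there is no such binding. */
int cluster_lookup(const struct cluster *c, struct binding_key key)
{
    for (size_t i = 0; i < c->count; i++)
        if (keys_equal(&c->bindings[i].key, &key))
            return c->bindings[i].value_id;
    return -1;
}
```

In this model, two union members that occupy the same bits (say, bits 0 to 31 of the union) share one key, so a write through one member is visible when reading the other. An access-path key would treat them as unrelated.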
Memory leak detection and non-determinism
I had to rewrite memory leak detection for the new implementation completely. That said, the old implementation had many false positives, whereas the new one seems much less prone to them.
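For readers who haven't seen the kind of bug involved, here is a small, deliberately buggy C function of the sort a leak checker is meant to catch. The function name and logic are invented for illustration, and whether a given GCC version reports this exact case is not something this sketch claims.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated concatenation of A and B, or NULL on error.
   The caller frees the result. */
char *join_checked(const char *a, const char *b, size_t max_len)
{
    size_t len = strlen(a) + strlen(b);
    char *buf = malloc(len + 1);
    if (!buf)
        return NULL;
    if (len > max_len)
        return NULL;      /* BUG: buf is leaked on this error path */
    strcpy(buf, a);
    strcat(buf, b);
    return buf;
}
```

The leak only occurs on the path where the allocation succeeds and the length check fails, which is why a path-based analysis is a natural fit for finding it.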
Another issue I ran into is non-determinism, where the analyzer's exact behavior would vary from invocation to invocation. At various places, the implementation would iterate through values, and the order of iteration would depend implicitly on precise pointer values due to hashing algorithms. The pointer values can differ due to address-space layout randomization, which led to different results. I've now fixed such logic in the code to ensure that the analyzer's behavior is repeatable from run to run.
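One common way to remove this kind of pointer-order dependence is to give each entity a stable ID and sort by it before iterating. The sketch below assumes that approach. It is not necessarily how the analyzer was fixed, and the `entity` type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

struct entity {
    unsigned id;          /* hypothetical stable ID, assigned in creation order */
    const char *name;
};

/* qsort comparator: order entity pointers by stable ID, never by address. */
static int compare_by_id(const void *pa, const void *pb)
{
    const struct entity *a = *(const struct entity *const *)pa;
    const struct entity *b = *(const struct entity *const *)pb;
    return (a->id > b->id) - (a->id < b->id);
}

/* Sort an array of entity pointers into an order that does not depend
   on where the entities happen to live in memory. */
void sort_for_stable_iteration(const struct entity **items, size_t n)
{
    qsort(items, n, sizeof *items, compare_by_id);
}
```

Because the IDs depend only on creation order, the resulting iteration order stays the same even when address-space layout randomization moves the objects around.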
Four new warnings
The GCC 10 implementation of -fanalyzer added 15 warnings:
- Warnings relating to memory management:
- Warnings relating to missing error-checking or misusing NULL pointers:
- Warnings relating to
- Warnings relating to use-after-return from stack frames:
- Unsafe call warning:
- Proof-of-concept warnings:
For GCC 11, I've added four new warnings:
Each of these corresponds to a pre-existing warning implemented in the C and C++ front ends, but with a "-Wanalyzer" prefix rather than "-W." As an example, -Wanalyzer-write-to-const corresponds to -Wwrite-to-const. It's important to note that the two implementations are slightly different: Whereas the existing warning merely walks the syntax tree of a particular expression, the analyzer variant does an interprocedural path-based analysis, looking for code paths that attempt to write to a
After discussing whether to reuse the existing command-line options for such warnings, I chose to create new options to make it explicit that the warnings are implemented differently. The -Wanalyzer-prefixed warnings will find more issues, but they are much more expensive at compile-time. (Though you've already paid that price by choosing
In progress: Attributes for marking APIs
GCC has long had __attribute__((malloc)) for marking an API entry point as being a memory allocator. In previous GCC releases, this was purely a hint to the optimizer's pointer-aliasing logic. The attribute let the optimizer "know" that the pointer returned from the function pointed to different memory than the other pointers being optimized. The optimizer could then eliminate reads from locations that had not been clobbered after a write through the returned pointer.
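A small example may make the aliasing hint concrete. The function names below are made up for illustration, and whether the compiler actually performs the optimization depends on the optimization level and the GCC version.

```c
#include <stdlib.h>

/* Hypothetical allocator wrapper. The attribute promises that the
   returned pointer does not alias any other pointer that is valid when
   the function returns. */
__attribute__((malloc))
int *alloc_counter(void)
{
    return calloc(1, sizeof(int));
}

/* Returns *existing plus a freshly allocated counter set to
   *existing + 1, or -1 if allocation fails. */
int bump_both(int *existing)
{
    int before = *existing;
    int *fresh = alloc_counter();
    if (!fresh)
        return -1;
    *fresh = before + 1;
    /* Since fresh cannot alias existing, the store above cannot have
       changed *existing, so the compiler may reuse `before` here
       instead of reloading from memory. */
    int result = *existing + *fresh;
    free(fresh);
    return result;
}
```

Without the attribute, the compiler would have to assume that the store through `fresh` might have modified `*existing` and reload it.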
In GCC 11, this attribute can now take an additional parameter marking which deallocator function should be called on the result. I'm working on generalizing -fanalyzer to warn about mismatches, leaks, and double-frees for APIs marked with this attribute. So far, however, it's unclear if the results will be useful without many additional attributes. For example, I attempted to use the following attribute to detect a leak in a Linux driver (CVE-2019-19078):
extern struct urb *usb_alloc_urb(int iso_packets, gfp_t mem_flags);
extern void usb_free_urb(struct urb *urb);
I added the attribute to mark the functions as an allocation/deallocation pair, where there is a leak of an urb on an error-handling path. Unfortunately, various other functions take struct urb *, and the analyzer conservatively assumes that an urb passed to them might or might not be freed. It thus stops tracking state for them and only reports the issue if I disable much of the intervening code. This feature needs additional work to be useful except in the simplest cases.
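To show the general shape of such a pairing outside the kernel, here is a self-contained sketch using a hypothetical `widget` API in place of the urb functions. The two-argument `malloc` attribute form is the one documented for GCC 11. The `PAIRED_WITH` macro is an invention of this sketch so that the code still compiles with older GCC releases and other compilers.

```c
#include <stdlib.h>

struct widget { int id; };

/* GCC 11 and later accept a deallocator argument to the malloc
   attribute; elsewhere the annotation is dropped. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 11
#  define PAIRED_WITH(dealloc) __attribute__((malloc, malloc(dealloc, 1)))
#else
#  define PAIRED_WITH(dealloc)
#endif

/* The deallocator must be declared before the attribute refers to it. */
void widget_free(struct widget *w);
PAIRED_WITH(widget_free) struct widget *widget_alloc(int id);

void widget_free(struct widget *w)
{
    free(w);
}

struct widget *widget_alloc(int id)
{
    struct widget *w = malloc(sizeof *w);
    if (w)
        w->id = id;
    return w;
}

/* Returns 0 on success, -1 on error. */
int use_widget(int id)
{
    struct widget *w = widget_alloc(id);
    if (!w)
        return -1;
    if (w->id < 0)
        return -1;        /* BUG: w is never passed to widget_free here */
    widget_free(w);
    return 0;
}
```

This is the simple case: nothing else touches `w` between allocation and the error return. As soon as the pointer is passed to another function whose body the analyzer cannot see, the conservative assumption described above applies.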
In progress: HTML output
The analyzer's emitted control flow paths can be very verbose, so I've been experimenting with other forms of output. I have an implementation of HTML output, in which the path information is written out to a separate HTML file. Here are a few examples:
- Double-free bug
- Signal handler issue
- Memory leak (due to
k to move forward and back through control-flow events.
Unfortunately, the HTML output doesn't capture the warnings themselves, just the paths. Fixing that would require deep changes to GCC's diagnostics subsystem, which I'm wary of doing at this point in the development cycle. So, I'm not sure I've found the best way to enable the HTML format as an option; it seems better to capture all of the diagnostics somehow as build artifacts, rather than just the paths of those diagnostics that have paths associated with them.
What's next for GCC 11 and -fanalyzer
We're in the bug-fixing phase of GCC 11 development, aiming for a release in the spring of 2021. The analyzer still needs a fair bit of bug-fixing, and we're working on scaling it up. I plan to focus on that for this first part of the new year. (These problems can be related, by the way: Bugs sometimes lead to loop-handling going awry. The analyzer will then attempt to effectively unroll a loop, which leads to hitting a safety limit and a slow, incomplete analysis.)
I am still developing -fanalyzer only for C in GCC 11. I added partial support for C++'s delete, but there are enough missing features that it's not yet worth using on real C++ code. I plan to make the analyzer robust and scalable for C code in GCC 11 and defer C++ support to GCC 12.
GCC 11 will be in Fedora 34, which should also be out in the spring of 2021. For simple code examples, you can play around with the new GCC online at godbolt.org. Select your GCC "trunk" and add -fanalyzer to the compiler options. Have fun!