It’s October again, so let me tell you what happened with the clang bytecode interpreter this year. In case this is the first time you've encountered this topic: This is a project for a bytecode interpreter in clang to evaluate constant expressions at compile time. This work is already available in all recent clang versions when -fexperimental-new-constant-interpreter is passed.
Previous articles in this series:
According to Git, there have been roughly 500 commits since my last blog post. While there was no huge new feature or breakthrough that I can remember now, the implementation got a lot more solid. I’ve also done a good chunk of performance work this year.
While we still had approximately 155 failures in the clang test suite last year, we are now down to 90. You can track the progress here. We now have an actual working implementation of __builtin_constant_p. And while it does disagree with the current interpreter in a few edge cases, it should be enough for all real-world use cases (see the section about libc++ testing).
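If you've never used it: __builtin_constant_p(e) evaluates to 1 if e is known to be a compile-time constant and to 0 otherwise. A minimal example (my own, not from the test suite):

static_assert(__builtin_constant_p(42));  // a literal is a known constant
extern int g;
static_assert(!__builtin_constant_p(g));  // a runtime variable's value is not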
In the remainder of this blog post, let's have a more detailed look at a few of the changes that happened in the last year.
Optimizing reads from known sources
When looking at constant expressions (and especially functions), one might have the urge to optimize things away. Let's say we profile the interpreter and figure out that integer additions are particularly inefficient. We might then want to check whether the right-hand side of a += compound assignment operator is 0 and, if it is, not emit any bytecode for the addition at all:
constexpr int test() {
  int a;
  a += 0;
  a += 1;
  return a;
}
static_assert(test() == 1);
Not doing the first addition here would of course be faster, but if you look closely, you'll notice that a is left uninitialized, so the first addition already fails. Not emitting it would make the diagnostic appear for the second addition instead, which is wrong. You can imagine similar cases for divisions or multiplications by 1, a unary +, or all discarded statements. If they can fail, we need to evaluate them and diagnose the failure.
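Here's a hypothetical variation along the same lines; the multiplication by 1 looks like a no-op, but evaluating it must still diagnose the uninitialized read:

constexpr int test2() {
  int a;           // deliberately left uninitialized
  int b = a * 1;   // looks like a no-op, but reading a is invalid
  return b;
}
static_assert(test2() == 0); // error: a is read before being initialized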
These are only small examples, but in general we need to be cautious with optimizations at this level because we have to diagnose all undefined behavior, and the quality of the diagnostic output is important.
Let’s instead look at something we can optimize.
constexpr int G = 12;
constexpr int test(int P) {
  int L = 8;
  return G + L + P;
}
static_assert(test(5) == 25);
This code is pretty straightforward and should work. It adds up the values of a global variable, a local variable, and a parameter. For all three values, the abstract syntax tree (AST) that clang generates consists of a node to get a pointer to the variable, and one to load from that variable. In detail, it looks something like this:
`-ImplicitCastExpr 0x7d5284257a50 <col:18> 'int' <LValueToRValue>
`-DeclRefExpr 0x7d5284257a28 <col:18> 'int' lvalue ParmVar 0x7d5284257660 'P' 'int'
Here we see the lvalue for the parameter P and the lvalue-to-rvalue cast that loads the value from the parameter. We used to emit bytecode just like this:
[...]
48 GetPtrGlobal 0
64 LoadPopSint32
72 GetPtrLocal 40
88 LoadPopSint32
96 AddSint32
104 GetPtrParam 0
120 LoadPopSint32
128 AddSint32
136 Destroy 0
152 RetSint32
[...]
For brevity, I've removed some less interesting opcodes at the beginning and at the end of the function. As you can see from the bytecode, we get a pointer (GetPtrGlobal, GetPtrLocal, GetPtrParam) for all three values, followed by a load of a 32-bit signed integer (LoadPopSint32). Those values are added up and returned from the function. This works just fine, but we can do better.
Even before this optimization, we had opcodes to get the value of a variable without the need to first get a pointer to it. Instead of a GetPtrGlobal + LoadPop pair, we can emit just a GetGlobal opcode, which immediately pushes the value of the global variable onto the stack. This reduces traffic on our stack data structure and reduces the number of pointers we create. The latter is particularly important because we’re always tracking what memory a pointer points to, just in case that memory goes out of scope.
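Conceptually, the bytecode generator now special-cases loads whose source expression is a direct variable reference. Below is a heavily simplified sketch of that idea, not clang's actual code; the isGlobal/isLocal/isParam classification and the emit* helpers are made up for illustration:

#include "clang/AST/Expr.h"
using namespace clang;

// Hypothetical helpers standing in for the real bytecode emitter:
bool isGlobal(const ValueDecl *D);
bool isLocal(const ValueDecl *D);
bool isParam(const ValueDecl *D);
bool emitGetGlobal(const ValueDecl *D);
bool emitGetLocal(const ValueDecl *D);
bool emitGetParam(const ValueDecl *D);
bool compilePointer(const Expr *E);
bool emitLoadPop(QualType T);

// Compile the operand of an lvalue-to-rvalue conversion.
bool compileLoad(const Expr *SubExpr) {
  if (const auto *DRE = dyn_cast<DeclRefExpr>(SubExpr->IgnoreParens())) {
    // The load source is a plain variable: emit one fused opcode.
    const ValueDecl *D = DRE->getDecl();
    if (isGlobal(D)) return emitGetGlobal(D);
    if (isLocal(D))  return emitGetLocal(D);
    if (isParam(D))  return emitGetParam(D);
  }
  // Fallback: compute a pointer, then load from it, as before.
  return compilePointer(SubExpr) && emitLoadPop(SubExpr->getType());
}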
Because the relevant opcodes already existed, we only needed to do some type checks when generating bytecode, and the result looks like this:
[...]
48 GetGlobalSint32 0
64 GetLocalSint32 40
80 AddSint32
88 GetParamSint32 0
104 AddSint32
112 Destroy 0
128 RetSint32
[...]
This is nicer to read, and also more efficient. When evaluating a LoadPop opcode, we don’t know whether the pointer we load from points to a global variable, a local variable, a temporary, a parameter, and so on. We have to be conservative and do all possible checks. If we know where we load from, we can skip unnecessary work. For example, loading a value from an extern variable doesn’t generally work, so we need to check for this. But local variables cannot be extern, so we can skip this check for the GetLocal opcode.
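To make the extern case concrete, here is roughly what such code looks like in source form (my own example):

extern int E;        // declared, but no compile-time value available
constexpr int F = E; // error: E is not usable in a constant expression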
libc++ testing
In January 2025, I started running the libc++ test suite in addition to the clang test suite. The libc++ tests provide more real-world use cases, whereas the clang test suite usually stress-tests edge cases in the C and C++ language specifications. As such, the initial run of the libc++ test suite resulted in a large number of failures (Figure 1).

We initially had over 750 test failures. A large chunk of those was caused by missing support for some string builtins; the libc++ test suite extensively tests std::string, after all. Through that and other tests, such as those for std::vector, it also makes heavy use of dynamic memory allocation at compile time.
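As a quick illustration of compile-time allocation (a minimal example of my own; it needs C++20 and a standard library with constexpr containers):

#include <cstddef>
#include <vector>

constexpr std::size_t count() {
  std::vector<int> v = {1, 2, 3}; // allocates on the compile-time heap
  return v.size();                // the memory is freed when v goes out of scope
}
static_assert(count() == 3);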
Unfortunately, the libc++ test cases are usually exceedingly complex. Fixing them requires a variation of:
- Letting clang preprocess the test so the result is independent of include files and most command-line options. This usually results in a file that is way too large; the longest one I saw was around 300KLOC.
- Using a mixture of manual editing and running cvise to reduce the test case (a sketch follows this list). If you’ve never had to do something like this: Good! Otherwise, cvise is great and you should know how to use it.
- Debugging the reduced test case to figure out the root cause of the problem.
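For the curious, a reduction session looks roughly like this. The file name and the grepped error text are made up for illustration; the script must exit 0 exactly when the reduced file still reproduces the failure:

#!/bin/sh
# check.sh: does the bug still reproduce?
bin/clang -c repro.cpp -std=c++20 -fexperimental-new-constant-interpreter 2>&1 \
  | grep -q "read of uninitialized object"

$ cvise check.sh repro.cpp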
I often managed to reduce the failing test case to a self-contained test of just a few dozen lines. After three months, we reached the milestone of zero test failures.
Because libc++ is in active development, it changes constantly. This means we sometimes see regressions in the number of test failures, but that's okay.
The #embed benchmark
Let's look at an example benchmark using #embed:
constexpr char str[] = {
#embed "sqlite3.c" suffix(,0)
};

consteval unsigned checksum(const char *s) {
  unsigned result = 0;
  for (const char *p = s; *p != '\0'; ++p) {
    result += *p;
  }
  return result;
}
constexpr unsigned C = checksum(str);
This test case simply reads the sqlite3 amalgamation (which is almost 9 MB) into a global char array, then sums up all the character values into the variable C.
Let's time this test with the current interpreter and then with the bytecode interpreter. All runs use a regular release build of clang with assertions disabled and no other special build options.
$ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000
  Time (mean ± σ):     36.490 s ±  0.361 s    [User: 36.161 s, System: 0.299 s]
  Range (min … max):   35.711 s … 37.611 s    30 runs

$ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):     14.799 s ±  0.274 s    [User: 14.484 s, System: 0.303 s]
  Range (min … max):   14.332 s … 15.345 s    30 runs
So the bytecode interpreter only needs roughly 50% of the time that the current interpreter takes. That's good, but let's look at the numbers in a bit more detail (Figure 2).

Our test case consists of the first declaration of str, which initializes a large char array. That’s not a function call, so the bytecode interpreter doesn’t interpret any bytecode here at all; instead, it evaluates the expressions as they come. This is similar to what the current interpreter is doing, but the bytecode interpreter has some additional overhead. If we look at the runtime for only the initialization, the bytecode interpreter is slower.

The difference used to be much worse, but I've since optimized the bytecode interpreter to be faster at checking large arrays for initialization.
Heap allocation benchmark
Here's another benchmark that I've been looking at in the past year:
constexpr int x = []() {
  for (unsigned I = 0; I != 10'000; ++I) {
    char *buffer = new char[1024];
    for (unsigned c = 0; c != 1024; ++c)
      buffer[c] = 97 + (c % 26);
    delete[] buffer;
  }
  return 1;
}();
This is a pretty simple but useless benchmark of dynamic memory allocation. We allocate some memory, fill it with characters, and then free it again. All of this happens in a lambda that is called immediately. The final output is just a variable with the value 1.
This is a benchmark we can even time against GCC (I was using 14.3.1), so let's do that:
$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
  Time (mean ± σ):     27.792 s ±  0.263 s    [User: 27.732 s, System: 0.036 s]
  Range (min … max):   27.329 s … 28.287 s    30 runs

$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):      4.241 s ±  0.103 s    [User: 4.218 s, System: 0.019 s]
  Range (min … max):    4.070 s …  4.504 s    30 runs

$ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
  Time (mean ± σ):     43.829 s ±  0.308 s    [User: 42.170 s, System: 1.619 s]
  Range (min … max):   43.201 s … 44.556 s    30 runs
Figure 4 illustrates the comparison.

GCC has an interesting optimization that seems to eliminate the heap allocation altogether if it is unused. If we remove the loop initializing the allocated memory (and increase the allocation size from 1024 to 1024 * 1024), GCC beats both the new and the old clang interpreter.
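The modified benchmark then looks roughly like this (my reconstruction from the description above):

constexpr int x = []() {
  for (unsigned I = 0; I != 10'000; ++I) {
    char *buffer = new char[1024 * 1024]; // never read or written
    delete[] buffer;
  }
  return 1;
}();

Timing the three again: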
$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
  Time (mean ± σ):     1048.113 s ± 19.781 s    [User: 783.382 s, System: 263.909 s]
  Range (min … max):   1008.151 s … 1082.708 s    30 runs

$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):     439.9 ms ±  19.2 ms    [User: 421.7 ms, System: 17.5 ms]
  Range (min … max):   406.7 ms … 475.7 ms    30 runs

$ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
  Time (mean ± σ):     116.8 ms ±  23.8 ms    [User: 101.0 ms, System: 15.7 ms]
  Range (min … max):    91.4 ms … 152.4 ms    30 runs
In this case, clang with the current interpreter is much slower than both GCC and the bytecode interpreter. Note the logarithmic Y axis in Figure 5.

GCC is almost four times as fast as the bytecode interpreter here. Either it removes the heap allocation entirely, or its new and delete expressions are simply implemented very efficiently. I have not profiled clang with the current interpreter thoroughly, but I can imagine that zero-initializing the APValue (the data structure clang uses to represent compile-time evaluated values) for the array alone takes that long.
Further work
There are still a few features that need to be implemented, and definitely a few underlying problems left to solve. I’ve added a clang:bytecode label on GitHub to track several of them in the upstream LLVM repository.
I'll continue working on this project, and I’ve also asked for help on the LLVM Discourse. Other people have already implemented a lot of the recent vector elementwise builtins, for example. So if you’re interested, you can always try to compile your favorite C or C++ project with -fexperimental-new-constant-interpreter and report any problems (or successes!) you find. Any implementation help is also highly appreciated, of course.
Part of me hopes that this is the last blog post I write about this topic, but see you next year!