Clang bytecode interpreter update

October 15, 2025
Timm Baeder
Related topics: Linux
Related products: Red Hat Enterprise Linux

    It’s October again, so let me tell you what happened with the clang bytecode interpreter this year. In case this is the first time you've encountered this topic: it is a project to evaluate constant expressions at compile time using a bytecode interpreter in clang. The work is already available in all recent clang versions when -fexperimental-new-constant-interpreter is passed.
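    Trying it out is as simple as adding that one flag to an ordinary compile (a plain invocation, mirroring the ones used in the benchmarks later in this post):

    $ clang -c example.cpp -std=c++20 -fexperimental-new-constant-interpreter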

    Previous articles in this series:

    • A new constant expression interpreter for Clang (2022)
    • Part 2 (2023)
    • Part 3 (2024)

    According to Git, there have been roughly 500 commits since my last blog post. While there was no single huge new feature or breakthrough, the implementation has gotten a lot more solid. I’ve also done a good chunk of performance work this year.

    While we still had approximately 155 failures in the clang test suite last year, we are now down to 90. You can track the progress here. We now have an actual working implementation of __builtin_constant_p. And while it does disagree with the current interpreter in a few edge cases, it should be enough for all real-world use cases (see the section about libc++ testing).
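    As a quick illustration of what now works (a minimal sketch, relying on the builtin being usable in constant expressions):

    constexpr int n = 42;
    // __builtin_constant_p reports whether its argument is a compile-time
    // constant; the bytecode interpreter can now evaluate it itself.
    static_assert(__builtin_constant_p(n));
    static_assert(__builtin_constant_p(n + 1));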

    In the remainder of this blog post, let's have a more detailed look at a few of the changes that happened in the last year.

    Optimizing reads from known sources

    When looking at constant expressions (and especially functions), one might have the urge to optimize things away. Let's say we profile the interpreter and figure out that integer additions are particularly inefficient. We might want to check whether the right-hand side of a += compound assignment is 0 and, if it is, not emit any bytecode for the addition.

    constexpr int test() {
      int a;
      a += 0;
      a += 1;
      return a;
    }
    static_assert(test() == 1);

    Not doing the first addition here would of course be faster, but if you look closely, you'll notice that a is left uninitialized, so the first addition already fails. Not emitting it would cause the diagnostic to appear for the second addition instead, which is wrong. You can imagine similar cases for divisions or multiplications by 1, a unary +, or discarded statements. If they can fail, we need to evaluate them and diagnose the failure.
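    A discarded statement shows the same trap (a minimal sketch): the result of the multiplication is thrown away, but evaluating it still reads an uninitialized variable, so it must be evaluated and diagnosed rather than optimized away.

    constexpr int bad() {
      int a;   // uninitialized
      a * 1;   // result discarded, but this still reads 'a'
      return 0;
    }
    // static_assert(bad() == 0); // ill-formed: read of an uninitialized variable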

    This is only one example, but in general we need to be cautious with optimizations at this level, because we need to diagnose all undefined behavior, and the quality of the diagnostic output is important.

    Let’s instead look at something we can optimize.

    constexpr int G = 12;
    constexpr int test(int P) {
      int L = 8;
      return G + L + P;
    }
    static_assert(test(5) == 25);

    This code is pretty straightforward and should work. It adds up the values of a global variable, a local variable, and a parameter. For all three values, the abstract syntax tree (AST) that clang generates consists of a node to get a pointer to the variable, and one to load from that variable. In detail, it looks something like this:

          `-ImplicitCastExpr 0x7d5284257a50 <col:18> 'int' <LValueToRValue>
            `-DeclRefExpr 0x7d5284257a28 <col:18> 'int' lvalue ParmVar 0x7d5284257660 'P' 'int'

    Here we see the lvalue for the parameter P and the lvalue-to-rvalue cast that loads the value from the parameter. We used to emit bytecode just like this:

    [..]
    48     GetPtrGlobal      0
    64     LoadPopSint32
    72     GetPtrLocal       40
    88     LoadPopSint32
    96     AddSint32
    104    GetPtrParam       0
    120    LoadPopSint32
    128    AddSint32
    136    Destroy           0
    152    RetSint32
    [..]

    For brevity, I've removed some less interesting opcodes at the beginning and at the end of the function. As you can see from the bytecode, we get a pointer (GetPtrLocal, GetPtrGlobal, GetPtrParam) for all three values, followed by a load of a 32-bit signed integer (LoadPopSint32). Those values are added up and returned from the function. This works just fine, but we can do better.

    Even before this optimization, we had opcodes to get the value of a variable without first getting a pointer to it. Instead of a GetPtrGlobal + LoadPop pair, we can emit just a GetGlobal opcode, which immediately pushes the value of the global variable onto the stack. This reduces traffic on our stack data structure and reduces the number of pointers we create. The latter is particularly important because we’re always tracking what memory a pointer points to, just in case that memory goes out of scope.

    Because the relevant opcodes already existed, we only needed to do some type checks when generating bytecode and the result looks like this:

    [...]
    48     GetGlobalSint32    0
    64     GetLocalSint32     40
    80     AddSint32
    88     GetParamSint32     0
    104    AddSint32
    112    Destroy            0
    128    RetSint32
    [...]

    This is nicer to read, and also more efficient. When evaluating a LoadPop opcode, we don’t know whether the pointer we load from points to a global variable, a local variable, a temporary, a parameter, and so on. We need to be conservative and do all possible checks. If we know where we load from, we can skip unnecessary work. For example, loading a value from an extern variable doesn’t generally work, so we need to check for this. But local variables cannot be extern, so we can skip this check in the GetLocal opcode.
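    To make the idea concrete, here is a self-contained toy interpreter (purely illustrative; none of this is clang's actual implementation) showing how fusing a get-pointer + load pair into one opcode saves a dispatch and avoids materializing a tracked pointer:

    #include <cstdint>
    #include <vector>

    enum class Op : uint8_t { GetPtrLocal, LoadPop, GetLocalSint32, Ret };
    struct Instr { Op Opcode; uint32_t Arg; };

    int32_t run(const std::vector<Instr> &Code, const std::vector<int32_t> &Locals) {
      std::vector<int32_t> Stack;
      const int32_t *Ptr = nullptr; // stand-in for a tracked Pointer object
      for (const Instr &I : Code) {
        switch (I.Opcode) {
        case Op::GetPtrLocal:        // unfused: first push a pointer...
          Ptr = &Locals[I.Arg];
          break;
        case Op::LoadPop:            // ...then load through it, checking everything
          Stack.push_back(*Ptr);
          break;
        case Op::GetLocalSint32:     // fused: load directly; checks that can't
          Stack.push_back(Locals[I.Arg]); // apply to locals (e.g., extern) are skipped
          break;
        case Op::Ret:
          return Stack.back();
        }
      }
      return 0;
    }

    // Both programs yield Locals[0]; the second needs one opcode fewer:
    //   run({{Op::GetPtrLocal, 0}, {Op::LoadPop, 0}, {Op::Ret, 0}}, {42});
    //   run({{Op::GetLocalSint32, 0}, {Op::Ret, 0}}, {42});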

    libc++ testing

    In January 2025, I started running the libc++ test suite in addition to the clang test suite. The libc++ tests provide more real-world use cases, whereas the clang test suite usually stress-tests edge cases of the C and C++ language specifications. As such, the initial run of the libc++ test suite resulted in a large number of failures (Figure 1).

    Figure 1: clang and libc++ test suite failures with the bytecode interpreter over time.

    We initially had over 750 test failures. A large chunk of those was caused by missing support for some string builtins; the libc++ test suite extensively tests std::string, after all. Through those and other tests, such as the ones for std::vector, it also makes heavy use of dynamic memory allocation at compile time.
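    The kind of code these tests exercise looks roughly like this (a made-up but representative snippet; compile-time std::string needs both the string builtins and constexpr heap allocation, so it requires C++20 and a recent standard library):

    #include <string>

    constexpr bool test_string() {
      std::string S = "hello";   // constexpr heap allocation
      S += ", world";            // growth may reallocate at compile time
      return S.size() == 12 && S.find("world") == 7;
    }
    static_assert(test_string());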

    Unfortunately, the libc++ test cases are usually exceedingly complex. Fixing them typically requires some variation of the following steps (see the sketch after the list):

    • Letting clang preprocess the test so the result is independent of include files and most command-line options. This usually results in a file that is way too large; the longest one I saw was around 300 KLOC.
    • Using a mixture of manual editing and cvise to reduce the test case. If you’ve never had to do something like this: good! Otherwise, cvise is great and you should know how to use it.
    • Debugging the reduced test case to figure out the root cause of the problem.
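    In shell terms, the reduction loop looks roughly like this. The "interestingness" test (file names and flags are illustrative, not my exact setup) returns 0 exactly when the current interpreter accepts the file but the bytecode interpreter rejects it:

    #!/bin/sh
    # interesting.sh: exit 0 (interesting) iff the current interpreter accepts
    # repro.cpp while the bytecode interpreter rejects it.
    clang++ -std=c++20 -fsyntax-only repro.cpp || exit 1
    ! clang++ -std=c++20 -fsyntax-only -fexperimental-new-constant-interpreter repro.cpp

    With that in place, preprocessing and reduction are two commands:

    $ clang++ -std=c++20 -E failing_test.cpp -o repro.cpp
    $ cvise interesting.sh repro.cpp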

    I often managed to reduce the failing test case to a self-contained test of just a few dozen lines. After three months, we reached the milestone of zero test failures.

    Because libc++ is in active development, it changes constantly. This means we sometimes see regressions in the number of test failures, but that's okay.

    The #embed benchmark

    Let's look at an example benchmark using #embed:

    constexpr char str[] = {
    #embed "sqlite3.c" suffix(,0)
    };
    consteval unsigned checksum(const char *s) {
      unsigned result = 0;
      for (const char *p = s; *p != '\0'; ++p) {
        result += *p;
      }
      return result;
    }
    constexpr unsigned C = checksum(str);

    This test case simply reads the sqlite3 amalgamation (which is almost 9 MB) into a global char array, then sums up all the character values into a variable C.

    Let's time this test with the current interpreter and then with the bytecode interpreter. All runs use a regular release build of clang with assertions disabled and no other special build flags.

    $ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000'
    Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000
      Time (mean ± σ):     36.490 s ±  0.361 s    [User: 36.161 s, System: 0.299 s]
      Range (min … max):   35.711 s … 37.611 s    30 runs
    $ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
    Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
      Time (mean ± σ):     14.799 s ±  0.274 s    [User: 14.484 s, System: 0.303 s]
      Range (min … max):   14.332 s … 15.345 s    30 runs

    So the bytecode interpreter needs less than half the time the current interpreter takes. That's good, but let's look at the numbers in a bit more detail (Figure 2).

    Figure 2: The runtimes of the embed benchmark. The bytecode interpreter takes less than half the time.

    Our test case consists of the first declaration of str, which initializes a large char array. That’s not a function call, so the bytecode interpreter doesn’t interpret bytecode here at all; instead, it evaluates the expressions as they come. This is similar to what the current interpreter does, but the bytecode interpreter has some additional overhead. If we look at the runtime of the initialization alone, the bytecode interpreter is slower (Figure 3).

    Figure 3: The runtimes of the embed benchmark without the function call. The bytecode interpreter is slower.

    The difference used to be much worse, but I've optimized the bytecode interpreter to be faster when checking large arrays for initialization.

    Heap allocation benchmark

    Here's another benchmark that I've been looking at in the past year:

    constexpr int x = []() {
      for (unsigned I = 0; I != 10'000; ++I) {
        char *buffer = new char[1024];
        for (unsigned c = 0; c != 1024; ++c)
          buffer[c] = 97 + (c % 26);
        delete[] buffer;
      }
      return 1;
    }();

    This is a pretty simple, if useless, benchmark of dynamic memory allocation. We allocate some memory, fill it with characters, and then free it again. This is all done in a lambda that is called immediately. The final output is just a variable with the value 1.

    This is a benchmark we can even time against GCC (I was using 14.3.1), so let's do that:

    $ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
    Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
      Time (mean ± σ):     27.792 s ±  0.263 s    [User: 27.732 s, System: 0.036 s]
      Range (min … max):   27.329 s … 28.287 s    30 runs
    $ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
    Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
      Time (mean ± σ):      4.241 s ±  0.103 s    [User: 4.218 s, System: 0.019 s]
      Range (min … max):    4.070 s …  4.504 s    30 runs
    $ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
    Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
      Time (mean ± σ):     43.829 s ±  0.308 s    [User: 42.170 s, System: 1.619 s]
      Range (min … max):   43.201 s … 44.556 s    30 runs

    Figure 4 illustrates the comparison.

    Figure 4: The runtimes of clang with the current interpreter, clang with the bytecode interpreter, and GCC. GCC is the slowest, followed by clang with the current interpreter. The bytecode interpreter is much faster than both.

    GCC has an interesting optimization that seems to eliminate the heap allocation altogether if it is unused. If we remove the loop initializing the allocated memory (and increase the allocation size from 1024 to 1024 * 1024), GCC beats both the new and old clang interpreters.
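    Reconstructed from that description, the modified benchmark looks like this (my reconstruction, not necessarily the exact file used):

    constexpr int x = []() {
      for (unsigned I = 0; I != 10'000; ++I) {
        // The buffer is never read or written, so its contents are dead.
        char *buffer = new char[1024 * 1024];
        delete[] buffer;
      }
      return 1;
    }();

    With that change, the numbers look like this: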

    $ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
    Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
      Time (mean ± σ):     1048.113 s ± 19.781 s    [User: 783.382 s, System: 263.909 s]
      Range (min … max):   1008.151 s … 1082.708 s    30 runs
    $ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
    Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
      Time (mean ± σ):     439.9 ms ±  19.2 ms    [User: 421.7 ms, System: 17.5 ms]
      Range (min … max):   406.7 ms … 475.7 ms    30 runs
    $ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
    Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
      Time (mean ± σ):     116.8 ms ±  23.8 ms    [User: 101.0 ms, System: 15.7 ms]
      Range (min … max):    91.4 ms … 152.4 ms    30 runs

    In this case, clang with the current interpreter is much slower than both GCC and the bytecode interpreter. Note the logarithmic Y axis in Figure 5.

    Figure 5: The runtimes of clang with the current interpreter, clang with the bytecode interpreter, and GCC. Both clang with the bytecode interpreter and GCC are vastly faster than clang with the current interpreter.

    GCC is almost four times as fast as the bytecode interpreter. Either it removes the heap allocation entirely, or its new and delete expressions are simply implemented very efficiently. I have not profiled clang with the current interpreter thoroughly, but I can imagine that zero-initializing the APValue (the data structure clang uses to represent compile-time evaluated values) for the array alone already takes that long.

    Further work

    There are still a few features that need to be implemented, and definitely a few underlying problems. I’ve added a clang:bytecode label on GitHub to track several of them in the upstream LLVM repository.

    I'll continue working on this project, and I’ve also asked for help on the LLVM Discourse. Other people have already implemented a lot of the recent vector elementwise builtins. So if you’re interested, you can always try to compile your favorite C or C++ project with -fexperimental-new-constant-interpreter and report any problems (or successes!) you find. Implementation help is also highly appreciated, of course.

    Part of me hopes that this is the last blog post I write about this topic, but see you next year!
