Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Optimizing the Clang compiler’s line-to-offset mapping

May 4, 2021
Serge Guelton
Related topics:
C, C#, C++Linux
Related products:
Developer Toolset

    Recently, I've been trying to improve the speed of the Clang compiler for C and C++. When I profile the Clang pre-processing step on a large file, one function quickly stands out:

    clang::LineOffsetMapping::get(llvm::MemoryBufferRef Buffer, llvm::BumpPtrAllocator &Alloc)

    This function basically allocates a vector (through Alloc) that maps line numbers to offsets in a file (loaded in Buffer). That's a surprisingly standalone function, so it's easy to extract it in a micro-benchmark and go for an optimization journey. This article is a kind of log book of that trip.

    The problem, and the naive implementation

    I describe the purpose of the LineOffsetMapping function this way:

    Given a file loaded in a string Buffer, create a vector LineOffsets of size N where N is the number of lines in Buffer, so that LineOffsets[I] contains the byte offset where that line begins.

    And because cross-platform matters, let's add some fun with:

    A file can contain any combination of line delimiters including \n, \r, and \r\n.

    A naive implementation—and that's actually the one used in clang 12.0.0—could be:

    std::vector<unsigned> LineOffsets;
    LineOffsets.push_back(0);
    
    const unsigned char *Buf = (const unsigned char *)Buffer.data();
    const std::size_t BufLen = Buffer.size();
    
    unsigned I = 0;
    while (I < BufLen) {
      if (Buf[I] == '\n') {
        LineOffsets.push_back(I + 1);
      } else if (Buf[I] == '\r') {
        // If this is \r\n, skip both characters.
        if (I + 1 < BufLen && Buf[I + 1] == '\n')
          ++I;
        LineOffsets.push_back(I + 1);
      }
      ++I;
    }
    

    It's simple enough: Read the file, one byte at a time, check its content, and record new lines. But can we do better?

    Manual profile-guided optimization

    Because the input to Clang is source code, the ratio of newline versus non-newline characters will be in favor of the latter. For instance, in the LLVM source code itself:

    $ find llvm -name '*.cpp' -exec cat {} \; | tr -cd '\n' | wc -c
    2130902
    $ find llvm -name '*.cpp' -exec cat {} \; | wc -c
    78537977

    That's roughly 2% of newlines.

    We can take that property into account using the __builtin_expect(expression, value) intrinsic to guide the optimization process. We can also take advantage of the coincidence that the two characters between \n and \r in the ASCII table—vertical tab (0x0B) and form feed (0x0C)—are pretty uncommon, and use a fast check that catches both newline characters:

    unsigned I = 0;
    while (I < BufLen) {
      // Use a fast check to catch both newlines
      if (__builtin_expect((Buf[I] - '\n') <= ('\r' - '\n'), 0)) {
        if (Buf[I] == '\n') {
          LineOffsets.push_back(I + 1);
        } else if (Buf[I] == '\r') {
          if (I + 1 < BufLen && Buf[I + 1] == '\n')
            ++I;
          LineOffsets.push_back(I + 1);
        }
      }
      ++I;
    }
    

    The fuzzy check is going to fail most of the time, and the compiler knows it, which leads to more efficient code.

    Note: It's tempting to use an even faster check, Buf[I] <= '\r'. However, tab characters have the ASCII code 0x09 and would be caught by the check. That would be a good punishment for people using tab characters, but not a very gentle move.

    Adding a fast path

    As a Linux user, I find it frustrating to pay for these DOS and OS X-style newlines that never make their way into my codebase. It's possible to update the algorithm to do a quick scan for the \r character, and use a fast path if there is none. If the file does have the character, and uses consistent newline style, the character going to be found very quickly, so the cost should be low:

    if(!memchr(Buf, '\r', BufLen)) {
      while (I < BufLen) {
        if (__builtin_expect(Buf[I] == '\n', 0)) {
          LineOffsets.push_back(I + 1);
        }
        ++I;
      }
    }
    else {
    // one of the version above... or below
    }
    

    Looking through history

    The Clang source code contains an optimization hint: At some point, clang::LineOffsetMapping::get had an SSE version, and it got removed for the sake of maintainability in revision d906e731.

    This version works hard to use aligned load, which was an important optimization point for older architectures. Since then, the performance cost of unaligned loads on some modern Intel architectures has decreased, so let's try an SSE version that's easier to read and maintain:

    #include <emmintrin.h>
    
    // Some renaming to help the reader not familiar with SSE
    #define VBROADCAST(v) _mm_set1_epi8(v)
    #define VLOAD(v) _mm_loadu_si128((const __m128i*)(v))
    #define VOR(x, y) _mm_or_si128(x, y)
    #define VEQ(x, y) _mm_cmpeq_epi8(x, y)
    #define VMOVEMASK(v) _mm_movemask_epi8(v)
    
    const auto LFs = VBROADCAST('\n');
    const auto CRs = VBROADCAST('\r');
    
    while (I + sizeof(LFs) + 1 < BufLen) {
      auto Chunk1 = VLOAD(Buf + I);
      auto Cmp1 = VOR(VEQ(Chunk1, LFs), VEQ(Chunk1, CRs));
      unsigned Mask = VMOVEMASK(Cmp1) ;
    
      if(Mask) {
        unsigned N = __builtin_ctz(Mask);
        I += N;
        I += ((Buf[I] == '\r') && (Buf[I + 1] == '\n'))? 2 : 1;
        LineOffsets.push_back(I);
      }
      else
        I += sizeof(LFs);
    }
    

    In addition to SSE, this version uses the __builtin_ctz builtin that counts the number of trailing zeroes, and thus allows direct access to the matching character. Because we know that this character is either going to be \n or \r, it's also possible to avoid a few comparisons and use a one-liner likely to be optimized to a conditional mov (a.k.a. cmov) on x86.

    Bit hacks

    The function we're trying to craft shares properties with a plain memchr. Let's pay tribute to the ancient and examine the glibc implementation. It uses a technique similar to vectorization, but with fewer portability issues (at the price of some brain gymnastics), sometimes called multibyte word packing.

    Bit hack lovers already know the delicious Hacker's Delight and its close friend Bit Twiddling Hacks. They contain many helpful recipes related to multibyte words, including a way to determine if a word has a byte between m and n. Let's apply that technique to our problem:

    template <class T>
    static constexpr inline T likelyhasbetween(T x, unsigned char m,
    unsigned char n) {
    // see http://graphics.stanford.edu/~seander/bithacks.html#HasBetweenInWord
      return (((x) - ~static_cast<T>(0) / 255 * (n)) & ~(x) &
              ((x) & ~static_cast<T>(0) / 255 * 127) + ~static_cast<T>(0) / 255 * (127 - (m))) &  
              ~static_cast<T>(0) / 255 * 128;
    }
    
    uint64_t Word;
    
    // scan sizeof(Word) bytes at a time for new lines.
    // This is much faster than scanning each byte independently.
    if (BufLen > sizeof(Word)) {
      do {
        memcpy(&Word, Buf + I, sizeof(Word));
    #if defined(BYTE_ORDER) && defined(BIG_ENDIAN) && BYTE_ORDER == BIG_ENDIAN
        Word = __builtin_bswap64(Word); // some endianness love
    #endif
    
        // no new line => jump over sizeof(Word) bytes.
        auto Mask = likelyhasbetween(Word, '\n' - 1, '\r'+1 );
        if (!Mask) {
          I += sizeof(Word);
          continue;
        }
    
        // Otherwise scan for the next newline - it's very likely there's one.
        // Note that according to
        // http://graphics.stanford.edu/~seander/bithacks.html#HasBetweenInWord,
        // likelyhasbetween may have false positive for the upper bound.
        unsigned N = __builtin_ctzl(Mask) - 7;
        Word >>= N;
        I += N / 8 + 1;
        unsigned char Byte = Word;
        if (Byte == '\n') {
          LineOffsets.push_back(I);
        } else if (Byte == '\r') {
          // If this is \r\n, skip both characters.
          if (Buf[I] == '\n')
            ++I;
          LineOffsets.push_back(I);
        }
      }
      while (I < BufLen - sizeof(Word) - 1);
    }

    The code is a bit longer and relies on an undocumented property of the likelyhasbetween function: It sets the byte holding one of the searched values to 0x80. We can use that marker combined with __builtin_ctzl (the long counterpart of the __builtin_ctz builtin you've already met) to point right at the likely match, just like we did for the SSE version.

    Crunching numbers

    So far, we only explored optimization strategies without doing a single run, profiling, or anything. That's of course not how the travel actually took place. I set up a small repo to gather the various implementations (and a few extras). It provides automation for benchmarking and validating some implementations, using the SQLite amalgamation as input.

    Here is the result of the run on my laptop, running a Core i7-8650U and gcc 8.2.1 (timing in milliseconds, average on 100 runs):

    ref: 11.37
    seq: 11.12
    seq_memchr: 6.53
    bithack: 4
    bithack_scan: 4.78
    sse_align: 5.08
    sse: 3
    sse_memchr: 3.7
    

    ref is the reference implementation. The extra profiling hint and the fuzzy check in seq slightly improve the situation, but with a minor margin. The fast path with a memchr check for \r yields impressive speedup, but only for one specific case. Two versions of bit hacks are compared, showing that the bithack version presented in this article is almost as fast as the SSE version. The legacy SSE version that tries to match alignment constraints, sse_align, does not perform as well as the bithack version, because matching the alignment has a cost. And trying to use a fast path on the SSE version in sse_memchr proves to be counterproductive.

    Interestingly, but not surprisingly, the configuration matters a lot. On a Mac Mini (on arm64), using the Apple clang version 12.0.0, the results are quite different:

    ref: 6.49
    seq: 4
    seq_memchr: 4
    bithack: 2
    bithack_scan: 2.05
    

    The call to memchr doesn't have much of an impact there. The profile-guided version from seq performs significantly better, and both bithack versions perform three times as fast as the reference version.

    Conclusion

    A patch has been submitted to LLVM to use the bithack version. Although it does not perform as well as the SSE version, the reviewers emphasize the benefit of having an architecture agnostic version that consistently outperforms the reference implementation, and the bithack version matches these criteria.

    For more information on Clang and its LLVM back-end, please visit the Red Hat Developer topic page.

    Acknowledgments

    I'd like to thanks Adrien Guinet for his precious and accurate review of this article o/.

    Last updated: February 5, 2024

    Recent Posts

    • Protect data offloaded to GPU-accelerated environments with OpenShift sandboxed containers

    • Case study: Measuring energy efficiency on the x64 platform

    • How to prevent AI inference stack silent failures

    • Preventing GPU waste: A guide to JIT checkpointing with Kubeflow Trainer on OpenShift AI

    • How to manage TLS certificates used by OpenShift GitOps operator

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.