Optimizing the Clang compiler’s line-to-offset mapping

Recently, I've been trying to improve the speed of the Clang compiler for C and C++. When I profile the Clang pre-processing step on a large file, one function quickly stands out:

clang::LineOffsetMapping::get(llvm::MemoryBufferRef Buffer, llvm::BumpPtrAllocator &Alloc)

This function basically allocates a vector (through Alloc) that maps line numbers to offsets in a file (loaded in Buffer). That's a surprisingly standalone function, so it's easy to extract it into a micro-benchmark and go on an optimization journey. This article is a kind of logbook of that trip.

The problem, and the naive implementation

I describe the purpose of the LineOffsetMapping function this way:

Given a file loaded in a string Buffer, create a vector LineOffsets of size N, where N is the number of lines in Buffer, so that LineOffsets[I] contains the byte offset where line I begins.

And because cross-platform support matters, let's add some fun with:

A file can contain any combination of line delimiters including \n, \r, and \r\n.

A naive implementation—and that's actually the one used in clang 12.0.0—could be:

std::vector<unsigned> LineOffsets;
LineOffsets.push_back(0);

const unsigned char *Buf = (const unsigned char *)Buffer.data();
const std::size_t BufLen = Buffer.size();

unsigned I = 0;
while (I < BufLen) {
  if (Buf[I] == '\n') {
    LineOffsets.push_back(I + 1);
  } else if (Buf[I] == '\r') {
    // If this is \r\n, skip both characters.
    if (I + 1 < BufLen && Buf[I + 1] == '\n')
      ++I;
    LineOffsets.push_back(I + 1);
  }
  ++I;
}

It's simple enough: Read the file, one byte at a time, check its content, and record new lines. But can we do better?
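
To make the expected behavior concrete, here's a quick sanity check of that loop on a string mixing all three delimiters. The computeLineOffsets wrapper is just a name introduced for this example; it is not part of Clang:

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper around the naive loop above, for illustration only.
static std::vector<unsigned> computeLineOffsets(const std::string &Buffer) {
  std::vector<unsigned> LineOffsets{0};
  const unsigned char *Buf = (const unsigned char *)Buffer.data();
  const std::size_t BufLen = Buffer.size();
  for (unsigned I = 0; I < BufLen; ++I) {
    if (Buf[I] == '\n') {
      LineOffsets.push_back(I + 1);
    } else if (Buf[I] == '\r') {
      if (I + 1 < BufLen && Buf[I + 1] == '\n')
        ++I;
      LineOffsets.push_back(I + 1);
    }
  }
  return LineOffsets;
}

int main() {
  // Mixed delimiters: "\r\n" (DOS), "\n" (Unix), and a lone "\r".
  auto LineOffsets = computeLineOffsets("a\r\nb\nc\rd");
  // Line 0 starts at offset 0, line 1 after "a\r\n" (offset 3),
  // line 2 after "b\n" (offset 5), line 3 after "c\r" (offset 7).
  assert((LineOffsets == std::vector<unsigned>{0, 3, 5, 7}));
}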

Manual profile-guided optimization

Because the input to Clang is source code, the ratio of newline versus non-newline characters will be in favor of the latter. For instance, in the LLVM source code itself:

$ find llvm -name '*.cpp' -exec cat {} \; | tr -cd '\n' | wc -c
2130902
$ find llvm -name '*.cpp' -exec cat {} \; | wc -c
78537977

That's roughly 2.7% newlines.

We can take that property into account using the __builtin_expect(expression, value) intrinsic to guide the optimization process. We can also take advantage of the coincidence that the two characters between \n (0x0A) and \r (0x0D) in the ASCII table—vertical tab (0x0B) and form feed (0x0C)—are pretty uncommon, and use a fast check that catches both newline characters:

unsigned I = 0;
while (I < BufLen) {
  // Fuzzy check that catches the whole '\n'..'\r' range (0x0A-0x0D); the cast
  // to unsigned char keeps bytes below '\n' from passing the comparison.
  if (__builtin_expect((unsigned char)(Buf[I] - '\n') <= ('\r' - '\n'), 0)) {
    if (Buf[I] == '\n') {
      LineOffsets.push_back(I + 1);
    } else if (Buf[I] == '\r') {
      if (I + 1 < BufLen && Buf[I + 1] == '\n')
        ++I;
      LineOffsets.push_back(I + 1);
    }
  }
  ++I;
}

The fuzzy check is going to fail most of the time, and the compiler knows it, which leads to more efficient code.

Note: It's tempting to use an even faster check, Buf[I] <= '\r'. However, tab characters have the ASCII code 0x09 and would be caught by the check. That would be a good punishment for people using tab characters, but not a very gentle move.

Adding a fast path

As a Linux user, I find it frustrating to pay for DOS- and classic Mac OS-style newlines that never make their way into my codebase. It's possible to update the algorithm to do a quick scan for the \r character, and use a fast path if there is none. If the file does contain the character and uses a consistent newline style, the character is going to be found very quickly, so the cost should be low:

if (!memchr(Buf, '\r', BufLen)) {
  while (I < BufLen) {
    if (__builtin_expect(Buf[I] == '\n', 0)) {
      LineOffsets.push_back(I + 1);
    }
    ++I;
  }
}
else {
  // ... one of the versions above, or below.
}

Looking through history

The Clang source code contains an optimization hint: At some point, clang::LineOffsetMapping::get had an SSE version, and it got removed for the sake of maintainability in revision d906e731.

This version works hard to use aligned loads, which was an important optimization point for older architectures. Since then, the performance cost of unaligned loads on some modern Intel architectures has decreased, so let's try an SSE version that's easier to read and maintain:

#include <emmintrin.h>

// Some renaming to help the reader not familiar with SSE
#define VBROADCAST(v) _mm_set1_epi8(v)
#define VLOAD(v) _mm_loadu_si128((const __m128i*)(v))
#define VOR(x, y) _mm_or_si128(x, y)
#define VEQ(x, y) _mm_cmpeq_epi8(x, y)
#define VMOVEMASK(v) _mm_movemask_epi8(v)

const auto LFs = VBROADCAST('\n');
const auto CRs = VBROADCAST('\r');

while (I + sizeof(LFs) + 1 < BufLen) {
  auto Chunk1 = VLOAD(Buf + I);
  auto Cmp1 = VOR(VEQ(Chunk1, LFs), VEQ(Chunk1, CRs));
  unsigned Mask = VMOVEMASK(Cmp1);

  if (Mask) {
    unsigned N = __builtin_ctz(Mask);
    I += N;
    I += ((Buf[I] == '\r') && (Buf[I + 1] == '\n')) ? 2 : 1;
    LineOffsets.push_back(I);
  }
  else
    I += sizeof(LFs);
}

In addition to SSE, this version uses the __builtin_ctz builtin, which counts the number of trailing zero bits in the mask and thus gives the index of the first matching character. Because we know that this character is either \n or \r, it's also possible to avoid a few comparisons and use a one-liner likely to be optimized into a conditional move (a.k.a. cmov) on x86.
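
To see the mask decoding in isolation: if the only newline in a 16-byte chunk sits at index 5, the byte-wise comparison plus movemask produces a mask with only bit 5 set, and counting trailing zeros recovers that index. Here is a tiny standalone check of that arithmetic, using the raw intrinsics rather than the macros above:

#include <cassert>
#include <emmintrin.h>

int main() {
  // 16 bytes of source-like text with a single '\n' at index 5.
  const char Chunk[16] = {'i', 'n', 't', ' ', 'x', '\n', 'y', ' ',
                          '=', ' ', '4', '2', ';', ' ', ' ', ' '};
  __m128i Bytes = _mm_loadu_si128((const __m128i *)Chunk);
  __m128i IsLF = _mm_cmpeq_epi8(Bytes, _mm_set1_epi8('\n'));
  unsigned Mask = _mm_movemask_epi8(IsLF); // bit i set <=> Chunk[i] == '\n'
  assert(Mask == (1u << 5));
  // __builtin_ctz counts trailing zero bits: the index of the first match.
  assert(__builtin_ctz(Mask) == 5);
}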

Bit hacks

The function we're trying to craft shares properties with a plain memchr. Let's pay tribute to the ancients and examine the glibc implementation. It uses a technique similar to vectorization, but with fewer portability issues (at the price of some brain gymnastics), sometimes called multibyte word packing.
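
The core trick behind this kind of word-at-a-time scanning is the classic zero-byte test: subtract 0x01 from every byte, mask with the complement of the word and with the high bits, and the result is nonzero exactly when some byte was zero. XORing the word with a repeated copy of the searched byte then reduces "find byte c" to "find a zero byte". A minimal sketch of the idea (not the actual glibc code):

#include <cassert>
#include <cstdint>
#include <cstring>

// Nonzero iff the 8-byte word V contains a zero byte (exact as a yes/no test).
static inline bool hasZeroByte(uint64_t V) {
  return ((V - 0x0101010101010101ULL) & ~V & 0x8080808080808080ULL) != 0;
}

// Does the word contain the byte C? XOR turns matching bytes into zero bytes.
static inline bool hasByte(uint64_t Word, unsigned char C) {
  return hasZeroByte(Word ^ (0x0101010101010101ULL * C));
}

int main() {
  uint64_t Word;
  std::memcpy(&Word, "int x;\n ", sizeof(Word)); // scan 8 bytes at a time
  assert(hasByte(Word, '\n'));
  assert(!hasByte(Word, '\r'));
}

Pinpointing which byte matched takes a bit more care, which is why the variant used below is only a "likely" test, followed by an exact check on the candidate byte.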

Bit hack lovers already know the delicious Hacker's Delight and its close friend Bit Twiddling Hacks. They contain many helpful recipes related to multibyte words, including a way to determine if a word has a byte between m and n. Let's apply that technique to our problem:

#include <cstdint>
#include <cstring>

template <class T>
static constexpr inline T likelyhasbetween(T x, unsigned char m,
                                           unsigned char n) {
  // see http://graphics.stanford.edu/~seander/bithacks.html#HasBetweenInWord
  return (((x) - ~static_cast<T>(0) / 255 * (n)) & ~(x) &
          (((x) & ~static_cast<T>(0) / 255 * 127) +
           ~static_cast<T>(0) / 255 * (127 - (m)))) &
         ~static_cast<T>(0) / 255 * 128;
}

uint64_t Word;

// scan sizeof(Word) bytes at a time for new lines.
// This is much faster than scanning each byte independently.
if (BufLen > sizeof(Word)) {
  do {
    memcpy(&Word, Buf + I, sizeof(Word));
#if defined(BYTE_ORDER) && defined(BIG_ENDIAN) && BYTE_ORDER == BIG_ENDIAN
    Word = __builtin_bswap64(Word); // some endianness love
#endif

    // No newline => jump over sizeof(Word) bytes.
    auto Mask = likelyhasbetween(Word, '\n' - 1, '\r' + 1);
    if (!Mask) {
      I += sizeof(Word);
      continue;
    }

    // Otherwise scan for the next newline - it's very likely there's one.
    // Note that according to
    // http://graphics.stanford.edu/~seander/bithacks.html#HasBetweenInWord,
    // likelyhasbetween may have false positives for the upper bound.
    unsigned N = __builtin_ctzl(Mask) - 7; // marker bit sits at 8*k+7, so N = 8*k
    Word >>= N;      // bring the suspect byte into the low byte
    I += N / 8 + 1;  // advance just past that byte
    unsigned char Byte = Word;
    if (Byte == '\n') {
      LineOffsets.push_back(I);
    } else if (Byte == '\r') {
      // If this is \r\n, skip both characters.
      if (Buf[I] == '\n')
        ++I;
      LineOffsets.push_back(I);
    }
  } while (I < BufLen - sizeof(Word) - 1);
}

The code is a bit longer and relies on an undocumented property of the likelyhasbetween function: in the returned mask, the byte at the position of a (likely) match is set to 0x80 and every other byte is zero. We can use that marker combined with __builtin_ctzl (the long counterpart of the __builtin_ctz builtin you've already met) to point right at the likely match, just like we did for the SSE version.
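
Here is that decoding step in isolation, on a little-endian machine. The helper is just the likelyhasbetween formula above specialized to uint64_t, so the example stays self-contained:

#include <cassert>
#include <cstdint>
#include <cstring>

// Same formula as the likelyhasbetween template above, fixed to uint64_t.
static inline uint64_t likelyHasBetween64(uint64_t X, unsigned char M,
                                          unsigned char N) {
  return ((X - ~0ULL / 255 * N) & ~X &
          ((X & ~0ULL / 255 * 127) + ~0ULL / 255 * (127 - M))) &
         ~0ULL / 255 * 128;
}

int main() {
  // Little-endian: byte K of the string ends up in bits [8*K, 8*K+7].
  uint64_t Word;
  std::memcpy(&Word, "x = 1;\ny", sizeof(Word)); // '\n' is byte 6

  uint64_t Mask = likelyHasBetween64(Word, '\n' - 1, '\r' + 1);
  // The marker is the high bit of the matched byte: bit 8*6 + 7 = 55.
  assert(__builtin_ctzl(Mask) == 55);

  // Shift the suspect byte down (N = 55 - 7 = 48) and check it exactly.
  unsigned N = __builtin_ctzl(Mask) - 7;
  unsigned char Byte = Word >> N;
  assert(Byte == '\n');
}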

Crunching numbers

So far, we have only explored optimization strategies without doing a single run or measurement. That's of course not how the journey actually took place. I set up a small repo to gather the various implementations (and a few extras). It provides automation for benchmarking and validating the implementations, using the SQLite amalgamation as input.
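
The harness itself is not reproduced here, but the measurement boils down to timing each candidate on the same large buffer. Here is a minimal sketch of that idea, assuming a wrapper with the same shape as the hypothetical computeLineOffsets helper from the first example (the input path is illustrative):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One candidate implementation, hidden behind a common signature.
std::vector<unsigned> computeLineOffsets(const std::string &Buffer);

int main() {
  // Load the input once, e.g. the SQLite amalgamation.
  std::ifstream In("sqlite3.c", std::ios::binary);
  std::stringstream SS;
  SS << In.rdbuf();
  const std::string Buffer = SS.str();

  constexpr int Runs = 100;
  std::size_t Lines = 0;
  auto Start = std::chrono::steady_clock::now();
  for (int R = 0; R < Runs; ++R)
    Lines = computeLineOffsets(Buffer).size(); // keep the result observable
  auto Stop = std::chrono::steady_clock::now();

  std::chrono::duration<double, std::milli> PerRun = (Stop - Start) / Runs;
  std::printf("%zu lines, %.2f ms per run\n", Lines, PerRun.count());
}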

Here are the results on my laptop, with a Core i7-8650U and gcc 8.2.1 (timings in milliseconds, averaged over 100 runs):

ref: 11.37
seq: 11.12
seq_memchr: 6.53
bithack: 4
bithack_scan: 4.78
sse_align: 5.08
sse: 3
sse_memchr: 3.7

ref is the reference implementation. The extra profiling hint and the fuzzy check in seq improve the situation, but only by a small margin. The fast path with a memchr check for \r yields an impressive speedup, but only for that specific case. Two bit hack variants are compared; the bithack version presented in this article is the faster of the two and comes close to the SSE version. The legacy SSE version that tries to match alignment constraints, sse_align, does not perform as well as the bithack version, because enforcing the alignment has a cost. And trying to add a fast path to the SSE version, sse_memchr, proves to be counterproductive.

Interestingly, but not surprisingly, the configuration matters a lot. On an arm64 Mac Mini, using Apple clang version 12.0.0, the results are quite different:

ref: 6.49
seq: 4
seq_memchr: 4
bithack: 2
bithack_scan: 2.05

The call to memchr doesn't have much of an impact there. The profile-guided seq version performs significantly better, and both bithack versions run roughly three times as fast as the reference implementation.

Conclusion

A patch has been submitted to LLVM to use the bithack version. Although it does not perform quite as well as the SSE version, the reviewers emphasized the benefit of an architecture-agnostic version that consistently outperforms the reference implementation, and the bithack version meets that criterion.

For more information on Clang and its LLVM back-end, please visit the Red Hat Developer topic page.

Acknowledgments

I'd like to thank Adrien Guinet for his valuable and careful review of this article o/.
