## Improving math performance in glibc

**Update:** Vincent Lefèvre helpfully pointed out that I had linked to the incorrect Worst Cases paper. That link is now fixed.

**Update 2:** Dan Courcy pointed out that my equation in the “Multiplying zeroes” section had an error, which I have now fixed.

Mathematical function implementations usually have to trade off between speed of computation and the accuracy of the result. This is especially true for transcendentals (i.e. the exponential and trigonometric functions), where results often have to be computed to a fairly large precision to get last bit accuracy in a result that is to be stored in an IEEE-754 double variable.

Transcendental functions in glibc are implemented as multiple phases. The first phases of computation use a lookup table of pre-computed values and a polynomial approximation, a combination of which gives an accurate result for a majority of inputs. If it is found that the lookup table may not give an accurate result the next, slower phase is employed. This phase uses a multiple precision representation to compute results to precisions of up to 768 bits before rounding the result to double. As expected, this kills performance; the slowest path for `pow`

for example is a few thousand times slower than the table lookup phase.

I looked into this problem not very long ago and tried to improve performance of the multiple precision phase so that it is not as bad as it currently is. The result of that was an improvement of about 8 times in the performance of the slowest path of the `pow`

function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements went into glibc-2.18 upstream. We’ll take a look at what these changes were but first, a quick look at the multiple precision number for context on the changes.

## The multiple precision number

The multiple precision number is implemented in glibc as a structure with an integral exponent and an array of numbers for the mantissa.

typedef struct { int e; double d[40]; } mp_no;

The exponent e can be any integer value and the array d represents the mantissa. Each number in d is a non-negative integer less than 2^{24}. Only 32 of the 40 digits are used (the rest being there for additional precision for rounding), giving a maximum precision of 768 bits. The first question one may have is why the mantissa digits are double and not int if their values are going to be only integral. We’ll get to that point later, when we look at one of the relevant improvements.

The multiple precision bits implement addition, subtraction and multiplication as primitives – division is implemented as multiplication of one number with the inverse of the other. A simple analysis of test programs using oprofile or gprof show that the slow path performance is heavily dependent on the performance of the multiplication function. As much as 97% of the time in computation of `pow`

is spent in the multiplication function! So it is obvious that the first place to look at to get improvements is the multiplication routine.

The multiplication routine is actually quite simple, but that is its bane. It emulates the long multiplication algorithm (the method most of us must have learned in school) using a nested loop on the mantissa digits. The inner loop adds the results of the multiplication of individual digits along the column, divides the result by 2^{24}, keeps the remainder and adds the quotient to the higher digit. The outer loop traverses across the digits. To make the algorithm simple, it is assumed that all of the 32 digits are in use and hence the loop runs over all of the 32 digits.

## Multiplying zeroes is obviously a waste

When a number is converted from double to the mp_no format, only 3 mantissa digits are in use. Initial computations don’t use much more than that and in fact for smaller multiple precision phases (240 bits for example), only 10 digits are in use. Despite that, the multiplication algorithm iterates over all of the 32 digits all the time, which is wasteful. So the first optimization was to find the longest digit sequence among the two multiplicands and iterate only up to that point. This is an extra iteration, but for every run of length m on digit sequence length n, we save n^{2} – m^{2} – 2 * m operations.

## Squaring is a special case of multiplication

The long form multiplication algorithm has a neat sequence, which allows us to loop through the digits in nested loops to come up with the result. The sequence in the inner loop is of the form:

R[k] = SUM(X[i] * Y[j])

where i goes from 1 to k – 1 and j goes the other way, from k – 1 to 1. It is easy to see that if X and Y were the same, i.e. we were squaring a number, we just have to run the inner loop halfway and then double the result, adding in the remaining digit if there are an odd number of digits. Since squaring was not an uncommon operation, I wrote a separate function for it that was a special case of multiplication.

## Karatsuba-influenced multiplication

Let us look at the SUM operation above a bit closer. It can be written as:

R[k] = X[1] * Y[k - 1] + X[2] * Y[K - 2] + ... + X[k - 1] * Y[1]

rearranging the terms and grouping them, we can write the above as:

R[k] = (X[1]*Y[k - 1] + X[k-1]*Y[1]) + (X[2]*Y[k-2] + X[k-2]*Y[2]) + ...

Now, we know that `(X[i] + Y[i]) * (X[j] + Y[j])`

is `X[i]*X[j] + Y[i]*Y[j] + X[i]*Y[j] + X[j]*Y[i]`

, so we can write the above sum as:

R[k] = (X[1] + Y[1])*(X[k-1] + Y[k-1]) + (X[2] + Y[2])*(X[k-2] + Y[k-2]) + ... - SUM(X[i]*Y[i])

where i goes from 1 to k-1. With this notation, I could precompute the smaller summation into an array and then the remaining computations reduce by aout 20%, resulting in a net improvement in performance.

## Faster polynomial evaluation for exp

This improvement was specific to the `exp`

function, but the result improves `pow`

too, since `pow`

uses `exp`

. The old implementation computed the Taylor series result as follows:

e^{x}= 1 + x (1 + x / 2 (1 + x / 3 (1 + x/4 ...)))

This involved one multiple precision multiplication and one multiple precision division in each iteration. This can be improved by computing the result as the following instead:

e^{x}= 1 + (x * (n!/1! + x * (n!/2! + x * (n!/3! + x ...)))) / n!

Here, the factorials can be computed on the fly with a single primitive multiplication, which leaves just one multiple precision multiplication.

## Configurable mantissa types

We finally come to the mantissa types problem that I had alluded to in the beginning of the post. The mantissa digits were doubles even though they were only required to store and operate on integers. Even the intermediate operations would at best require int64_t storage to accommodate intermediate values of digits during multiplication. I started by replacing those with long, but found that performance on powerpc-based systems dropped. This reason for this is that a Power processor has 4 floating point units for a single fixed point unit, making floating point operations heavily parallelizable.

So the solution in the end was to have configurable types for mantissa digits, with powerpc sticking to the double type and everything else using int64_t. This again gave a fairly big boost to performance on x86 (~30%) since fixed point computation is significantly faster on x86.

## The road ahead

This is definitely not the end of the road for improvements in the multiple precision performance. However, I believe I have picked a lot of the low hanging fruit. There are other more complicated improvements, like the limitation of worst case precision for `exp`

and `log`

functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. One needs to prove that those results apply to the glibc multiple precision bits.

Join the Red Hat Developer Program (it’s free) and get access to related cheat sheets, books, and product downloads.