Skip to main content
Redhat Developers  Logo
  • Products

    Featured

    • Red Hat Enterprise Linux
      Red Hat Enterprise Linux Icon
    • Red Hat OpenShift AI
      Red Hat OpenShift AI
    • Red Hat Enterprise Linux AI
      Linux icon inside of a brain
    • Image mode for Red Hat Enterprise Linux
      RHEL image mode
    • Red Hat OpenShift
      Openshift icon
    • Red Hat Ansible Automation Platform
      Ansible icon
    • Red Hat Developer Hub
      Developer Hub
    • View All Red Hat Products
    • Linux

      • Red Hat Enterprise Linux
      • Image mode for Red Hat Enterprise Linux
      • Red Hat Universal Base Images (UBI)
    • Java runtimes & frameworks

      • JBoss Enterprise Application Platform
      • Red Hat build of OpenJDK
    • Kubernetes

      • Red Hat OpenShift
      • Microsoft Azure Red Hat OpenShift
      • Red Hat OpenShift Virtualization
      • Red Hat OpenShift Lightspeed
    • Integration & App Connectivity

      • Red Hat Build of Apache Camel
      • Red Hat Service Interconnect
      • Red Hat Connectivity Link
    • AI/ML

      • Red Hat OpenShift AI
      • Red Hat Enterprise Linux AI
    • Automation

      • Red Hat Ansible Automation Platform
      • Red Hat Ansible Lightspeed
    • Developer tools

      • Red Hat Trusted Software Supply Chain
      • Podman Desktop
      • Red Hat OpenShift Dev Spaces
    • Developer Sandbox

      Developer Sandbox
      Try Red Hat products and technologies without setup or configuration fees for 30 days with this shared Openshift and Kubernetes cluster.
    • Try at no cost
  • Technologies

    Featured

    • AI/ML
      AI/ML Icon
    • Linux
      Linux Icon
    • Kubernetes
      Cloud icon
    • Automation
      Automation Icon showing arrows moving in a circle around a gear
    • View All Technologies
    • Programming Languages & Frameworks

      • Java
      • Python
      • JavaScript
    • System Design & Architecture

      • Red Hat architecture and design patterns
      • Microservices
      • Event-Driven Architecture
      • Databases
    • Developer Productivity

      • Developer productivity
      • Developer Tools
      • GitOps
    • Secure Development & Architectures

      • Security
      • Secure coding
    • Platform Engineering

      • DevOps
      • DevSecOps
      • Ansible automation for applications and services
    • Automated Data Processing

      • AI/ML
      • Data Science
      • Apache Kafka on Kubernetes
      • View All Technologies
    • Start exploring in the Developer Sandbox for free

      sandbox graphic
      Try Red Hat's products and technologies without setup or configuration.
    • Try at no cost
  • Learn

    Featured

    • Kubernetes & Cloud Native
      Openshift icon
    • Linux
      Rhel icon
    • Automation
      Ansible cloud icon
    • Java
      Java icon
    • AI/ML
      AI/ML Icon
    • View All Learning Resources

    E-Books

    • GitOps Cookbook
    • Podman in Action
    • Kubernetes Operators
    • The Path to GitOps
    • View All E-books

    Cheat Sheets

    • Linux Commands
    • Bash Commands
    • Git
    • systemd Commands
    • View All Cheat Sheets

    Documentation

    • API Catalog
    • Product Documentation
    • Legacy Documentation
    • Red Hat Learning

      Learning image
      Boost your technical skills to expert-level with the help of interactive lessons offered by various Red Hat Learning programs.
    • Explore Red Hat Learning
  • Developer Sandbox

    Developer Sandbox

    • Access Red Hat’s products and technologies without setup or configuration, and start developing quicker than ever before with our new, no-cost sandbox environments.
    • Explore Developer Sandbox

    Featured Developer Sandbox activities

    • Get started with your Developer Sandbox
    • OpenShift virtualization and application modernization using the Developer Sandbox
    • Explore all Developer Sandbox activities

    Ready to start developing apps?

    • Try at no cost
  • Blog
  • Events
  • Videos

Improving math performance in glibc

January 2, 2015
Siddhesh Poyarekar

Share:

    Update: Vincent Lefèvre helpfully pointed out that I had linked to the incorrect Worst Cases paper. That link is now fixed.

    Update 2: Dan Courcy pointed out that my equation in the "Multiplying zeroes" section had an error, which I have now fixed.

    Mathematical function implementations usually have to trade off between speed of computation and the accuracy of the result. This is especially true for transcendentals (i.e. the exponential and trigonometric functions), where results often have to be computed to a fairly large precision to get last bit accuracy in a result that is to be stored in an IEEE-754 double variable.

    gnu logoTranscendental functions in glibc are implemented as multiple phases. The first phases of computation use a lookup table of pre-computed values and a polynomial approximation, a combination of which gives an accurate result for a majority of inputs. If it is found that the lookup table may not give an accurate result the next, slower phase is employed. This phase uses a multiple precision representation to compute results to precisions of up to 768 bits before rounding the result to double. As expected, this kills performance; the slowest path for pow for example is a few thousand times slower than the table lookup phase.

    I looked into this problem not very long ago and tried to improve performance of the multiple precision phase so that it is not as bad as it currently is. The result of that was an improvement of about 8 times in the performance of the slowest path of the pow function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements went into glibc-2.18 upstream. We'll take a look at what these changes were but first, a quick look at the multiple precision number for context on the changes.

    The multiple precision number

    The multiple precision number is implemented in glibc as a structure with an integral exponent and an array of numbers for the mantissa.

    typedef struct {
      int e;
      double d[40];
    } mp_no;
    

    The exponent e can be any integer value and the array d represents the mantissa. Each number in d is a non-negative integer less than 224. Only 32 of the 40 digits are used (the rest being there for additional precision for rounding), giving a maximum precision of 768 bits. The first question one may have is why the mantissa digits are double and not int if their values are going to be only integral. We'll get to that point later, when we look at one of the relevant improvements.

    The multiple precision bits implement addition, subtraction and multiplication as primitives - division is implemented as multiplication of one number with the inverse of the other. A simple analysis of test programs using oprofile or gprof show that the slow path performance is heavily dependent on the performance of the multiplication function. As much as 97% of the time in computation of pow is spent in the multiplication function! So it is obvious that the first place to look at to get improvements is the multiplication routine.

    The multiplication routine is actually quite simple, but that is its bane. It emulates the long multiplication algorithm (the method most of us must have learned in school) using a nested loop on the mantissa digits. The inner loop adds the results of the multiplication of individual digits along the column, divides the result by 224, keeps the remainder and adds the quotient to the higher digit. The outer loop traverses across the digits. To make the algorithm simple, it is assumed that all of the 32 digits are in use and hence the loop runs over all of the 32 digits.

    Multiplying zeroes is obviously a waste

    When a number is converted from double to the mp_no format, only 3 mantissa digits are in use. Initial computations don't use much more than that and in fact for smaller multiple precision phases (240 bits for example), only 10 digits are in use. Despite that, the multiplication algorithm iterates over all of the 32 digits all the time, which is wasteful. So the first optimization was to find the longest digit sequence among the two multiplicands and iterate only up to that point. This is an extra iteration, but for every run of length m on digit sequence length n, we save n2 - m2 - 2 * m operations.

    Squaring is a special case of multiplication

    The long form multiplication algorithm has a neat sequence, which allows us to loop through the digits in nested loops to come up with the result. The sequence in the inner loop is of the form:

    R[k] = SUM(X[i] * Y[j])

    where i goes from 1 to k - 1 and j goes the other way, from k - 1 to 1. It is easy to see that if X and Y were the same, i.e. we were squaring a number, we just have to run the inner loop halfway and then double the result, adding in the remaining digit if there are an odd number of digits. Since squaring was not an uncommon operation, I wrote a separate function for it that was a special case of multiplication.

    Karatsuba-influenced multiplication

    Let us look at the SUM operation above a bit closer. It can be written as:

    R[k] = X[1] * Y[k - 1] + X[2] * Y[K - 2] + ... + X[k - 1] * Y[1]

    rearranging the terms and grouping them, we can write the above as:

    R[k] = (X[1]*Y[k - 1] + X[k-1]*Y[1]) + (X[2]*Y[k-2] + X[k-2]*Y[2]) + ...

    Now, we know that (X[i] + Y[i]) * (X[j] + Y[j]) is X[i]*X[j] + Y[i]*Y[j] + X[i]*Y[j] + X[j]*Y[i], so we can write the above sum as:

    R[k] = (X[1] + Y[1])*(X[k-1] + Y[k-1]) + (X[2] + Y[2])*(X[k-2] + Y[k-2]) + ...
           - SUM(X[i]*Y[i])

    where i goes from 1 to k-1. With this notation, I could precompute the smaller summation into an array and then the remaining computations reduce by aout 20%, resulting in a net improvement in performance.

    Faster polynomial evaluation for exp

    This improvement was specific to the exp function, but the result improves pow too, since pow uses exp. The old implementation computed the Taylor series result as follows:

    ex = 1 + x (1 + x / 2 (1 + x / 3 (1 + x/4 ...)))

    This involved one multiple precision multiplication and one multiple precision division in each iteration. This can be improved by computing the result as the following instead:

    ex = 1 + (x * (n!/1! + x * (n!/2! + x * (n!/3! + x ...)))) / n!

    Here, the factorials can be computed on the fly with a single primitive multiplication, which leaves just one multiple precision multiplication.

    Configurable mantissa types

    We finally come to the mantissa types problem that I had alluded to in the beginning of the post. The mantissa digits were doubles even though they were only required to store and operate on integers. Even the intermediate operations would at best require int64_t storage to accommodate intermediate values of digits during multiplication. I started by replacing those with long, but found that performance on powerpc-based systems dropped. This reason for this is that a Power processor has 4 floating point units for a single fixed point unit, making floating point operations heavily parallelizable.

    So the solution in the end was to have configurable types for mantissa digits, with powerpc sticking to the double type and everything else using int64_t. This again gave a fairly big boost to performance on x86 (~30%) since fixed point computation is significantly faster on x86.

    The road ahead

    This is definitely not the end of the road for improvements in the multiple precision performance. However, I believe I have picked a lot of the low hanging fruit. There are other more complicated improvements, like the limitation of worst case precision for exp and log functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. One needs to prove that those results apply to the glibc multiple precision bits.

    Recent Posts

    • Expand Model-as-a-Service for secure enterprise AI

    • OpenShift LACP bonding performance expectations

    • Build container images in CI/CD with Tekton and Buildpacks

    • How to deploy OpenShift AI & Service Mesh 3 on one cluster

    • JVM tuning for Red Hat Data Grid on Red Hat OpenShift 4

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Products

    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform

    Build

    • Developer Sandbox
    • Developer Tools
    • Interactive Tutorials
    • API Catalog

    Quicklinks

    • Learning Resources
    • E-books
    • Cheat Sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site Status Dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2025 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Report a website issue