Red Hat Enterprise Linux 8.2 brings faster Python 3.8 run speeds

The Python interpreter shipped with Red Hat Enterprise Linux (RHEL) 8 is version 3.6, which was released in 2016. While Red Hat is committed to supporting the Python 3.6 interpreter for the lifetime of Red Hat Enterprise Linux 8, it is becoming a bit old for some use cases.

For developers who need the new Python features—and who can live with the inevitable compatibility-breaking changes—Red Hat Enterprise Linux 8.2 also includes Python 3.8. Besides providing new features, packaging Python 3.8 with RHEL 8.2 allows us to release performance and packaging improvements more quickly than we could in the rock-solid python3 module.

This article focuses on one specific performance improvement in the python38 package. As we’ll explain, Python 3.8 is built with the -fno-semantic-interposition flag of the GNU Compiler Collection (GCC). This flag disables semantic interposition, which can increase run speed by as much as 30%.

Note: The python38 package joins other Python interpreters shipped in RHEL 8.2, including the python2 and python3 packages (which we described in a previous article, Python in RHEL 8). You can install Python 3.8 alongside the other Python interpreters so that it won’t interfere with the existing Python stack.

Where have I seen this before?

Writing this article feels like taking credit for others’ achievements. So, let us set this straight: The performance improvements we’re discussing are others’ achievements. As RHEL packagers, our role is similar to that of a gallery curator, rather than a painter: It is not our job to create features, but to seek out the best ones from the upstream Python project and combine them into a pleasing experience for developers after they’ve gone through review, integration, and testing in Fedora.

Note that we do have “painter” roles on the team. But just as fresh paint does not belong in an exhibition hall, original contributions go to the broader community first and only appear in RHEL when they’re well-tested (that is, somewhat boring and obvious).

The discussions leading to the change we describe in this article include an initial naïve proposal by Red Hat’s Python maintainers, a critique, a better idea by C expert Jan Kratochvil, and refining that idea. All of this back-and-forth happened openly on the Fedora development mailing list, with input from both Red Hatters and the wider community.

Disabling semantic interposition in Python 3.8

As we’ve mentioned, the most significant performance improvement in our RHEL 8.2 python38 package comes from building with GCC’s -fno-semantic-interposition flag enabled. It increases run speed by as much as 30%, with little change to the semantics.

How is that possible? There are a few layers to it, so let us explain.

Python’s C API

All of Python’s functionality is exposed in its extensive C API. A large part of Python’s success comes from the C API, which makes it possible to extend and embed Python. Extensions are modules written in a language like C, which can provide functionality to Python programs. A classic example is NumPy, a library written in languages like C and Fortran that manipulates Python objects. Embedding means using Python from within a larger application. Applications like Blender or GIMP embed Python to allow scripting.

Python (or more correctly, CPython, the reference implementation of the Python language) uses the C API internally: Every attribute access goes through a call to the PyObject_GetAttr function, every addition is a call to PyNumber_Add, and so on.
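To make this concrete, here is a minimal sketch that calls the same two C API functions from Python through the standard ctypes module. It assumes a CPython interpreter, and the helper names c_api_add and c_api_getattr are made up for this illustration; it is not how you would normally write Python code.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter
# and holds the GIL while each call runs.
api = ctypes.pythonapi

# Declare the C signatures: both functions take two PyObject pointers
# and return a new PyObject reference.
api.PyNumber_Add.argtypes = [ctypes.py_object, ctypes.py_object]
api.PyNumber_Add.restype = ctypes.py_object
api.PyObject_GetAttr.argtypes = [ctypes.py_object, ctypes.py_object]
api.PyObject_GetAttr.restype = ctypes.py_object


def c_api_add(a, b):
    """Illustrative helper: same result as `a + b`, via PyNumber_Add."""
    return api.PyNumber_Add(a, b)


def c_api_getattr(obj, name):
    """Illustrative helper: same result as `getattr(obj, name)`."""
    return api.PyObject_GetAttr(obj, name)
```

Calling c_api_add(2, 3) gives the same value as 2 + 3, because the interpreter's own addition ends up in the same C function.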

Python’s dynamic library

Python can be built in two modes: static, where all code lives in the Python executable, or shared, where the Python executable is linked to its dynamic library called libpython. In Red Hat Enterprise Linux, Python is built in shared mode, because applications that embed Python, like Blender, use the Python C API of libpython.

The python3.8 command is a minimalist example of embedding: It only calls the Py_BytesMain() function:

int
main(int argc, char **argv)
{
    return Py_BytesMain(argc, argv);
}

All the code lives in libpython. For example, on RHEL 8.2, the size of /usr/bin/python3.8 is just around 8 KiB, whereas the size of the /usr/lib64/ library is around 3.6 MiB.
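If you want to check how a given interpreter was built, the standard sysconfig module records the build configuration. The following is a rough sketch: the function name libpython_info is ours, and the variables it reads (Py_ENABLE_SHARED, LDLIBRARY, LIBDIR) come from the build Makefile, so they may be missing (None) on some platforms.

```python
import sysconfig


def libpython_info():
    """Report whether this interpreter was built in shared mode.

    Illustrative helper; values may be None where the build
    does not record them.
    """
    return {
        # 1 for a shared build, 0 (or None) otherwise.
        "shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # File name of the Python library the build produced.
        "library": sysconfig.get_config_var("LDLIBRARY"),
        # Directory where libraries are installed.
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    for key, value in libpython_info().items():
        print(key, "=", value)
```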

Semantic interposition

When executing a program, the dynamic loader allows you to override any symbol (such as a function) of the dynamic libraries that will be used in the program. You implement the override by setting the LD_PRELOAD environment variable. This technique is called ELF symbol interposition, and it’s enabled by default in GCC.

Note: In Clang, semantic interposition is disabled by default.

This feature is commonly used, among other things, to trace memory allocation (by overriding the libc malloc and free functions) or to change a single application’s clocks (by overriding the libc time function). Semantic interposition is implemented using a procedure linkage table (PLT). Any function that can be overridden with LD_PRELOAD is looked up in a table before it is called.
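As a small sketch of how such an override is applied to a single process, the helper below builds an environment with LD_PRELOAD set and launches a command with it. The function names and the shim path are hypothetical; you would need to build your own shared object that defines the symbols you want to override.

```python
import os
import subprocess


def preload_env(shim_path, base_env=None):
    """Return a copy of the environment with shim_path preloaded first.

    Illustrative helper; shim_path is a shared object you provide.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # The dynamic loader accepts a colon-separated list; earlier
    # entries take precedence over later ones.
    env["LD_PRELOAD"] = f"{shim_path}:{existing}" if existing else shim_path
    return env


def run_with_preload(shim_path, argv):
    """Run argv with the shim preloaded, for this one process only."""
    return subprocess.run(argv, env=preload_env(shim_path), check=False)
```

Because the variable is set only in the child's environment, the rest of the system keeps using the unmodified library functions.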

Python calls libpython functions from other libpython functions. To respect semantic interposition, all of these calls must be looked up in the PLT. While this activity does introduce some overhead, the slowdown is negligible compared to the time spent in the called functions.

Note: Python uses the tracemalloc module to trace memory allocations.

LTO and function inlining

In recent years, GCC has enhanced link-time optimization (LTO) to produce even more efficient code. One common optimization is to inline function calls, which means replacing a function call with a copy of the function’s code. Once a function call is inlined, the compiler can go even further in terms of optimizations.

However, it is not possible to inline functions that are looked up in the PLT. If the function can be swapped out entirely using LD_PRELOAD, the compiler cannot apply assumptions and optimizations based on what that function does.

GCC 5 introduced the -fno-semantic-interposition flag, which disables semantic interposition. With this flag, functions in libpython that call other libpython functions don’t have to go through the PLT indirection anymore. As a result, they can be inlined and optimized with LTO.

So, that’s what we did. We enabled the -fno-semantic-interposition flag in Python 3.8.
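One hedged way to see whether the flag was recorded for the interpreter you are running is to search the build configuration that sysconfig exposes. The helper name flag_locations is ours. An empty result does not prove the flag was absent at build time, because a distribution may pass compiler flags in ways that are not saved in these variables.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"


def flag_locations(flag=FLAG):
    """Return the names of build variables whose value mentions flag.

    Illustrative helper: it scans every string-valued configuration
    variable instead of assuming a particular one such as CFLAGS.
    """
    return sorted(
        name
        for name, value in sysconfig.get_config_vars().items()
        if isinstance(value, str) and flag in value
    )


if __name__ == "__main__":
    found = flag_locations()
    print("recorded in:", ", ".join(found) if found else "(not recorded)")
```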

Drawbacks of -fno-semantic-interposition

The main drawback of building Python with -fno-semantic-interposition enabled is that we can no longer override libpython functions using LD_PRELOAD. However, the impact is limited to libpython. It is still possible, for example, to override malloc/free from libc to trace memory allocations.

Even so, this is an incompatibility: We do not know if developers are using LD_PRELOAD with Python on RHEL 8 in a way that would break with -fno-semantic-interposition. That is why we only enabled the change in the new Python 3.8, while Python 3.6, the default python3, continues to work as before.

Performance comparison

To see the -fno-semantic-interposition optimization in practice, let’s take a look at the _Py_CheckFunctionResult() function. This function is used by Python to check whether a C function either returned a result (is not NULL) or raised an exception.

Here is the simplified C code:

PyObject*
PyErr_Occurred(void)
{
    PyThreadState *tstate = _PyRuntime.gilstate.tstate_current;
    return tstate->curexc_type;
}

PyObject*
_Py_CheckFunctionResult(PyObject *callable, PyObject *result,
                        const char *where)
{
    int err_occurred = (PyErr_Occurred() != NULL);
    ...
}

Assembly code with semantic interposition enabled

Let’s first take a look at Python 3.6 in Red Hat Enterprise Linux 7, which has not been built with -fno-semantic-interposition. Here is an extract of the assembly code (as printed by GDB’s disassemble command):

Dump of assembler code for function _Py_CheckFunctionResult:
callq  0x7ffff7913d50 <PyErr_Occurred@plt>

As you can see, _Py_CheckFunctionResult() calls PyErr_Occurred(), and the call has to go through a PLT indirection.

Assembly code with semantic interposition disabled

Now let’s look at an extract of the same assembly code after disabling semantic interposition:

Dump of assembler code for function _Py_CheckFunctionResult:
mov 0x40f7fe(%rip),%rcx # rcx = &_PyRuntime
mov 0x558(%rcx),%rsi    # rsi = tstate = _PyRuntime.gilstate.tstate_current
mov 0x58(%rsi),%rdi     # rdi = tstate->curexc_type

In this case, GCC inlined the PyErr_Occurred() function call. As a result, _Py_CheckFunctionResult() gets the tstate directly from _PyRuntime and then reads its member tstate->curexc_type directly. There is no function call and no PLT indirection, which makes the code faster.

Note: In more complex situations, the GCC compiler is free to optimize the inlined function even more, according to the context in which it is called.

Try it for yourself!

In this article, we focused on one specific improvement on the performance side, leaving new features to the upstream documents What’s New In Python 3.7 and What’s New In Python 3.8. If you are intrigued by the new compiler performance possibilities in Python 3.8, grab the python38 package from the Red Hat Enterprise Linux 8 repository and try it out. We hope you will enjoy the run speed-up, as well as a host of other new features that you will discover for yourself.
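If you want a quick, rough comparison of your own, the sketch below times a tiny workload with the standard timeit module. The workload (attribute lookups plus an addition) and the names Point and best_time are our own choices for illustration. A microbenchmark like this says little about real applications, so treat the numbers as a hint, not a measurement of the 30% figure.

```python
import sys
import timeit

# Hypothetical workload: two attribute lookups and one addition, which
# go through PyObject_GetAttr and PyNumber_Add inside the interpreter.
SETUP = """
class Point:
    def __init__(self):
        self.x = 1
        self.y = 2
p = Point()
"""
STMT = "p.x + p.y"


def best_time(number=1_000_000, repeat=5):
    """Return the best observed time per loop, in seconds."""
    timings = timeit.repeat(STMT, setup=SETUP, number=number, repeat=repeat)
    # The minimum is the run least disturbed by other system activity.
    return min(timings) / number


if __name__ == "__main__":
    version = sys.version.split()[0]
    print(f"Python {version}: {best_time() * 1e9:.1f} ns per loop")
```

Save the script and run it once with the python3 command and once with python3.8 on the same machine, then compare the two lines of output.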