The Python interpreter shipped with Red Hat Enterprise Linux (RHEL) 8 is version 3.6, which was released in 2016. While Red Hat is committed to supporting the Python 3.6 interpreter for the lifetime of Red Hat Enterprise Linux 8, it is becoming a bit old for some use cases.
For developers who need the new Python features (and who can live with the inevitable compatibility-breaking changes), Red Hat Enterprise Linux 8.2 also includes Python 3.8. Besides providing new features, packaging Python 3.8 with RHEL 8.2 allows us to release performance and packaging improvements more quickly than we could in the rock-solid python3 module.
This article focuses on one specific performance improvement in the python38 package. As we'll explain, Python 3.8 is built with the -fno-semantic-interposition flag of the GNU Compiler Collection (GCC). This flag disables semantic interposition, which can increase run speed by as much as 30%.
Note: The python38 package joins other Python interpreters shipped in RHEL 8.2, including the python2 and python3 packages (which we described in a previous article, Python in RHEL 8). You can install Python 3.8 alongside the other Python interpreters so that it won't interfere with the existing Python stack.
Where have I seen this before?
Writing this article feels like taking credit for others' achievements. So, let us set this straight: The performance improvements we're discussing are others' achievements. As RHEL packagers, our role is similar to that of a gallery curator, rather than a painter: It is not our job to create features, but to seek out the best ones from the upstream Python project and combine them into a pleasing experience for developers after they've gone through review, integration, and testing in Fedora.
Note that we do have "painter" roles on the team. But just as fresh paint does not belong in an exhibition hall, original contributions go to the broader community first and only appear in RHEL when they're well-tested (that is, somewhat boring and obvious).
The discussions leading to the change we describe in this article include an initial naïve proposal by Red Hat's Python maintainers, a critique, a better idea by C expert Jan Kratochvil, and refining that idea. All of this back-and-forth happened openly on the Fedora development mailing list, with input from both Red Hatters and the wider community.
Disabling semantic interposition in Python 3.8
As we've mentioned, the most significant performance improvement in our RHEL 8.2 python38 package comes from building with GCC's -fno-semantic-interposition flag enabled. It increases run speed by as much as 30%, with little change to the semantics.
How is that possible? There are a few layers to it, so let us explain.
Python's C API
All of Python's functionality is exposed in its extensive C API. A large part of Python's success comes from the C API, which makes it possible to extend and embed Python. Extensions are modules written in a language like C, which can provide functionality to Python programs. A classic example is NumPy, a library written in languages like C and Fortran that manipulates Python objects. Embedding means using Python from within a larger application. Applications like Blender or GIMP embed Python to allow scripting.
Python (or more correctly, CPython, the reference implementation of the Python language) uses the C API internally: Every attribute access goes through a call to the PyObject_GetAttr function, every addition is a call to PyNumber_Add, and so on.
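To make this concrete, here is a minimal sketch that calls the same two C API functions directly from Python through ctypes. It assumes a CPython interpreter whose C API symbols are reachable through ctypes.pythonapi; the variable names are ours and purely illustrative.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
api = ctypes.pythonapi

# Declare the signatures: both functions take two PyObject* arguments
# and return a new PyObject* reference (or NULL with an exception set).
api.PyNumber_Add.restype = ctypes.py_object
api.PyNumber_Add.argtypes = [ctypes.py_object, ctypes.py_object]
api.PyObject_GetAttr.restype = ctypes.py_object
api.PyObject_GetAttr.argtypes = [ctypes.py_object, ctypes.py_object]

# Equivalent to the Python expression: 2 + 3
total = api.PyNumber_Add(2, 3)

# Equivalent to the Python expression: "abc".upper
upper = api.PyObject_GetAttr("abc", "upper")

print(total, upper())
```

The interpreter makes these calls from C for every addition and attribute access, which is why the cost of each call matters.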
Python's dynamic library
Python can be built in two modes: static, where all code lives in the Python executable, or shared, where the Python executable is linked to its dynamic library called libpython. In Red Hat Enterprise Linux, Python is built in shared mode, because applications that embed Python, like Blender, use the Python C API of libpython.
The python3.8 command is a minimalist example of embedding: It only calls the Py_BytesMain() function:

    int
    main(int argc, char **argv)
    {
        return Py_BytesMain(argc, argv);
    }
All the code lives in libpython. For example, on RHEL 8.2, the size of /usr/bin/python3.8 is just around 8 KiB, whereas the size of the /usr/lib64/libpython3.8.so.1.0 library is around 3.6 MiB.
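If you want to check how your own interpreter was built, the following sketch reads the build-time configuration through the standard sysconfig module and compares file sizes. It assumes a POSIX build where the Py_ENABLE_SHARED, LIBDIR, and INSTSONAME configuration variables are recorded; on other platforms they may be missing, in which case the library fields stay empty. The function name describe_python_build is ours.

```python
import os
import sys
import sysconfig


def describe_python_build():
    """Report whether this interpreter uses a shared libpython, with file sizes."""
    shared = bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))
    report = {
        "shared": shared,
        "executable_bytes": None,
        "library_path": None,
        "library_bytes": None,
    }

    # sys.executable can be empty when Python is embedded in another program.
    if sys.executable and os.path.exists(sys.executable):
        report["executable_bytes"] = os.path.getsize(os.path.realpath(sys.executable))

    # LIBDIR and INSTSONAME are only recorded on POSIX builds.
    libdir = sysconfig.get_config_var("LIBDIR")
    soname = sysconfig.get_config_var("INSTSONAME")
    if shared and libdir and soname:
        candidate = os.path.join(libdir, soname)
        if os.path.exists(candidate):
            report["library_path"] = candidate
            report["library_bytes"] = os.path.getsize(os.path.realpath(candidate))
    return report


report = describe_python_build()
print(report)
```

On a shared build, you should see a small executable next to a much larger library, although the exact sizes depend on the distribution and version.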
Semantic interposition
When a program starts, the dynamic loader lets you override any symbol (such as a function) exported by the dynamic libraries that the program uses. You request the override by pointing the LD_PRELOAD environment variable at a library that defines the replacement symbol. This technique is called ELF symbol interposition, and GCC honors it by default.
Note: In Clang, semantic interposition is disabled by default.
This feature is commonly used, among other things, to trace memory allocation (by overriding the libc malloc and free functions) or to change a single application's clocks (by overriding the libc time function). Semantic interposition is implemented using a procedure linkage table (PLT). Any function that can be overridden with LD_PRELOAD is looked up in a table before it is called.
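The following sketch shows interposition in action. It is a hypothetical demonstration, not something from the Python or RHEL tooling: It compiles a tiny library that replaces libc's getpid(), then starts a child Python with LD_PRELOAD pointing at that library. It assumes a Linux system with a C compiler available as cc or gcc, and it returns None when those assumptions do not hold.

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical replacement for libc's getpid(), used only for this demonstration.
FAKE_GETPID_C = """
#include <sys/types.h>
pid_t getpid(void) { return 42; }
"""


def run_with_fake_getpid():
    """Return the PID a child Python reports under LD_PRELOAD, or None if the demo cannot run."""
    compiler = shutil.which("cc") or shutil.which("gcc")
    if compiler is None or not sys.platform.startswith("linux"):
        return None

    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "fakepid.c")
        library = os.path.join(workdir, "libfakepid.so")
        with open(source, "w") as handle:
            handle.write(FAKE_GETPID_C)

        # Build a shared library that exports our own getpid symbol.
        build = subprocess.run(
            [compiler, "-shared", "-fPIC", "-o", library, source],
            capture_output=True,
        )
        if build.returncode != 0:
            return None

        # The dynamic loader searches the preloaded library before libc.
        env = dict(os.environ, LD_PRELOAD=library)
        child = subprocess.run(
            [sys.executable, "-c", "import os; print(os.getpid())"],
            env=env,
            capture_output=True,
            text=True,
        )
        if child.returncode != 0:
            return None
        return child.stdout.strip()


result = run_with_fake_getpid()
print(result)
```

When the preload takes effect, the child should report the fake value from the replacement library rather than its real process ID. Note that getpid() lives in libc, not in libpython, so this kind of override is unaffected by the change described in this article.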
Python calls libpython functions from other libpython functions. To respect semantic interposition, all of these calls must be looked up in the PLT. While this activity does introduce some overhead, the slowdown is negligible compared to the time spent in the called functions.
Note: Python has its own tracemalloc module for tracing memory allocations.
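As a brief illustration, here is a minimal sketch that uses tracemalloc from the standard library to measure the memory allocated by a small workload. The function name measure_allocation and the workload itself are ours.

```python
import tracemalloc


def measure_allocation(count):
    """Allocate `count` byte strings of 1000 bytes and report traced memory."""
    tracemalloc.start()
    data = [bytes(1000) for _ in range(count)]
    # Both values are in bytes: memory currently traced and the peak so far.
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(data), current, peak


n, current, peak = measure_allocation(100)
print(f"{n} objects, current={current} bytes, peak={peak} bytes")
```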
LTO and function inlining
In recent years, GCC has enhanced link-time optimization (LTO) to produce even more efficient code. One common optimization is to inline function calls, which means replacing a function call with a copy of the function's code. Once a function call is inlined, the compiler can go even further in terms of optimizations.
However, it is not possible to inline functions that are looked up in the PLT. If the function can be swapped out entirely using LD_PRELOAD, the compiler cannot apply assumptions and optimizations based on what that function does.
GCC 5.3 introduced the -fno-semantic-interposition flag, which disables semantic interposition. With this flag, functions in libpython that call other libpython functions don't have to go through the PLT indirection anymore. As a result, they can be inlined and optimized with LTO.
So, that's what we did. We enabled the -fno-semantic-interposition flag in Python 3.8.
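One way to see whether a given interpreter was built this way is to search the compiler flags that the build recorded in sysconfig. The sketch below is a heuristic under stated assumptions: The list of configuration variable names is our guess at where the flag is likely to appear, and a packager could pass the flag in a way that is not recorded there, so a False result is not conclusive.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"

# Configuration variables that commonly hold compiler flags on POSIX builds.
# This list is an assumption; some variables may be absent on a given build.
CANDIDATE_VARS = (
    "CFLAGS",
    "PY_CFLAGS",
    "PY_CFLAGS_NODIST",
    "CONFIGURE_CFLAGS",
    "CONFIGURE_CFLAGS_NODIST",
)


def flag_recorded_in_build():
    """Return True if the flag appears in any recorded compiler-flag variable."""
    for name in CANDIDATE_VARS:
        value = sysconfig.get_config_var(name)
        if value and FLAG in str(value):
            return True
    return False


recorded = flag_recorded_in_build()
print(f"{FLAG} recorded in build flags: {recorded}")
```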
Drawbacks of -fno-semantic-interposition
The main drawback of building Python with -fno-semantic-interposition enabled is that we can no longer override libpython functions using LD_PRELOAD. However, the impact is limited to libpython. It is still possible, for example, to override malloc/free from libc to trace memory allocations.
However, this is still an incompatibility: We do not know if developers are using LD_PRELOAD with Python on RHEL 8 in a way that would break with -fno-semantic-interposition. That is why we only enabled the change in the new Python 3.8, while Python 3.6 (the default python3) continues to work as before.
Performance comparison
To see the -fno-semantic-interposition optimization in practice, let's take a look at the _Py_CheckFunctionResult() function. This function is used by Python to check whether a C function either returned a result (is not NULL) or raised an exception.
Here is the simplified C code:
    PyObject*
    PyErr_Occurred(void)
    {
        PyThreadState *tstate = _PyRuntime.gilstate.tstate_current;
        return tstate->curexc_type;
    }

    PyObject*
    _Py_CheckFunctionResult(PyObject *callable, PyObject *result,
                            const char *where)
    {
        int err_occurred = (PyErr_Occurred() != NULL);
        ...
    }
Assembly code with semantic interposition enabled
Let's first take a look at Python 3.6 in Red Hat Enterprise Linux 7, which has not been built with -fno-semantic-interposition. Here is an extract of the assembly code (read by gdb's disassemble command):
    Dump of assembler code for function _Py_CheckFunctionResult:
    (...)
    callq  0x7ffff7913d50 <PyErr_Occurred@plt>
    (...)
As you can see, _Py_CheckFunctionResult() calls PyErr_Occurred(), and the call has to go through a PLT indirection.
Assembly code with semantic interposition disabled
Now let's look at an extract of the same assembly code after disabling semantic interposition:
    Dump of assembler code for function _Py_CheckFunctionResult:
    (...)
    mov    0x40f7fe(%rip),%rcx    # rcx = &_PyRuntime
    mov    0x558(%rcx),%rsi       # rsi = tstate = _PyRuntime.gilstate.tstate_current
    (...)
    mov    0x58(%rsi),%rdi        # rdi = tstate->curexc_type
    (...)
In this case, GCC inlined the PyErr_Occurred() function call. As a result, _Py_CheckFunctionResult() gets the tstate directly from _PyRuntime, and then it directly reads its member tstate->curexc_type. There is no function call and no PLT indirection, which results in faster performance.
Note: In more complex situations, the GCC compiler is free to optimize the inlined function even more, according to the context in which it is called.
Try it for yourself!
In this article, we focused on one specific improvement on the performance side, leaving new features to the upstream documents What's New In Python 3.7 and What's New In Python 3.8. If you are intrigued by the new compiler performance possibilities in Python 3.8, grab the python38 package from the Red Hat Enterprise Linux 8 repository and try it out. We hope you will enjoy the run speed-up, as well as a host of other new features that you will discover for yourself.
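If you would like a starting point for your own measurements, here is a minimal micro-benchmark sketch that you could run under both python3.6 and python3.8. The Point class and workload function are ours and purely illustrative. Keep in mind that any difference you observe also reflects other changes between the two versions, not only the compiler flag, and that real applications may behave differently from a tight loop.

```python
import sys
import timeit


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def workload(iterations=100_000):
    """Attribute reads and integer additions, which map to C API calls in libpython."""
    point = Point(1, 2)
    total = 0
    for _ in range(iterations):
        total += point.x + point.y
    return total


def best_time(repeat=5):
    """Return the fastest of several runs of the workload, in seconds."""
    return min(timeit.repeat(workload, number=1, repeat=repeat))


print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {best_time():.4f} s")
```

Taking the minimum of several runs reduces noise from other processes on the machine.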