Python became popular as a casual scripting language but has since moved into the corporate space, where it is used for data science and machine learning applications, among others. Because Python is a high-level programming language, developers often use it to quickly prototype applications. Python native extensions make it easy to optimize any computation-intensive parts of the application using a lower-level programming language like C or C++.
For applications that need to scale, we can use Python Source-to-Image tooling (S2I) to convert a Python application into a container image. That image can then be orchestrated and scaled using cluster orchestrators such as Kubernetes or Red Hat OpenShift. All of these features together provide a convenient platform for solving problems using Python-based solutions that scale, are maintainable, and are easily extensible.
The main source of open source Python packages is the community-run Python Package Index (PyPI). As of this writing, PyPI hosts more than 3 million releases, and the number of releases available continues to grow exponentially. PyPI's growth is an indicator of Python's popularity worldwide.
However, Python's community-driven dependency resolvers were not designed for corporate environments, and that has led to dependency management issues and vulnerabilities in the Python ecosystem. This article describes some of the risks involved in resolving Python dependencies and introduces Project Thoth's tools for avoiding them.
Dependency management in Python
The Python package installer, pip, is a popular tool for resolving Python application dependencies. Unfortunately, pip does not provide a way to manage lock files for application dependencies. Pip resolves dependencies to the latest possible versions at the given point in time, so the result depends heavily on when the resolution process was triggered. Dependency problems such as overpinning (requesting too narrow a range of versions) frequently introduce issues to the Python application stack.
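The time dependence described above can be illustrated with a small sketch. It is not how pip is implemented: the release history, dates, and function names are made up for illustration, and the version parser handles only plain dotted numbers.

```python
from datetime import date

# Hypothetical release history of an imaginary package (made-up data).
RELEASES = {
    "1.0.0": date(2021, 1, 10),
    "1.1.0": date(2021, 6, 1),
    "2.0.0": date(2022, 3, 15),
}


def parse_version(version):
    """Simplified parser: '1.10.0' -> (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_latest(releases, as_of):
    """Return the newest version that was already published on the given day."""
    available = [v for v, published in releases.items() if published <= as_of]
    if not available:
        raise LookupError("no release was available at that time")
    return max(available, key=parse_version)


def write_lock(releases, as_of):
    """Record the resolved version so that later installs can reuse it."""
    return {
        "version": resolve_latest(releases, as_of),
        "resolved_on": as_of.isoformat(),
    }


# The same requirement resolves differently depending on the day it is resolved.
early = resolve_latest(RELEASES, date(2021, 7, 1))
late = resolve_latest(RELEASES, date(2022, 4, 1))

# A lock entry freezes the earlier result for reproducible installs.
lock = write_lock(RELEASES, date(2021, 7, 1))
```

Without a lock entry, two developers who install the same requirement months apart can end up with different major versions of the same dependency.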
To address lock file management issues, the Python community developed tools such as pip-tools, Pipenv, and Poetry. (Our article introducing micropipenv includes an overview of these projects.)
The Python Package Index is the primary index consulted by pip. In some cases, applications need libraries from other Python package indexes. For these, pip provides the --index-url and --extra-index-url options. Most of the time, there are two primary reasons you might need to install dependencies from Python package sources other than PyPI:
- Installing specific builds of packages whose features cannot be expressed using wheel tags, or that do not meet manylinux standards; e.g., the AVX2-enabled builds of TensorFlow hosted on the Python package index of the Artificial Intelligence Center of Excellence (AICoE).
- Installing packages that should not be hosted on PyPI, such as packages specific to one company or patched versions of libraries used only for testing.
Why Python is vulnerable to dependency confusion attacks
The pip options --index-url and --extra-index-url provide a way to specify alternate Python package indexes for resolving and installing Python packages. The first option, --index-url, specifies the main Python package index for resolving Python packages, and defaults to PyPI. When you need additional package indexes, you can include the --extra-index-url option as many times as needed. Despite the naming, pip does not give the main index priority over the extra ones: it collects candidate releases from all configured indexes and selects the best match, usually the highest compatible version, regardless of which index serves it.
Thus, you cannot control which index a given package comes from, because the configuration is not specified for each package individually. Moreover, the index configuration also applies to transitive dependencies introduced by direct dependencies.
To work around this limitation, application developers can manage requirements with hashes that are checked during installation to differentiate releases. This solution is unintuitive and error-prone, however. Although we encourage keeping hashes in lock files for integrity checks, they should be managed automatically using the appropriate tools.
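The integrity check described above can be sketched with the standard library. pip exposes a comparable check through its hash-checking mode (--require-hashes). The artifact contents and helper names below are assumptions made for illustration, not part of any tool's API.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_hashes) -> bool:
    """Accept the artifact only if its digest is listed in the lock file."""
    return sha256_digest(data) in expected_hashes


# Stand-ins for two different builds published under the same name and version.
trusted_artifact = b"contents of the trusted foo wheel"
other_artifact = b"contents of a different foo wheel"

# A lock file would store the digest recorded when the dependency was resolved.
locked_hashes = {sha256_digest(trusted_artifact)}

trusted_ok = verify_artifact(trusted_artifact, locked_hashes)
other_ok = verify_artifact(other_artifact, locked_hashes)
```

Because the digest depends on the file contents rather than on the package name or version, a same-named release from another index fails the check. Maintaining these digests by hand is exactly the error-prone part that lock file tools automate.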
Now, let’s imagine a dependency named foo that a company uses on a private package index. Suppose a different package with the same name is hosted on PyPI. An unexpected glitch, such as a temporary network issue when resolving the company private package index, could lead the application to import the foo package from PyPI in default setups. In the worst case, the package published on PyPI might be a malicious alternative that reveals company secrets to an attacker.
This issue also applies to pip-tools, Pipenv, and Poetry. Pipenv provides a way to configure a Python package index for a specific package, but it does not enforce the specified configuration. All of these dependency resolution tools treat the supplied Python package indexes as mirrors of one another.
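To make the consequence of treating indexes as mirrors concrete, here is a minimal simulation. It is a sketch, not the code of any of the tools mentioned: the private index URL, the index contents, and the function names are hypothetical, and the version parser handles only plain dotted numbers.

```python
def parse_version(version):
    """Simplified parser: '1.10.0' -> (1, 10, 0). Real tools follow PEP 440."""
    return tuple(int(part) for part in version.split("."))


PRIVATE_INDEX = "https://private.example.com/simple"  # hypothetical URL
PYPI = "https://pypi.org/simple"

# Made-up index contents: the company hosts foo privately, and an unrelated
# package with the same name exists on PyPI.
INDEXES = {
    PRIVATE_INDEX: {"foo": ["1.0.0", "1.2.0"]},
    PYPI: {"foo": ["0.1.0"]},
}


def resolve_as_mirrors(package, indexes, unreachable=()):
    """Pool candidates from every reachable index and pick the highest version."""
    candidates = [
        (version, url)
        for url, packages in indexes.items()
        if url not in unreachable
        for version in packages.get(package, [])
    ]
    if not candidates:
        raise LookupError(f"{package} not found on any reachable index")
    return max(candidates, key=lambda candidate: parse_version(candidate[0]))


# Normal day: the private release happens to have the highest version.
normal = resolve_as_mirrors("foo", INDEXES)

# Network glitch: the private index is unreachable, so PyPI silently wins.
outage = resolve_as_mirrors("foo", INDEXES, unreachable={PRIVATE_INDEX})

# An attacker publishes a higher version on PyPI: no glitch is needed at all.
hijacked_indexes = {
    PRIVATE_INDEX: INDEXES[PRIVATE_INDEX],
    PYPI: {"foo": ["0.1.0", "99.0.0"]},
}
hijacked = resolve_as_mirrors("foo", hijacked_indexes)
```

In both failure cases the resolver reports success, which is what makes dependency confusion hard to notice: nothing in the result says that foo came from an index the company never intended to use for it.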
Using Thoth to resolve dependency confusion
Thoth is a project sponsored by Red Hat that takes a fresh look at the complex needs of Python applications and moves the resolution process to the cloud. Naturally, being cloud-based has its advantages and disadvantages depending on how the tool is used.
Because Thoth moves dependency resolution to the cloud, a central authority can resolve application requirements. This central authority can be configured with fine-grained control over which application dependencies go into desired environments. For instance, you could handle dependencies in test environments and production environments differently.
Thoth's resolver pre-aggregates information about Python packages from various Python package indexes. This way, the resolver can monitor Python packages published on PyPI, on the AICoE-specific TensorFlow index, on a corporate Pulp Python index, on the PyTorch CUDA 11.1 index, and on builds for CPU use, which the PyTorch community provides for specific cases. Moreover, the cloud-based resolver introspects the published packages with respect to security or vulnerabilities (see PyPA’s Python Packaging Advisory Database) to additionally guide a secure resolution process.
Note: Please contact the Thoth team if you wish to register your own Python package index to Thoth.
Solver rules in Thoth
A central authority can be configured to allow or block packages or specific package releases that are hosted on the Python package indexes. This feature is called solver rules and is maintained by a Thoth operator.
Note: See Configuring solver rules in the Thoth documentation for more about this topic. Also check out our YouTube video demonstrating solver rules.
Solver rules let the Thoth operator specify which Python packages or specific releases can be considered during the resolution process, respecting the Python package indexes registered when a request is made to the cloud-based resolver. Solver rules can also block the analysis of packages that are considered too old, are no longer supported, or simply don't adhere to company policies.
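The idea of blocking packages or specific releases per index can be sketched as a filter over resolution candidates. This is not Thoth's rule format or API: the Candidate and BlockRule types, the index URLs, and the sample data are all hypothetical.

```python
from typing import NamedTuple, Optional

PRIVATE_INDEX = "https://private.example.com/simple"  # hypothetical URL
PYPI = "https://pypi.org/simple"


class Candidate(NamedTuple):
    name: str
    version: str
    index_url: str


class BlockRule(NamedTuple):
    """Hypothetical rule shape; None acts as a wildcard for that field."""
    name: str
    index_url: Optional[str] = None
    version: Optional[str] = None


def is_blocked(candidate, rules):
    """A candidate is blocked if any rule matches its name, index, and version."""
    return any(
        rule.name == candidate.name
        and rule.index_url in (None, candidate.index_url)
        and rule.version in (None, candidate.version)
        for rule in rules
    )


def apply_rules(candidates, rules):
    """Keep only the candidates that the resolver is allowed to consider."""
    return [c for c in candidates if not is_blocked(c, rules)]


candidates = [
    Candidate("foo", "1.2.0", PRIVATE_INDEX),
    Candidate("foo", "99.0.0", PYPI),
    Candidate("bar", "0.9.0", PYPI),
    Candidate("bar", "1.0.0", PYPI),
]

rules = [
    BlockRule("foo", index_url=PYPI),  # foo must never come from PyPI
    BlockRule("bar", version="0.9.0"),  # one unsupported release of bar
]

allowed = apply_rules(candidates, rules)
```

Because the rules are applied before resolution picks a version, a same-named package on a public index never becomes a candidate in the first place.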
Note: Report issues with open source Python packages to help us create new solver rules.
Strict index configuration
Another feature in Thoth is the ability to configure a strict Python package index configuration. By default, the recommendation engine considers all the packages published on the indexes it monitors and uses a reinforcement learning algorithm to come up with a set of packages that are considered most appropriate. However, in some situations, Thoth users want to suppress this behavior and explicitly configure Python package indexes for consuming Python packages on their own.
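One possible reading of a strict index configuration is sketched below: each package is bound to exactly one index, and resolution fails instead of falling back to another index. This illustrates the concept only and is not Thoth's implementation. The URLs, data, and names are hypothetical.

```python
def parse_version(version):
    """Simplified parser: '1.10.0' -> (1, 10, 0). Real tools follow PEP 440."""
    return tuple(int(part) for part in version.split("."))


PRIVATE_INDEX = "https://private.example.com/simple"  # hypothetical URL
PYPI = "https://pypi.org/simple"

# Made-up index contents with a same-named package on both indexes.
INDEXES = {
    PRIVATE_INDEX: {"foo": ["1.0.0", "1.2.0"]},
    PYPI: {"foo": ["99.0.0"]},
}

# Strict configuration: every package names the single index it may come from.
PACKAGE_INDEX = {"foo": PRIVATE_INDEX}


class StrictIndexError(LookupError):
    """Raised when a package cannot be served by its configured index."""


def resolve_strict(package, indexes, package_index, unreachable=()):
    """Resolve a package only from its configured index, with no fallback."""
    index_url = package_index.get(package)
    if index_url is None:
        raise StrictIndexError(f"no index configured for {package}")
    if index_url in unreachable:
        raise StrictIndexError(f"{index_url} is unreachable")
    versions = indexes.get(index_url, {}).get(package, [])
    if not versions:
        raise StrictIndexError(f"{package} is not available on {index_url}")
    return max(versions, key=parse_version), index_url


resolved = resolve_strict("foo", INDEXES, PACKAGE_INDEX)

# During an outage the resolution fails loudly rather than switching to PyPI.
try:
    resolve_strict("foo", INDEXES, PACKAGE_INDEX, unreachable={PRIVATE_INDEX})
    outage_failed = False
except StrictIndexError:
    outage_failed = True
```

The design choice here is to prefer a failed build over a silently substituted package: the higher version on PyPI is never considered because foo is not configured to come from that index.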
Note: If you are interested in the strict index configuration, please browse the documentation and watch our video demonstration.
Prescriptions
Thoth also supports a mechanism called prescriptions that provides additional, detailed guidelines for package resolution. Prescriptions are analogous to manifests in Kubernetes and OpenShift. A manifest lists the desired state of the cluster, and the machinery behind the cluster orchestrator tries to create and maintain the desired state. Similarly, prescriptions provide a declarative way to specify the resolution process for the particular dependencies and Python package indexes used.
Note: See the prescriptions section in the Thoth documentation for more about this feature. You can also browse Thoth's prescriptions repository for prescriptions available for open source Python projects. See our article about prescriptions for more insight into this concept.
Thoth's reinforcement learning algorithm searches for a solution that satisfies application requirements, taking prescriptions into account. This approach lets users adjust the resolution process to fit their needs. Adjustments are made using labeled requests to the resolver, which can pick prescriptions, written in YAML files, that match the specified criteria. For example, a prescription can restrict resolution to packages from a single package index (such as a Python package index hosted using Pulp) whose packages Thoth users consider trusted.
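Label-based selection of prescriptions can be sketched as follows. The dictionary layout, the label names, and the index URLs are assumptions made for illustration; they do not reproduce Thoth's prescription schema, which is defined in YAML and described in the Thoth documentation.

```python
PULP_INDEX = "https://pulp.example.com/simple"  # hypothetical trusted index
PYPI = "https://pypi.org/simple"

# Hypothetical prescriptions: each one applies when all of its labels match.
PRESCRIPTIONS = [
    {
        "name": "trusted-index-only",
        "match_labels": {"environment": "production"},
        "allowed_indexes": [PULP_INDEX],
    },
    {
        "name": "allow-public-packages",
        "match_labels": {"environment": "test"},
        "allowed_indexes": [PULP_INDEX, PYPI],
    },
]


def select_prescriptions(request_labels, prescriptions):
    """Pick the prescriptions whose labels all appear in the request."""
    return [
        prescription
        for prescription in prescriptions
        if all(
            request_labels.get(key) == value
            for key, value in prescription["match_labels"].items()
        )
    ]


def allowed_indexes(request_labels, prescriptions):
    """Collect the indexes permitted by the matching prescriptions, in order."""
    urls = []
    for prescription in select_prescriptions(request_labels, prescriptions):
        for url in prescription["allowed_indexes"]:
            if url not in urls:
                urls.append(url)
    return urls


production = allowed_indexes({"environment": "production"}, PRESCRIPTIONS)
testing = allowed_indexes({"environment": "test"}, PRESCRIPTIONS)
```

With this shape, the same application requirements resolve against different sets of indexes depending only on the labels attached to the request, which keeps the policy declarative and out of the application's own configuration.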
About Project Thoth
As part of Project Thoth, we are accumulating knowledge to help Python developers create healthy applications. If you would like to follow project updates, please subscribe to our YouTube channel or follow us on the @ThothStation Twitter handle.
Last updated: September 20, 2023