Python's easy-to-learn syntax and rich standard library, combined with the large number of open source software packages available on the Python Package Index (PyPI), make it a common programming language of choice for quick prototyping leading to production systems. Python is a good fit for many use cases, and is particularly popular in the data science domain for data exploration and analysis.
Thus, Python's rapid rise on the TIOBE Index of the most popular programming languages shouldn't be a surprise. PyPI hosts more than 3 million releases of Python packages. Each package release has metadata associated with it, which makes the packages themselves an interesting dataset to explore and experiment with.
In this article, you'll learn how to extract metadata and dependency information from Python package releases. You'll also see how this process works in Project Thoth, which provides Python programmers with information about support for the packages they use, along with the dependencies, performance, and security of those packages.
Python package releases and PyPI
The bar chart in Figure 1 shows the number of Python package releases on PyPI from March 2005 to mid-July 2021, with each bar representing one month. As you can see, the number of package releases is growing more or less exponentially.
As its name suggests, the Python Package Index really is an index of software packages (as an example, see the links for Flask releases). A simple artifact listing has its pros and cons. One advantage is that artifacts are easy to serve from a self-hosted Python package index or mirror. If you run a simple HTTP server that exposes content conforming to the Simple Repository API (Python Enhancement Proposal 503), all the Python client tooling, such as pip, can use your self-hosted index and install packages from your server without further changes. A downside of this approach is the lack of additional package metadata, especially dependency information.
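To make the Simple Repository API more concrete, here is a minimal sketch that normalizes a project name the way PEP 503 describes and lists the artifacts linked from a project page. The page content and the example.invalid URLs are made up for illustration, and the SimpleIndexParser class is a hypothetical helper, not part of pip.

```python
import re
from html.parser import HTMLParser


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


class SimpleIndexParser(HTMLParser):
    """Collect (link text, href) pairs from the anchors of a project page."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only text that sits inside an anchor is an artifact name.
        if self._href is not None and data.strip():
            self.links.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


# Hypothetical page content, shaped like a PEP 503 project page.
SAMPLE_PAGE = """
<!DOCTYPE html>
<html><body>
<a href="https://example.invalid/packages/Flask-2.0.2.tar.gz#sha256=abc">Flask-2.0.2.tar.gz</a>
<a href="https://example.invalid/packages/Flask-2.0.2-py3-none-any.whl#sha256=def">Flask-2.0.2-py3-none-any.whl</a>
</body></html>
"""

parser = SimpleIndexParser()
parser.feed(SAMPLE_PAGE)
artifact_names = [name for name, _ in parser.links]
print(artifact_names)
```

Note that the listing exposes only file names and links, which illustrates why dependency information cannot be read from the index page itself.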
Why collecting Python dependency information is challenging
Dustin Ingram, a PyPI maintainer, wrote about the challenges of collecting Python dependency information in "Why PyPI doesn't know your project's dependencies." In short, a Python source distribution executes code at installation time, and that code is supposed to provide the information about dependencies. Because the dependency listing is not provided statically, but is the result of arbitrary code execution, dependencies can be specific to the installation-script logic. This makes it possible to compute dependencies at installation time and to express them dynamically. On the other hand, the behavior is generally unpredictable and can cause headaches when you try to obtain dependency information for a package release.
Note: Dependencies are generally computed based on the runtime environment where the installation process executes arbitrary code. As a result, malicious Python package releases can abuse the installation process to steal environment information or perform other malicious actions at installation time.
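As an illustration of dependencies computed at installation time, the following sketch shows the kind of logic an installation script might contain. The function name and the package names are hypothetical; the point is only that the resulting list differs from one environment to another.

```python
import sys


def compute_install_requires(python_version=None, platform=None):
    """Return a dependency list that depends on the running environment."""
    python_version = python_version or sys.version_info[:2]
    platform = platform or sys.platform

    requires = ["requests"]  # hypothetical unconditional dependency
    if python_version < (3, 8):
        # Backport that is needed only on older interpreters.
        requires.append("importlib-metadata")
    if platform == "win32":
        requires.append("colorama")
    return requires


# In a real setup.py, the result would typically be passed along as
# setuptools.setup(..., install_requires=compute_install_requires()).
print(compute_install_requires((3, 7), "linux"))
print(compute_install_requires((3, 9), "win32"))
```

The same release therefore reports different dependencies depending on where the code runs, which is why a static index cannot know them in advance.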
Recent changes to Python packaging standards have shifted away from providing dependency information during installation, and toward exposing it statically in built wheels (PEP 427). Newer Python package releases often follow this trend, but Python packaging and tooling also try to stay backward compatible as much as possible. For a more in-depth explanation, see "Python packaging: Why don’t you just...?", a presentation from Tzu-ping Chung, one of the Python packaging maintainers.
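Because a wheel is a ZIP archive that carries its metadata in a .dist-info/METADATA file, the static dependency listing can be read without executing any package code. The sketch below builds a tiny, made-up wheel in memory so that it runs offline; read_wheel_requirements is a hypothetical helper name.

```python
import io
import zipfile
from email.parser import Parser


def read_wheel_requirements(wheel_file):
    """Return (name, version, requires_dist) from a wheel without installing it."""
    with zipfile.ZipFile(wheel_file) as archive:
        metadata_path = next(
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = archive.read(metadata_path).decode("utf-8")
    # Core metadata uses an email-header-like format.
    message = Parser().parsestr(text)
    return message["Name"], message["Version"], message.get_all("Requires-Dist") or []


# Build a tiny, hypothetical wheel in memory so the sketch runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo_pkg-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\n"
        "Name: demo-pkg\n"
        "Version: 1.0\n"
        "Requires-Dist: click (>=7.1.2)\n"
        'Requires-Dist: colorama ; sys_platform == "win32"\n',
    )
buffer.seek(0)

name, version, requires = read_wheel_requirements(buffer)
print(name, version, requires)
```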
How Thoth collects dependency information
A Python package release can provide multiple built distributions in addition to a source distribution. These builds target different environments and respect Python's compatibility tags for built distributions (PEP 425). It's up to pip (or whatever installer you choose) to select the correct built distribution for the environment in which the installer is running. The tags can specify the ABI, platform, or other requirements for the target environment, as discussed in the PEP 425 documentation. If none of the built distributions matches the target environment, the installer can fall back to installing the release from a source distribution, if one is provided. This process might place additional requirements on the target environment, such as a compatible build toolchain if the source distribution requires building native extensions.
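The selection step can be sketched roughly as follows. This is a simplified illustration of the PEP 425 idea, not pip's actual implementation, which handles many more cases; the function names and the file names are hypothetical.

```python
from itertools import product


def wheel_tags(filename):
    """Expand the compressed tag set in a wheel filename into individual tags."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # The last three components are the Python tag, ABI tag, and platform tag;
    # an optional build tag may precede them.
    python_tags, abi_tags, platform_tags = (part.split(".") for part in parts[-3:])
    return set(product(python_tags, abi_tags, platform_tags))


def pick_artifact(filenames, supported_tags):
    """Return the best matching wheel, else a source distribution, else None.

    `supported_tags` is ordered from most to least preferred.
    """
    wheels = [f for f in filenames if f.endswith(".whl")]
    for tag in supported_tags:
        for wheel in wheels:
            if tag in wheel_tags(wheel):
                return wheel
    # No built distribution matches: fall back to a source distribution.
    sdists = [f for f in filenames if f.endswith((".tar.gz", ".zip"))]
    return sdists[0] if sdists else None


files = [
    "demo-1.0-cp38-cp38-manylinux1_x86_64.whl",
    "demo-1.0-py3-none-any.whl",
    "demo-1.0.tar.gz",
]
linux_py38 = [("cp38", "cp38", "manylinux1_x86_64"), ("py3", "none", "any")]
print(pick_artifact(files, linux_py38))
print(pick_artifact(files, [("cp39", "cp39", "win_amd64")]))
```

In the second call, no wheel matches the environment, so the sketch falls back to the source distribution, which is the case where a build toolchain may be needed.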
To streamline the whole process, Project Thoth offers a component that reuses the logic pip applies to perform these actions. This component, thoth-solver, is a Python application designed primarily to run in containerized environments. The thoth-solver component installs a Python package at the specified version from the desired Python package index, letting pip decide which Python artifact should be installed into the environment where thoth-solver runs. Naturally, this can involve building packages from source distributions as necessary. Once the package is installed using pip's logic, thoth-solver extracts metadata from the installed artifact, together with additional information about the thoth-solver run itself.
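The general idea of reading metadata from an installed artifact can be sketched with the standard library's importlib.metadata module. This is not thoth-solver's code, and the report layout below is hypothetical rather than thoth-solver's actual schema. To keep the sketch runnable offline, it fakes an installed package by writing a minimal .dist-info directory; for a package that is really installed, metadata.distribution("flask") would be the usual entry point.

```python
import json
import platform
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def describe_distribution(dist):
    """Build a JSON-serializable report for one installed distribution.

    The report layout is hypothetical and is not thoth-solver's actual schema.
    """
    return {
        "package": {
            "name": dist.metadata["Name"],
            "version": dist.version,
            "requires_dist": dist.requires or [],
        },
        "environment": {
            "python_version": platform.python_version(),
            "implementation": sys.implementation.name,
            "platform": sys.platform,
        },
    }


# Fake an installed package by writing a minimal .dist-info directory.
with tempfile.TemporaryDirectory() as tmp:
    dist_info = Path(tmp) / "demo_pkg-1.0.dist-info"
    dist_info.mkdir()
    (dist_info / "METADATA").write_text(
        "Metadata-Version: 2.1\nName: demo-pkg\nVersion: 1.0\n"
        "Requires-Dist: click (>=7.1.2)\n",
        encoding="utf-8",
    )
    report = describe_distribution(metadata.Distribution.at(dist_info))

print(json.dumps(report, indent=2))
```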
The result is a JSON document containing information about the artifact together with the environment in which the solver runs, Python-specific entries (such as hashes of files), and Python's core metadata. It may also include additional dependency information, such as details about version range specifications, versions matching the version range specifications of dependencies, extras, or environment markers, along with evaluation results specifically tailored for the containerized environment (see PEP 508 for more information). Thoth can obtain this information from multiple Python package indexes that host artifacts analyzed by thoth-solver, as well as dependencies for artifacts hosted on other indexes (for example, AVX2-enabled builds of TensorFlow hosted on the AI Center of Excellence index). This procedure and the aggregated data allow Thoth to check how packages form dependencies across different Python package indexes for cross-index Python package resolution.
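To show what the version ranges, extras, and environment markers mentioned above look like in a dependency entry, here is a simplified sketch that splits a PEP 508 requirement string into its parts. The regular expression is an assumption made for illustration and does not cover the full grammar (URLs, for instance); in practice, a dedicated library such as packaging is the safer choice, and it can also evaluate markers.

```python
import re

# Simplified pattern: real PEP 508 parsing has more cases.
_REQUIREMENT = re.compile(
    r"""
    ^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)   # project name
    \s*(?:\[(?P<extras>[^\]]*)\])?              # optional extras
    \s*\(?(?P<specifier>[^;()]*)\)?             # optional version range
    \s*(?:;\s*(?P<marker>.*))?$                 # optional environment marker
    """,
    re.VERBOSE,
)


def parse_requirement(requirement):
    """Split a requirement string into name, extras, specifier, and marker."""
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError(f"Cannot parse requirement: {requirement!r}")
    extras = match.group("extras") or ""
    return {
        "name": match.group("name"),
        "extras": sorted(e.strip() for e in extras.split(",") if e.strip()),
        "specifier": match.group("specifier").strip(),
        "marker": (match.group("marker") or "").strip() or None,
    }


for line in (
    "Werkzeug (>=2.0)",
    'python-dotenv ; extra == "dotenv"',
    'requests[security,socks]>=2.8.1; python_version < "3.8"',
):
    print(parse_requirement(line))
```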
Note: If a given package is not installable into the containerized environment (due to incompatibilities between Python 2 and 3, or a missing build toolchain, for example), thoth-solver reports information about the failure that can be further post-processed to extract relevant details and classify the error.
To see how thoth-solver works in practice, take a look at this example output from a thoth-solver run for Flask version 2.0.2, available from PyPI. The result gives information about dependencies for flask==2.0.2 when installed, at the given point in time, into a containerized Red Hat Universal Base Image (UBI) environment based on Red Hat Enterprise Linux 8 and running Python 3.8. The containerized environment is available on Quay as solver-rhel-8-py38.
The thoth-solver component is part of Project Thoth's cloud-based Python resolver. Running as part of Thoth's background data aggregation, it collects dependency information and makes it available to Thoth's resolver. The Thoth team provides multiple thoth-solver containerized environments, and the built container images are available on Quay. Each image computes dependency information specifically for its target environment (a reproducible environment with a predefined software stack), one Python package release at a time.
Keep in mind that the computed dependency information is specific to the point in time when thoth-solver runs. As packages get new releases, another component in Thoth, the revsolver (or "reverse solver"), can keep the dependency information up to date. The revsolver uses data that thoth-solver has already computed and that is available in queryable form in Thoth's database. It does not download any artifacts. Instead, it uses the already captured dependency graph to propagate information about the new package release, which then becomes part of the ecosystem's updated dependency graph in the database.
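The reverse-solving idea can be illustrated with a small sketch: given dependency edges that were captured earlier, find the dependents whose recorded version range accepts a newly published release, without downloading anything. The edge format, the package names, and the simplified version comparison (dotted numeric versions only) are all assumptions made for this example and do not reflect how revsolver or Thoth's database are implemented.

```python
import operator

_OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def _version_key(version, width=4):
    """Turn '2.1' into (2, 1, 0, 0) so versions compare as padded tuples."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, specifier):
    """Check a dotted numeric version against a comma-separated range."""
    for clause in filter(None, (c.strip() for c in specifier.split(","))):
        # Try two-character operators before one-character ones.
        symbol = clause[:2] if clause[:2] in _OPERATORS else clause[:1]
        bound = clause[len(symbol):]
        if not _OPERATORS[symbol](_version_key(version), _version_key(bound)):
            return False
    return True


# Hypothetical edges already captured by earlier solver runs:
# (dependent, dependent_version, dependency, version_range)
KNOWN_EDGES = [
    ("app-a", "2.0", "lib-x", ">=2.0"),
    ("app-a", "1.1", "lib-x", ">=0.15,<2.0"),
    ("app-a", "2.0", "lib-y", ">=7.1.2"),
]


def propagate_release(edges, package, new_version):
    """Return the dependents whose recorded range accepts the new release."""
    return [
        (dependent, dependent_version)
        for dependent, dependent_version, dependency, version_range in edges
        if dependency == package and satisfies(new_version, version_range)
    ]


print(propagate_release(KNOWN_EDGES, "lib-x", "2.1.0"))
print(propagate_release(KNOWN_EDGES, "lib-x", "1.0.1"))
```

Only the stored ranges are consulted, so a new release can be linked into the graph without installing or building anything.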
About Project Thoth
As part of Project Thoth, we are accumulating knowledge to help Python developers create healthy applications. If you would like to follow updates, feel free to subscribe to our YouTube channel or follow us on the @ThothStation Twitter handle.
To send us feedback or get involved in improving the Python ecosystem, please reach out through the Thoth Station support repository. You can also contact the Thoth team directly on Twitter. You can report any issues you've spotted in open source Python libraries to the support repository, or write prescriptions for the resolver and send them to our prescriptions repository. By participating in these ways, you can help Thoth's cloud-based Python resolver come up with better recommendations.