Extracting dependencies from Python packages

January 14, 2022
Fridolin Pokorny
Related topics:
Data Science, Artificial intelligence, Containers, Linux, Python
Related products:
Red Hat Enterprise Linux


    Python's easy-to-learn syntax and rich standard library, combined with the large number of open source software packages available on the Python Package Index (PyPI), make it a common language of choice for quick prototyping that leads to production systems. Python is a good fit for many use cases, and is particularly popular in the data science domain for data exploration and analysis.

    Thus, Python's rapid rise on the TIOBE Index of the most popular programming languages shouldn't be a surprise. PyPI hosts more than 3 million releases of Python packages. Each package release has metadata associated with it, which makes the packages themselves an interesting dataset to explore and experiment with.

    In this article, you'll learn how to extract metadata and dependency information from Python package releases. You'll also see how this process works in Project Thoth, which provides Python programmers with information about support for the packages they use, along with the dependencies, performance, and security of those packages.

    Python package releases and PyPI

    The bar chart in Figure 1 shows the number of Python package releases on PyPI from March 2005 to mid-July 2021, with each bar representing one month. As you can see, the number of package releases is growing more or less exponentially.

    Figure 1. The number of Python package releases available on PyPI from March 2005 until mid-July 2021.

    As the Python Package Index's name suggests, it really is an index of software packages (as an example, see the links for Flask releases). A simple artifact listing has its pros and cons. One advantage is that artifacts are easy to serve from self-hosted Python package indexes or mirrors. If you provide a simple HTTP server whose exposed content conforms to the Simple Repository API (Python Enhancement Proposal 503), then Python client tooling, such as pip, can use your self-hosted index and install packages from your server without further changes. A downside of this approach is the lack of additional package metadata, especially dependency information.
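
    As a minimal sketch of what a client does with such an index, the following code derives the project page URL that a PEP 503 server is expected to expose. The normalization rule comes from PEP 503; the helper names are illustrative assumptions and are not part of pip.

```python
import re


def normalize_project_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    # Runs of "-", "_" and "." collapse into a single "-"; the result is lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_project_url(index_url: str, name: str) -> str:
    """Build the project page URL under a Simple Repository API root."""
    # Project pages live at <root>/<normalized-name>/ (with a trailing slash).
    return f"{index_url.rstrip('/')}/{normalize_project_name(name)}/"


if __name__ == "__main__":
    print(simple_project_url("https://pypi.org/simple", "Flask"))
    print(simple_project_url("https://pypi.org/simple", "zope.interface"))
```

    The page behind such a URL only lists links to artifacts, which is why dependency information has to be obtained some other way.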

    Why collecting Python dependency information is challenging

    Dustin Ingram, a PyPI maintainer, wrote about the challenges of collecting Python dependency information in Why PyPI doesn't know your project's dependencies. In short, Python's source distributions execute code that is supposed to provide information about dependencies at installation time. Because the dependency listing is not provided statically, but is a result of arbitrary code execution, dependencies can be specific to the installation-script logic. This allows for computing dependencies at installation time and gives the power to express dependencies dynamically. On the other hand, the behavior is generally unpredictable and can cause headaches when trying to obtain dependency information for a package release.
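
    To make the problem concrete, here is a hypothetical sketch of the kind of logic an installation script might contain. The function name and the dependency choices are assumptions made up for illustration; the point is only that the resulting list depends on where the code runs.

```python
import sys
from typing import List, Tuple


def compute_install_requires(
    python_version: Tuple[int, int] = sys.version_info[:2],
    platform: str = sys.platform,
) -> List[str]:
    """Return a dependency list that varies with the interpreter and platform."""
    requires = ["requests"]  # illustrative unconditional dependency
    if python_version < (3, 7):
        # Illustrative: pull in a backport only on older interpreters.
        requires.append("dataclasses")
    if platform == "win32":
        # Illustrative: a platform-specific dependency.
        requires.append("pywin32")
    return requires


if __name__ == "__main__":
    # A setup.py could pass this list to setuptools.setup(install_requires=...),
    # so the answer is only known after the script has been executed.
    print(compute_install_requires())
    print(compute_install_requires((3, 6), "win32"))
```

    Two machines running the same source distribution can therefore report different dependencies, and nothing short of executing the script reveals which ones.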

    Note: Dependencies are generally computed based on the runtime environment where the installation process executes arbitrary code. As a result, the installation can be used by malicious Python package releases to steal environment information or perform other malicious actions at installation time.

    Recent changes to Python packaging standards have shifted away from providing dependency information during installation, and toward exposing it statically in built wheels (PEP 427). Newer Python package releases often follow this trend, but Python packaging and tooling also try to be backward compatible as much as possible. For a more in-depth explanation, see Python packaging: Why don’t you just...?, a presentation from Tzu-ping Chung, one of the Python package maintainers.
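
    By contrast, a wheel carries its dependency list as plain text. The sketch below reads the Requires-Dist entries from a wheel's METADATA file using only the standard library, without executing any code from the package. The in-memory archive and its contents are demo data invented for this example.

```python
import io
import zipfile
from email.parser import Parser
from typing import List


def requires_dist_from_wheel(wheel_file) -> List[str]:
    """Read Requires-Dist entries from a wheel without installing or running it."""
    with zipfile.ZipFile(wheel_file) as archive:
        # PEP 427 places core metadata at <name>-<version>.dist-info/METADATA.
        metadata_path = next(
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = archive.read(metadata_path).decode("utf-8")
    # Core metadata uses an email-header-like format, so the email parser can read it.
    message = Parser().parsestr(text)
    return message.get_all("Requires-Dist") or []


def build_demo_wheel() -> io.BytesIO:
    """Create a tiny in-memory archive laid out like a wheel (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\n"
            "Name: demo\n"
            "Version: 1.0\n"
            "Requires-Dist: click (>=8.0)\n"
            'Requires-Dist: colorama ; platform_system == "Windows"\n',
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for requirement in requires_dist_from_wheel(build_demo_wheel()):
        print(requirement)
```

    The same function accepts a path to a real .whl file, because zipfile.ZipFile takes either a path or a file-like object.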

    How Thoth collects dependency information

    A Python package release can provide multiple built distributions in addition to a source distribution. These builds target different environments and carry Python's compatibility tags for built distributions (PEP 425). It's up to pip (or whatever installer you choose) to select the correct built distribution for the environment in which the installer is running. The tags can specify the ABI, the platform, or other requirements for the target environment, as discussed in the PEP 425 documentation. If none of the built distributions matches the target environment, the installer can fall back to installing the release from a source distribution, if one is provided. This process might involve additional requirements for the target environment, such as a compatible build toolchain if the source distribution requires building native extensions.
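
    The selection step can be sketched as follows. This is a simplified illustration of the idea, not pip's actual implementation: it expands the compressed tag set in each wheel filename and picks the first file that matches an ordered list of tags the environment supports. The filenames and the supported-tag list are assumptions for the example.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

Tag = Tuple[str, str, str]  # (python tag, abi tag, platform tag)


def wheel_tags(filename: str) -> Set[Tag]:
    """Expand the compressed PEP 425 tag set encoded in a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    # name-version[-build]-python-abi-platform.whl: the tags are the last three parts.
    python, abi, platform = filename[: -len(".whl")].split("-")[-3:]
    # Each part may hold several "."-separated values, e.g. "py2.py3".
    return set(product(python.split("."), abi.split("."), platform.split(".")))


def pick_wheel(filenames: List[str], supported: List[Tag]) -> Optional[str]:
    """Return the wheel matching the most preferred supported tag, if any."""
    for tag in supported:  # ordered from most to least preferred
        for filename in filenames:
            if tag in wheel_tags(filename):
                return filename
    return None  # an installer would fall back to the source distribution here


if __name__ == "__main__":
    files = [
        "demo-1.0-py3-none-any.whl",
        "demo-1.0-cp38-cp38-manylinux2014_x86_64.whl",
    ]
    supported = [("cp38", "cp38", "manylinux2014_x86_64"), ("py3", "none", "any")]
    print(pick_wheel(files, supported))
```

    Ordering the supported tags by preference is what lets a platform-specific build win over a pure-Python one when both are compatible.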

    To streamline the whole process, Project Thoth offers a component that re-uses the logic that performs these actions in pip. This component, thoth-solver, is written as a Python application that is primarily designed to run in containerized environments. The thoth-solver component installs Python packages in the specified version from the desired Python package index, by letting pip decide which Python artifact should be installed into the environment where thoth-solver runs. This naturally can involve triggering package builds out of source distributions as necessary. Once the package is installed using pip's logic, thoth-solver extracts metadata out of the installed artifact, together with additional information about the thoth-solver run itself.
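
    The extraction step can be approximated with the standard library's importlib.metadata module. The sketch below is not thoth-solver's code; it only shows how metadata can be read from an installed distribution's .dist-info directory. The demo directory and its contents are invented so that the example runs without installing anything.

```python
import tempfile
from importlib.metadata import Distribution
from pathlib import Path
from typing import Any, Dict


def describe_installed(dist_info_dir: Path) -> Dict[str, Any]:
    """Collect name, version, and declared requirements from a .dist-info directory."""
    dist = Distribution.at(dist_info_dir)
    return {
        "name": dist.metadata["Name"],
        "version": dist.version,
        # dist.requires is None when the distribution declares no requirements.
        "requires": list(dist.requires or []),
    }


def make_demo_dist_info(root: Path) -> Path:
    """Write a minimal .dist-info directory (demo data only)."""
    dist_info = root / "demo-1.0.dist-info"
    dist_info.mkdir()
    (dist_info / "METADATA").write_text(
        "Metadata-Version: 2.1\n"
        "Name: demo\n"
        "Version: 1.0\n"
        "Requires-Dist: click (>=8.0)\n",
        encoding="utf-8",
    )
    return dist_info


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(describe_installed(make_demo_dist_info(Path(tmp))))
```

    For a package that pip has already installed into the current environment, importlib.metadata.distribution("flask") returns the same kind of object by name.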

    The result is a JSON document containing information about the artifact together with the environment in which the solver runs, Python-specific entries (such as hashes of files), and Python's core metadata. It may also include additional dependency information, such as version range specifications, the versions of dependencies that match those ranges, extras, and environment markers, along with marker evaluation results specific to the containerized environment (see PEP 508 for more information). Thoth can obtain this information from multiple Python package indexes that host artifacts analyzed by thoth-solver, as well as dependencies for artifacts hosted on other indexes (for example, AVX2-enabled builds of TensorFlow hosted on the AI Center of Excellence index). This procedure and the aggregated data allow Thoth to check how packages form dependencies across different Python package indexes for cross-index Python package resolution.
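
    The pieces of a dependency entry mentioned above (name, extras, version range, environment marker) can be seen by splitting a requirement string. The parser below is a deliberately simplified sketch under stated assumptions: it handles only the common "name[extras] specifier ; marker" shape, does not support URL requirements, and does not evaluate markers. The packaging library provides complete Requirement and Marker implementations.

```python
import re
from typing import List, NamedTuple


class Requirement(NamedTuple):
    name: str
    extras: List[str]
    specifier: str
    marker: str


_REQUIREMENT = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)"  # project name
    r"\s*(?:\[(?P<extras>[^\]]*)\])?"             # optional [extra1,extra2]
    r"\s*(?P<specifier>[^;]*?)"                   # optional version range
    r"\s*(?:;\s*(?P<marker>.*?))?\s*$"            # optional environment marker
)


def parse_requirement(text: str) -> Requirement:
    """Split a simple PEP 508 style requirement string into its parts."""
    match = _REQUIREMENT.match(text)
    if match is None:
        raise ValueError(f"unsupported requirement: {text!r}")
    raw_extras = match.group("extras") or ""
    extras = [extra.strip() for extra in raw_extras.split(",") if extra.strip()]
    return Requirement(
        name=match.group("name"),
        extras=extras,
        # Older metadata wraps the range in parentheses, e.g. "click (>=8.0)".
        specifier=match.group("specifier").strip("() "),
        marker=match.group("marker") or "",
    )


if __name__ == "__main__":
    print(parse_requirement("requests[security,socks] >=2.8.1,<3 ; python_version < '3.10'"))
    print(parse_requirement("click (>=8.0)"))
```

    Whether the marker part is true depends on the environment doing the evaluation, which is why the results are recorded per containerized environment.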

    Note: If a given package is not installable into the containerized environment (due to incompatibilities between Python 2 and 3, or a missing build toolchain, for example), thoth-solver reports information about the failure that can be further post-processed to extract relevant details and classify the error.

    To see how thoth-solver works in practice, take a look at this example output from a thoth-solver run for Flask version 2.0.2, available from PyPI. The result gives information about the dependencies of flask==2.0.2 when installed, at the given point in time, into a containerized environment based on Red Hat Universal Base Image (UBI) for Red Hat Enterprise Linux 8 and running Python 3.8. The containerized environment is available on Quay as solver-rhel-8-py38.

    Using thoth-solver

    The thoth-solver component is part of Project Thoth's cloud-based Python resolver. As part of Thoth's background data aggregation, it collects information about dependencies and makes it available to Thoth's resolver. The Thoth team provides multiple thoth-solver containerized environments, available as prebuilt container images on Quay. Each of them computes dependency information specifically for its target environment (a reproducible environment with a predefined software stack), individually for each desired Python package release.

    Keep in mind that the computed dependency information is specific to the point in time when thoth-solver is run. As packages get new releases, another component in Thoth, the revsolver (or "reverse solver"), can keep the dependency information up to date. The revsolver component uses data that has already been computed by thoth-solver and is available in a queryable form in Thoth's database. In this case, revsolver does not download any artifacts. Instead, it uses the already captured dependency graph to propagate information about a new package release, which then becomes part of the updated ecosystem dependency graph in the database.
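
    The idea behind reverse solving can be sketched as follows. This is a hypothetical illustration, not revsolver's implementation: the graph structure, function names, and release names are assumptions, and version matching is reduced to dotted numeric versions with the basic comparison operators (no pre-releases, wildcards, or "~=").

```python
import operator
import re
from typing import Dict, List, Tuple

_OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"^(>=|<=|==|!=|>|<)\s*([0-9]+(?:\.[0-9]+)*)$")

# Hypothetical stored edges: dependent release -> {dependency name: version range}
Graph = Dict[str, Dict[str, str]]


def _padded(left: str, right: str) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Parse two dotted versions and pad with zeros so that 2.0 equals 2.0.0."""
    a = tuple(int(part) for part in left.split("."))
    b = tuple(int(part) for part in right.split("."))
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version: str, specifier: str) -> bool:
    """Check a version against a comma-separated range (simplified rules)."""
    for clause in (part.strip() for part in specifier.split(",")):
        if not clause:
            continue
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPERATORS[op](*_padded(version, bound)):
            return False
    return True


def dependents_accepting(graph: Graph, package: str, new_version: str) -> List[str]:
    """List stored releases whose declared range admits the new release."""
    return sorted(
        dependent
        for dependent, requirements in graph.items()
        if package in requirements and satisfies(new_version, requirements[package])
    )


if __name__ == "__main__":
    graph = {
        "app-a==1.0": {"werkzeug": ">=2.0"},
        "app-b==1.0": {"werkzeug": ">=2.0,<2.1"},
        "app-c==1.0": {"click": ">=8.0"},
    }
    print(dependents_accepting(graph, "werkzeug", "2.1.0"))
```

    Only the stored ranges are consulted, so a new release can be linked into the graph without downloading or installing any artifact.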

    About Project Thoth

    As part of Project Thoth, we are accumulating knowledge to help Python developers create healthy applications. If you would like to follow updates, feel free to subscribe to our YouTube channel or follow us on the @ThothStation Twitter handle.

    To send us feedback or get involved in improving the Python ecosystem, please reach out through the Thoth Station support repository. You can also contact the Thoth team directly on Twitter. You can report any issues you've spotted in open source Python libraries to the support repository, or write prescriptions for the resolver and send them to our prescriptions repository. By participating in these ways, you can help the Python cloud-based resolver come up with better recommendations.

    Last updated: January 5, 2023
