Container technologies have created a de facto industry standard for developing, deploying, and shipping applications. Containers make it possible to provide more maintainable and self-sustaining runnable units that can be directly managed using cluster orchestrators such as Kubernetes and Red Hat OpenShift.
This article is for developers interested in using intelligent package management to control the quality of container images and provide more robust containerized runtime environments. Our discussion is based on Project Thoth for Python, one of the world's most popular programming languages. The ideas we present can be generalized to other language ecosystems.
Thoth and Python packaging standards
One of our previous articles discussed tools that allow installations of Python modules following the packaging standards provided by the Python Packaging Authority (PyPA). We'll continue that focus in this article.
Note: Anaconda is another packaging solution for Python, but it creates environments that don't conform to PyPA standards, so we won't discuss Anaconda in this article.
Tools such as pip, Pipenv, and Poetry tend to resolve application stacks to the latest library versions available (within the specified version ranges) for the runtime environment they run in. Project Thoth offers more flexibility, proposing packages that meet the developer's quality, security, and performance criteria.
Because Python is a language of choice for data scientists, a very common environment for data preprocessing, data analysis, and data exploration is Jupyter notebook. In a previous article, we described an extension called jupyterlab-requirements that integrates with the tooling we discuss in this article. The extension helps generate reproducible installations inside notebooks and can consume recommendations by Thoth’s recommender system.
A smarter way to analyze container images and predictable stacks
As we've mentioned, container technologies create de facto application standards. Anyone can download prepared container images from container image registries, such as Quay.io, and run the application after some minimal setup. One example of a publicly available image is the Jupyter image, which can be used to spawn a Jupyter notebook environment. In such a case, the image is pulled and run either in a cluster or locally, depending on the developer’s use case.
Container images bundle content that is required to run the application. Project Thoth offers container image analyses that introspect what is present in the container image. Notably, it can extract:
- Information about the operating system
- Information about RPM packages that are present in the container image
- Python packages that are present in the container image and their locations, if multiple virtual environments are available
- Python interpreters and their available versions
- Information about the ABI provided
- Container image metadata as extracted by Skopeo
- Information about other libraries, such as the CUDA version (GPU software) available
This information is automatically extracted from container images, ready to be explored by developers as well as consumed by the cloud-based Python resolver, which offers recommendations based on the content available in container images. The container image analysis is run in an OpenShift cluster and the results are computed using the package-extract component.
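To get a feel for the metadata item in the list above, the following minimal sketch runs the `skopeo inspect` command and keeps a few fields from its JSON output. It assumes Skopeo is installed and on the PATH. The helper names `skopeo_inspect` and `summarize_metadata` are ours, chosen for illustration, and this is not necessarily how the package-extract component invokes Skopeo.

```python
import json
import subprocess


def skopeo_inspect(image: str) -> dict:
    """Run `skopeo inspect` for a remote image and return its parsed JSON output.

    Requires the skopeo binary and network access to the registry.
    """
    result = subprocess.run(
        ["skopeo", "inspect", f"docker://{image}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize_metadata(metadata: dict) -> dict:
    """Keep a few fields that help match an image to a software stack."""
    return {
        "name": metadata.get("Name"),
        "digest": metadata.get("Digest"),
        "os": metadata.get("Os"),
        "architecture": metadata.get("Architecture"),
        "layer_count": len(metadata.get("Layers") or []),
    }
```

For example, `summarize_metadata(skopeo_inspect("quay.io/thoth-station/ps-cv-pytorch:v0.1.2"))` would summarize the image analyzed later in this article. Thoth's own analysis goes further, adding the RPM, Python, ABI, and CUDA details listed above.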
Container images for data science
Thoth additionally provides a set of container images that were identified as suitable for Python developers or data scientists:
- ps-ip is for images suitable for image processing.
- ps-cv is for images designed for computer vision.
- ps-nlp is for images dedicated to natural language processing.
The project makes it easier for developers to create a containerized environment for running applications without needing to fix dependency issues or provide missing content for the environment.
Building container images with artificial intelligence
Project Thoth is associated with Red Hat's Artificial Intelligence Center of Excellence (AICoE) and tightly integrates with AICoE's other tools. AICoE-CI is a service that builds container images using Tekton pipelines under the hood. Once a build is done, the resulting container image is sent to Thoth for analysis. If a container image build fails, AICoE-CI automatically reports the failure to the Thoth backend together with build logs capturing information about the failure. Figure 1 shows how the recommender system gathers information about container images built in AICoE-CI.
Thoth uses the combined build information to provide better recommendations for using the container images produced. If developers are running their applications in noncontainer environments, Thoth can offer guidance on software that doesn't have the issues seen in AICoE-CI during container image builds.
Note: Built container images can be tested using Thoth Dependency Monkey.
Thoth recommendations for containerized applications
Open source resolvers, such as pip, Pipenv, and Poetry, resolve Python software packages inside the environments where the resolvers run. The resolution process can be additionally adjusted using environment markers. Thoth’s cloud resolver goes a step further in this area, serving developers who build container images by accounting for runtime environment information even outside the Python packaging standards.
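To make the limits of environment markers concrete, here is a small sketch that collects a subset of the marker values defined in PEP 508 for the running interpreter. The function name `marker_environment` is ours, and the sketch uses only the standard library rather than any resolver's internals.

```python
import os
import platform
import sys


def marker_environment() -> dict:
    """Collect a subset of the PEP 508 environment marker values."""
    return {
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
        "platform_system": platform.system(),
        # PEP 508 defines python_version as "major.minor".
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "python_full_version": platform.python_version(),
        "implementation_name": sys.implementation.name,
    }
```

A requirement such as `importlib-metadata; python_version < "3.8"` is evaluated against values like these. Note what is missing: nothing here describes installed RPM packages, the CUDA version, or the hardware, which is the kind of information a container image analysis can add.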
The resolver considers the results of container image analyses listed earlier, along with available hardware, to guide the resolution process and come up with the best configuration for a given application. Figure 2 shows how the recommender system (the Thoth resolver implemented in a component called adviser) uses the gathered information.
If no container image is used, Thoth’s resolver falls back to the standard resolution process compatible with the Python packaging standards. In both cases, Thoth’s resolution process additionally offers developers guidance about the software stack in use, such as by adjusting environment variables to make sure the environment is correctly set up.
The recommendation engine uses centralized knowledge about Python software packages as well as software and hardware environments. This knowledge guides the resolution process to satisfy the application's needs. Together with Thoth prescriptions, the container image analyses and post-processed container image build logs provide valuable guidance on all the building blocks of a containerized application (Figure 3).
Use cases for Thoth's cloud resolver and prescriptions
An example of a problem that was fixed by Thoth’s cloud resolver was an issue reported in the flask-openid package. This package was no longer installable into environments with a recent Setuptools package that dropped 2to3 support. To avoid trying to install flask-openid into environments that have this version of Setuptools, Thoth provides a prescription that checks which Setuptools package is shipped in the used container image. The cloud resolver automatically avoids resolving flask-openid versions that would cause installation failures and looks for another resolution path.
Another Thoth prescription declares a requirement for the Git RPM package to be present in the container image in order for the GitPython package to operate. If the base container image used to build the application does not offer Git, the resolver again tries to find another resolution path so that the resulting container image will work.
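The logic behind these two prescriptions can be sketched in a few lines of Python. This is an illustration of the idea only, not Thoth's prescription format or the adviser's implementation. The `ImageAnalysis` class and `blocked_packages` function are hypothetical, and the Setuptools threshold of 58.0.0 is our assumption about the release that removed 2to3 support.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnalysis:
    """Hypothetical, simplified view of a container image analysis result."""

    python_packages: dict = field(default_factory=dict)  # name -> installed version
    rpm_packages: set = field(default_factory=set)  # names of installed RPMs


def version_tuple(version: str) -> tuple:
    """Parse a plain dotted release such as '58.0.4' (no pre-release handling)."""
    return tuple(int(part) for part in version.split("."))


# Assumption: Setuptools removed 2to3 support in release 58.0.0.
SETUPTOOLS_WITHOUT_2TO3 = (58, 0, 0)


def blocked_packages(image: ImageAnalysis) -> dict:
    """Return packages the resolver should steer away from, with a reason."""
    blocked = {}
    setuptools = image.python_packages.get("setuptools")
    if setuptools and version_tuple(setuptools) >= SETUPTOOLS_WITHOUT_2TO3:
        blocked["flask-openid"] = "needs 2to3, which this Setuptools no longer provides"
    if "git" not in image.rpm_packages:
        blocked["gitpython"] = "needs the Git RPM package, which the image does not ship"
    return blocked
```

A resolver that consults such a result would drop the affected candidates and keep searching, which is the "another resolution path" behavior described above.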
Another use case is for developers or data scientists using opencv-python or PyTorch in their environment. In that case, Thoth recommends using a pre-built container image with a computer vision stack built from the ps-cv repository.
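As a rough sketch of that recommendation, the check below looks for computer vision packages among an application's direct dependencies. The function name `suggest_predictable_stack` and the package set are ours. We assume PyTorch appears under its PyPI distribution name, torch, and the real recommendation logic in Thoth may weigh more signals than this.

```python
# Assumption: distribution names that signal a computer vision workload.
# The article names opencv-python and PyTorch (published on PyPI as "torch").
COMPUTER_VISION_PACKAGES = {"opencv-python", "torch"}


def suggest_predictable_stack(requirements):
    """Return 'ps-cv' when a direct dependency hints at computer vision, else None."""
    # Compare names case-insensitively and treat "_" and "-" as equivalent.
    normalized = {name.lower().replace("_", "-") for name in requirements}
    if normalized & COMPUTER_VISION_PACKAGES:
        return "ps-cv"
    return None
```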
Resolving to multiple container images
With the widespread adoption of containers, applications can be split into multiple container images. These images form separate entities that communicate with each other over a specified protocol. So that one resolution process can target multiple container images at the same time, the resolver accepts labeled requests. Resolution still takes place for each container image individually but keeps a shared context. Within this context, labels specify how the resolution should operate so that the stacks resolved for the individual containers fit together and meet the desired criteria (for example, ensuring the proper operation of a communication layer made up of multiple packages that form an application dependency subgraph).
Extending already available container images
Yet another specific use case is extending prebuilt container images. An example is a TensorFlow container image used for model training. If a developer wants to extend the container image, let’s say by installing TensorBoard to visualize the trained model, the developer can ask Thoth for an advisory. If the base container image is supplied, Thoth can adjust the resolution process based on already existing Python packages that are available, and pick the most appropriate TensorBoard package that will work inside the container image.
Feel free to browse the open source database available at our prescriptions repository to find more recommendations for open source Python software packages, including some recommendations not solely dedicated to container images.
Helping the Python community create healthy applications
As part of Project Thoth, we are accumulating knowledge about Python packages to help Python developers create healthy and secure applications. We suggest you analyze some of your container images using Thoth. You can submit an analysis request to Thoth’s endpoints, and they will analyze your container image. See an example container image analysis result for the quay.io/thoth-station/ps-cv-pytorch:v0.1.2 container image. (Note that the file size is 7.4MB.)
To follow updates in the project, please subscribe to the Thoth Station YouTube channel or follow us on Twitter at @ThothStation.
Last updated: September 20, 2023