Continuous learning in Project Thoth using Kafka and Argo

Project Thoth provides Python programmers with information about support for packages they use, dependencies, performance, and security. Right now it focuses on pre-built binary packages hosted on the Python Package Index (PyPI) and other Python indexes. Thoth gathers metrics such as the following:

  • Solvers indicate whether a package can be installed on a particular runtime environment, such as Red Hat Enterprise Linux 8 running Python 3.6.
  • Security indicators turn up vulnerabilities and provide security advice by optimizing a software stack to minimize our computed security vulnerability score.
  • Project meta-information investigates project maintenance status and development process behavior that affects the overall project.
  • Amun and Dependency Monkey look for code quality issues or performance problems across packages.

Thoth's main role is to advise programmers about different software stacks based on requirements specified by the programmer. The component thoth-adviser then produces a locked software stack.

This article shows the tools and workflows that let Thoth intelligently respond to programmer requests when it can't find the relevant packages or related information.

How Thoth updates its knowledge of packages

In an ideal world, Thoth would have absolute knowledge of all versions of all Python packages. But in reality, users often request advice for a version or package that Thoth has not seen. Figure 1 shows the number of new versions released daily. PyPI alone grows by 500 to 2,000 package versions per day, so Thoth is unlikely ever to have perfect knowledge.

Figure 1. Python package version releases published to PyPI per day from Oct. 27 to Nov. 2, 2020.

Thoth is trained to learn from its failures to find packages. When programmers request packages that Thoth doesn't know about, it schedules solvers to add them. The next section describes how Thoth uses messages and investigators to implement continuous learning, adding knowledge of new packages and versions to its database.

Events and messages for missing packages

Using a messaging/event platform, Thoth generates an event for each failure to find a package. These events are sent to Kafka, a highly scalable messaging platform maintained by the Apache Software Foundation. A consumer picks up each event and uses Argo, a container-native workflow engine for Kubernetes, to run the workflows that try to discover the missing package.

thoth-messaging acts as a layer over the Confluent Kafka (confluent-kafka-python) package to create Thoth-specific messages and facilitate the creation of a producer or consumer. Support from Confluent offers confidence as to Confluent Kafka's long-term availability. confluent-kafka-python, in turn, wraps librdkafka, a widely used C client library for Kafka.
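
To make the idea of a Thoth-specific message more concrete, here is a minimal sketch of how such a message could be defined and serialized before it is handed to a Kafka producer. The class name matches a message discussed later in this article, but the fields, the topic name, and the helper functions are assumptions made for illustration; they are not thoth-messaging's actual API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UnresolvedPackageMessage:
    """Hypothetical payload; the field names are assumptions."""

    package_name: str
    package_version: str
    index_url: str


# Hypothetical topic name, not taken from Thoth's configuration.
TOPIC = "thoth.unresolved-package"


def serialize(message: UnresolvedPackageMessage) -> bytes:
    """Encode the message as UTF-8 JSON, a common choice for Kafka values."""
    return json.dumps(asdict(message), sort_keys=True).encode("utf-8")


def deserialize(raw: bytes) -> UnresolvedPackageMessage:
    """Rebuild the message on the consumer side."""
    return UnresolvedPackageMessage(**json.loads(raw.decode("utf-8")))


message = UnresolvedPackageMessage("numpy", "1.19.4", "https://pypi.org/simple")
payload = serialize(message)

# With confluent-kafka-python, the payload would be sent roughly like this:
#   Producer({"bootstrap.servers": "<broker>"}).produce(TOPIC, value=payload)
```

Keeping serialization in one place means producers and consumers agree on the wire format, which is the kind of convenience a wrapper layer can offer.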

Investigators and workflows

The core of continuous learning in Thoth is thoth-investigator, a Kafka message consumer that handles all message subscriptions sent through Confluent Kafka by the thoth-messaging library. The logic behind each consumer can be as simple as a remote function call to schedule a workflow; it can also involve more complex logic that transforms message contents or opens issues and pull requests on different Git services.
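
One way to picture a consumer that routes each message type to its own logic is a small handler registry. The sketch below is an assumption about how such routing could look, not thoth-investigator's real code: the decorator, the `dispatch` function, and the message fields are all hypothetical, and a list stands in for the remote call that would schedule a workflow.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]

# Maps a message type name to the function that handles it.
HANDLERS: Dict[str, Handler] = {}

# Stand-in for remote calls to a workflow engine.
scheduled_workflows: List[str] = []


def handles(message_type: str) -> Callable[[Handler], Handler]:
    """Register a function as the handler for one message type."""

    def register(func: Handler) -> Handler:
        HANDLERS[message_type] = func
        return func

    return register


@handles("UnresolvedPackageMessage")
def schedule_solver(contents: dict) -> None:
    # The simplest kind of handler: schedule one workflow per message.
    scheduled_workflows.append(
        f"solver:{contents['package_name']}=={contents['package_version']}"
    )


def dispatch(message_type: str, contents: dict) -> bool:
    """Route a message to its handler; return False for unknown types."""
    handler = HANDLERS.get(message_type)
    if handler is None:
        return False
    handler(contents)
    return True


handled = dispatch(
    "UnresolvedPackageMessage",
    {"package_name": "flask", "package_version": "1.1.2"},
)
ignored = dispatch("UnknownMessage", {})
```

Adding support for a new message type then only requires registering one more handler function.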

By deploying thoth-investigator in one namespace, Thoth is able to rely on a single component that has access to the other namespaces. This reduces the need to use role binding so that different components can access different namespaces.

Continuous learning

This section describes two common failures that prompt Thoth to look for new information.

An adviser fails because it lacks the knowledge needed to provide advice

When a user requests advice, an adviser workflow is triggered depending on the integration used to interact with Thoth (see Thoth integrations). In this example, we'll use Kebechet, the GitHub app integration. When the workflow ends, Thoth provides advice to the programmer in the form specific to the integration: in this case, a check run shown in a GitHub pull request such as this example.

When Thoth fails because knowledge is missing, the logs indicate which package is missing. Using the workflow shown in Figure 2, Thoth discovers the missing information and generates the advice to return to the programmer.

Figure 2. The workflow when an adviser has to discover missing information.

A simplified view of the workflow follows.

  1. The adviser workflow sends an UnresolvedPackageMessage message to thoth-investigator.
  2. thoth-investigator consumes the event messages and schedules solvers to learn about missing information.
  3. During the solver workflow, the investigator receives a SolvedPackageMessage message to indicate that the investigator should schedule the next workflows (i.e., security indicators).
  4. The solver workflow sends an AdviserReRunMessage message, which contains the information for the investigator to reschedule the advice that failed.
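
The four steps above can be simulated with an in-memory queue in place of Kafka. This is a sketch under stated assumptions: the message names come from this article, but the message fields, the `run_investigator` function, and the workflow labels are hypothetical, and the messages that the solver workflow would send are queued inline rather than produced by a real workflow.

```python
from collections import deque
from typing import Deque, List, Tuple

Message = Tuple[str, dict]


def run_investigator(first: Message) -> List[str]:
    """Process messages in order and return the workflows scheduled."""
    queue: Deque[Message] = deque([first])
    scheduled: List[str] = []

    while queue:
        kind, body = queue.popleft()
        package = f"{body['name']}=={body['version']}"

        if kind == "UnresolvedPackageMessage":
            # Step 2: schedule a solver for the missing package.
            scheduled.append(f"solver({package})")
            # Steps 3 and 4: simulate the messages the solver workflow sends.
            queue.append(("SolvedPackageMessage", body))
            queue.append(("AdviserReRunMessage", body))
        elif kind == "SolvedPackageMessage":
            # Step 3: the package is solved, so schedule the next workflows.
            scheduled.append(f"security-indicators({package})")
        elif kind == "AdviserReRunMessage":
            # Step 4: reschedule the adviser run that failed.
            scheduled.append(f"adviser({body['adviser_run']})")

    return scheduled


# Step 1: the adviser workflow reports a package it could not resolve.
result = run_investigator(
    (
        "UnresolvedPackageMessage",
        {"name": "flask", "version": "1.1.2", "adviser_run": "adviser-123"},
    )
)
```

The point of the sketch is the chaining: each message triggers a workflow, and workflows in turn emit the messages that drive the next step.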

Thoth's security indicator workflow fails because a package or source distribution is missing

Thoth generates alerts if it has not performed security indicator (SI) analysis or if a new package becomes available. The investigator consumes these messages and starts new SI workflows. When a package's source code is available to Thoth, the system runs the SIs and stores the generated data. However, sometimes PyPI has only binary package releases available. Without a source distribution, Thoth cannot do static code analysis.

In such cases, the system sends a message back to the investigator, which sets a flag in the database to indicate that security information is missing. Thoth stores these errors so that workflows fail only once.

Similarly, the investigator updates the corresponding flag in Thoth's database after receiving a MissingVersionMessage message indicating that a package version has gone missing. Thoth will no longer use this package version when it gives advice.
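
The effect of these flags can be illustrated with a small in-memory stand-in for the database. Everything in this sketch is an assumption made for illustration: the class names, the flag names, and the methods are hypothetical and do not reflect Thoth's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PackageKey = Tuple[str, str]  # (package name, package version)


@dataclass
class PackageFlags:
    si_source_missing: bool = False  # no source distribution for SI analysis
    version_missing: bool = False  # version is no longer available


class KnowledgeBase:
    """In-memory stand-in for the database that stores the flags."""

    def __init__(self) -> None:
        self._flags: Dict[PackageKey, PackageFlags] = {}

    def flags(self, key: PackageKey) -> PackageFlags:
        return self._flags.setdefault(key, PackageFlags())

    def should_schedule_si(self, key: PackageKey) -> bool:
        """Skip SI workflows that are already known to fail."""
        flags = self.flags(key)
        return not (flags.si_source_missing or flags.version_missing)

    def usable_for_advice(self, key: PackageKey) -> bool:
        """Exclude package versions that have gone missing."""
        return not self.flags(key).version_missing


kb = KnowledgeBase()
binary_only = ("example-binary-only", "1.0.0")  # hypothetical package
withdrawn = ("example-withdrawn", "2.0.0")  # hypothetical package

first_attempt = kb.should_schedule_si(binary_only)
kb.flags(binary_only).si_source_missing = True  # the SI workflow failed once
second_attempt = kb.should_schedule_si(binary_only)

kb.flags(withdrawn).version_missing = True  # a missing version was reported
```

Recording the failure is what lets the system avoid rescheduling a workflow that cannot succeed.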

Figure 3 shows the workflow for missing security information.

Figure 3. The workflow to handle missing security information.

Conclusion

With a constantly evolving supply of information, providing guarantees to users is difficult. Thoth aggregates information as needed through event-driven learning by using event streams (in Kafka) to trigger complex container workflows (in Argo). Both technologies are highly extensible, so new features are easy to add.

Last updated: August 11, 2023