AI software stack inspection with Thoth and TensorFlow

Project Thoth develops open source tools that enhance the day-to-day life of developers and data scientists. Thoth uses machine-generated knowledge to boost the performance, security, and quality of your applications through reinforcement learning (RL), a branch of artificial intelligence (AI). This machine-learning approach is implemented in Thoth adviser, and Thoth integrations use it to recommend a software stack based on user inputs.

In this article, I introduce a case study—a recent inspection of a runtime issue when importing TensorFlow 2.1.0—to demonstrate the human-machine interaction between the Thoth team and Thoth components. By following the case study from start to finish, you will learn how Thoth gathers and analyzes some of the data to provide advice to its users, including bots such as Kebechet, AI-backed continuous integration pipelines, and developers using GitHub apps.

Both the Thoth machinery and team rely on bots and automated pipelines running on Red Hat OpenShift. Thoth takes a variety of inputs to determine the correct advice:

  • Solver, which Thoth uses to discover if something can be installed in a particular runtime environment, such as Red Hat Enterprise Linux (RHEL) 8 with Python 3.6.
  • Security indicators, which uncover vulnerabilities of different kinds and feed into security advice.
  • Project meta information, such as project-maintenance status or development-process behavior that affects the overall project.
  • Inspections, which Thoth uses to discover code quality issues or performance across packages.

This article focuses on inspections. I will show you the results from an automated software stack inspection run through Project Thoth’s Dependency Monkey and Amun components. Thoth uses automated inspections to introduce new advice about software stacks for Thoth users. Another way to integrate advice could be via automated pipelines that can:

  • Boost performance
  • Optimize machine learning (ML) model inference
  • Ensure that there are no failures during the model runtime (for example, during inference)
  • Avoid using software stacks that do not guarantee security

Thoth components: Amun and Dependency Monkey

Given the list of packages that should be installed and the hardware requested to run the application, Amun executes the requested application stack in the requested environment. Amun acts as an execution engine for Thoth. Applications are then built and tested using Thoth Performance Indicators (PI). See Amun’s README documentation for more information about this service.

Another Thoth component, Dependency Monkey, can be used to schedule Amun. Dependency Monkey was designed to automate the evaluation of certain aspects of a software stack, such as code quality or performance. Therefore, it aims to automatically verify software stacks and aggregate relevant observations.

Using these two components, the Thoth team created Thoth Performance Datasets, which contain observations about the performance of software stacks. For example, Thoth Performance Datasets could use PIconv2d to obtain performance data for different application types (such as machine learning) and code quality. They could then use a performance indicator like PiImport to discover errors during an application run.

Transparent and reproducible datasets

In the spirit of open source, the Thoth team wants to guarantee that the datasets and knowledge that we collect and use are transparent and reproducible. Machine learning models, such as the reinforcement learning model leveraged by Thoth Adviser, should be as transparent as the datasets they are working on.

For transparency, we’ve introduced Thoth Datasets, where we share the notebooks that we used to analyze a data collection and all of the results. We encourage anyone interested in the topic to use Thoth Datasets to verify our findings or for other purposes.

For reproducibility, we’ve introduced Dependency Monkey Zoo, where we collect all of the specifications used to run an analysis. Having all of the specs in one place allows us to reproduce the results of a study. Anyone can use the specs to perform similar studies in different environments for comparison.

Case study: Automated software stack inspection for TensorFlow 2.1.0

For this case study, we will use Thoth’s Amun and Dependency Monkey components to automatically produce data. We’ll then introduce reusable Jupyter notebook templates to extract specific information from the datasets. Finally, we’ll create new advice based on the results.

The human side of this human-machine interaction focuses on assessing the quality of the results and formulating the advice. The rest of the process is machine-automated. Automation makes the process easy to repeat to produce a new source of information for analysis.

In the next sections, I introduce the initial problem, then describe the analysis performed and the resulting new advice for Thoth users.

Initial request

Our goal with this inspection is to analyze build- and runtime failures when importing TensorFlow 2.1.0 and use these to derive observations about the quality of the software stack.

For this analysis, Dependency Monkey sampled the state space of all of the possible TensorFlow==2.1.0 stacks (from upstream builds). For inspection purposes, we built and ran the application using the PiMatmul performance indicator.
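To make "sampling the state space" more concrete, here is a minimal sketch of the idea. It is not Dependency Monkey's implementation: the `CANDIDATES` table and the `sample_stacks` helper are hypothetical, only a few of the releases seen in the inspection output are listed, and dependency resolution (which a real tool would need) is ignored entirely.

```python
import itertools
import random

# Hypothetical candidate versions for illustration; only a handful of the
# releases that appear in the inspection output are listed here.
CANDIDATES = {
    "tensorflow": ["2.1.0"],
    "six": ["1.12.0", "1.13.0"],
    "urllib3": ["1.5", "1.24.1", "1.25.7"],
}


def sample_stacks(candidates, count, seed=0):
    """Return up to `count` distinct, fully pinned stacks drawn at random."""
    names = sorted(candidates)
    # Each combination of one version per package is one point in the state space.
    state_space = [
        dict(zip(names, versions))
        for versions in itertools.product(*(candidates[name] for name in names))
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(state_space, min(count, len(state_space)))


stacks = sample_stacks(CANDIDATES, count=4)
for stack in stacks:
    print(", ".join(f"{name}=={version}" for name, version in stack.items()))
```

Each printed line is one pinned stack that could be handed to an execution engine for building and running the inspection script.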

The sections below detail the Dependency Monkey inspection results and the resulting analysis.

The first analysis

From the analysis of the inspection results, we discovered that TensorFlow 2.1.0 raised errors in approximately 50% of the inspections in a run. The error is shown in the following output from the Jupyter Notebook:

2020-09-05 07:14:36.333589: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-05 07:14:36.333811: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-05 07:14:36.333844: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/opt/app-root/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/opt/app-root/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.5) or chardet (2.3.0) doesn't match a supported version!
  RequestsDependencyWarning)
Traceback (most recent call last):
  File "/home/amun/script", line 14, in <module>
    import tensorflow as tf
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/__init__.py", line 101, in <module>
    from tensorflow_core import *
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/__init__.py", line 40, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/opt/app-root/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/__init__.py", line 95, in <module>
    from tensorflow.python import keras
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/__init__.py", line 27, in <module>
    from tensorflow.python.keras import models
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/__init__.py", line 27, in <module>
    from tensorflow.python.keras import models
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/models.py", line 25, in <module>
    from tensorflow.python.keras.engine import network
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py", line 46, in <module>
    from tensorflow.python.keras.saving import hdf5_format
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 32, in <module>
    from tensorflow.python.keras.utils import conv_utils
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow_core/python/keras/utils/conv_utils.py", line 22, in <module>
    from six.moves import range  # pylint: disable=redefined-builtin
ImportError: cannot import name 'range'

Specifically, we could see that some combinations of six and urllib3 produced that error, as described in the following output:

=============================================
urllib3
=============================================

In successfull inspections:
['urllib3-1.10.4-pypi-org' 'urllib3-1.16-pypi-org' 'urllib3-0.3-pypi-org'
'urllib3-1.21.1-pypi-org' 'urllib3-1.25.1-pypi-org'
'urllib3-1.25-pypi-org' 'urllib3-1.18.1-pypi-org'
'urllib3-1.24.1-pypi-org' 'urllib3-1.10.1-pypi-org'
'urllib3-1.10.3-pypi-org' 'urllib3-1.25.7-pypi-org'
'urllib3-1.10-pypi-org' 'urllib3-1.7.1-pypi-org' 'urllib3-1.13-pypi-org'
'urllib3-1.19.1-pypi-org' 'urllib3-1.11-pypi-org'
'urllib3-1.10.2-pypi-org' 'urllib3-1.15.1-pypi-org'
'urllib3-1.25.3-pypi-org' 'urllib3-1.13.1-pypi-org'
'urllib3-1.21-pypi-org' 'urllib3-1.17-pypi-org' 'urllib3-1.23-pypi-org']

In failed inspections:
['urllib3-1.5-pypi-org']

In failed inspections but not in successfull:
{'urllib3-1.5-pypi-org'}

In failed inspections and in successfull:
set()


=============================================
six
=============================================

In successfull inspections:
['six-1.13.0-pypi-org' 'six-1.12.0-pypi-org']

In failed inspections:
['six-1.13.0-pypi-org' 'six-1.12.0-pypi-org']

In failed inspections but not in successfull:
set()

In failed inspections and in successfull:
{'six-1.13.0-pypi-org', 'six-1.12.0-pypi-org'}

Therefore, we discovered that a single urllib3 release (1.5) appeared in every failed inspection and in none of the successful ones, while the six releases (1.12.0 and 1.13.0) appeared in both failed and successful inspections.
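The comparison behind this kind of output comes down to set operations on the package versions seen in failed and successful inspections. The following is a minimal sketch of that idea, not the notebook's actual code: the `inspections` records, their field names, and both helper functions are hypothetical and heavily simplified.

```python
# Hypothetical, simplified inspection records: each one lists the pinned
# packages of a stack and whether the inspection script failed.
inspections = [
    {"packages": {"six": "1.12.0", "urllib3": "1.5"}, "failed": True},
    {"packages": {"six": "1.13.0", "urllib3": "1.5"}, "failed": True},
    {"packages": {"six": "1.12.0", "urllib3": "1.24.1"}, "failed": False},
    {"packages": {"six": "1.13.0", "urllib3": "1.25.7"}, "failed": False},
]


def versions_seen(records, package, failed):
    """Collect the versions of `package` across records with the given outcome."""
    return {
        record["packages"][package]
        for record in records
        if record["failed"] is failed and package in record["packages"]
    }


def compare(records, package):
    """Split versions into those seen only in failures and those seen in both."""
    in_failed = versions_seen(records, package, failed=True)
    in_success = versions_seen(records, package, failed=False)
    return {
        "only_in_failed": in_failed - in_success,
        "in_both": in_failed & in_success,
    }


for package in ("urllib3", "six"):
    print(package, compare(inspections, package))
```

A version that lands in `only_in_failed` is a strong suspect, while versions in `in_both` cannot explain the failure on their own.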

The second analysis

For our next step, we ran another analysis to narrow down the failing cases. For this run, we used a newly created performance indicator called PiImport, as described in Table 1.

Table 1: The PiImport performance indicator.
Description: Dependency Monkey sampled the state space of all the possible TensorFlow==2.1.0 stacks (from upstream builds). The application was built and run using the PiImport performance indicator.
Specification: Dependency Monkey specification
Goal: Identify the specific versions that fail, in order to produce the final advice.
Reference: Issue
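An import-only check is cheap to run because it does no computation beyond loading the module. The sketch below illustrates the general idea using only the standard library. It is not the actual PiImport script: the `check_import` function and the shape of its report are assumptions made for illustration.

```python
import importlib
import json
import traceback


def check_import(module_name):
    """Try to import a module and report the outcome as a dictionary."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError is the common case, but not the only one
        return {
            "module": module_name,
            "ok": False,
            "error_type": type(exc).__name__,
            "error": str(exc),
            "traceback": traceback.format_exc(),
        }
    return {"module": module_name, "ok": True}


# Demonstrated with a standard-library module and a deliberately missing one;
# in the case study, the module under test would be "tensorflow".
for name in ("json", "no_such_module_for_this_sketch"):
    report = check_import(name)
    print(json.dumps({key: report[key] for key in ("module", "ok")}))
```

Emitting a structured report rather than a bare exit code makes it easier to aggregate the results of many inspections afterward.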

Results of the second analysis

From the new analysis, we were able to identify all of the specific versions of urllib3 and six that did not work together and that were causing issues during runtime. The output in Figure 1 shows the incompatible versions of the two packages.

Figure 1: Identifying the incompatible versions of urllib3 and six that prevent TensorFlow 2.1.0 from running.

The advice

All of this backtracing led to an adviser step called TensorFlow21Urllib3Step. With this step, we can penalize software stacks containing the specific versions of urllib3 that cause runtime issues when attempting to import TensorFlow 2.1.0. The following prediction, created by Thoth, results in a higher-quality software stack for users.

Table 2: The TensorFlow21Urllib3Step adviser step.
Title: TensorFlow in version 2.1 can cause runtime errors when imported, caused by incompatibility between urllib3 and six packages.
Issue description: Package urllib3 in some versions is shipped with a bundled version of six, which has its own mechanism for imports and import context handling. Importing urllib3 in the TensorFlow codebase causes initialization of the bundled six module, which collides with a subsequent import from unbundled six modules.
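The penalizing logic can be pictured as a small scoring rule applied to each candidate stack. The sketch below is a simplified illustration only: it does not use the thoth-adviser step interface, and `AFFECTED_URLLIB3`, `PENALTY`, and `score_stack` are hypothetical names. Only urllib3 1.5 is listed as affected, because it is the one release visible in the failed inspections from the first analysis.

```python
# Assumption: only urllib3 1.5 is listed, because it is the release that
# appears in the failed inspections shown in the first analysis output.
AFFECTED_URLLIB3 = {"1.5"}
PENALTY = -1.0  # hypothetical score adjustment

JUSTIFICATION = (
    "TensorFlow in version 2.1 can cause runtime errors when imported, "
    "caused by incompatibility between urllib3 and six packages."
)


def score_stack(stack):
    """Return (score_delta, justification) for a pinned stack given as a dict."""
    uses_tf_21 = stack.get("tensorflow") == "2.1.0"
    uses_affected_urllib3 = stack.get("urllib3") in AFFECTED_URLLIB3
    if uses_tf_21 and uses_affected_urllib3:
        return PENALTY, JUSTIFICATION
    return 0.0, None


good = {"tensorflow": "2.1.0", "six": "1.13.0", "urllib3": "1.25.7"}
bad = {"tensorflow": "2.1.0", "six": "1.13.0", "urllib3": "1.5"}
print(score_stack(good))
print(score_stack(bad))
```

Returning a justification alongside the score lets a recommender explain to the user why a stack was ranked lower.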

You can find the complete issue description, and the recommended resolution, here.
