Using knowledge graphs to discover open source package vulnerabilities

Technology and infrastructure generate an enormous amount of data on a day-to-day basis. Building knowledge out of this data in various real-world domains can be a big challenge. This article describes how to derive concise and precise knowledge from data and use it to track vulnerabilities in the software stack. It presents challenges related to package security and vulnerability and how they can be addressed using a knowledge graph. After reading this article, you'll understand the concept of the knowledge graph and how you can apply it to your domain.

Why open source packages introduce vulnerabilities

Most large organizations employ open source libraries and components to build software used internally and externally. While open source helps solve many problems, developers need to understand how to track security vulnerabilities in all the software they use. Security vulnerabilities can be brought into host software by the libraries and tools on which it depends. The problem grows as more and more dependencies are included.

Ensuring the most secure versions of software dependencies are used can be tedious and time consuming. Although open source package vulnerability databases are frequently updated, tracking those databases on a daily basis, or even per release, can be difficult during medium- and large-scale software development.

Questions you need to consider include:

  • What direct dependencies are vulnerable?
  • What is the recommended version to use for a dependency?
  • What exploits endanger a software stack, given its current dependencies?
  • How do recent additions to the common vulnerabilities and exposures (CVE) list affect the application? What dependencies need to be modified to mitigate the risk?
  • How many projects in an organization are affected by newly reported vulnerabilities?
  • Are there any companion packages that can be used to mitigate the security risk?
  • What security risks are caused by transitive packages getting into software?

Knowledge graphs can help you find answers to these questions in a fast, automated manner.

What are knowledge graphs?

Real-world domains consist of many objects and the relationships between them. Because each object has relationships with others, which in turn have relationships with yet other objects, a complex mesh of interconnected objects (or entities) emerges. In software design, we can represent these kinds of complex relationships between entities with a graph structure.

A knowledge graph represents data about a knowledge domain as a graph structure that exposes the relationships between entities. A knowledge graph can organize huge, complex data in a meaningful way that makes the data easier to understand and consume. The human mind arguably works in a similar way: we relate concepts to one another and follow those links to extract the information we need accurately and quickly.

The advantages of a knowledge graph

To understand the concept of a knowledge graph, let's consider a simple example. Say a company provides on-demand movies as a service. One of its business challenges is to provide more accurate recommendations based on the viewer's watch history. This learning has to be dynamic so the recommendations can change as viewer preferences shift over time.

The company could approach this challenge by arranging all of its data into a sequential data store and fetching recommendations, which are created by filtering the movies based on attributes from the viewer's watch history, such as genre, actors, production house, etc.

There are a few issues with this kind of flat data mining, though:

  • It is time-consuming, as we need to search the entire database for every recommendation.
  • Arranging results in order of most significant to least can be complex.
  • It does not scale well as the number of movies and viewers grows.

Alternatively, we can arrange this movie data in a graph structure, with nodes representing entities like genre, lead actors, producer, release date, etc., and edges representing the relationship. Figure 1 shows the lead actor and genre of three movies.

Figure 1: Movie data arranged in knowledge graph format.

Assume that a viewer has watched only one movie on the company's platform (for example, Terminator 2: Judgment Day) and we have only the preceding information in our knowledge graph. The system can find the other movies with the same lead actor (in this case, Predator and Commando). To build in user preference, the system can also consider genre information. We see that Terminator 2: Judgment Day and Predator both fall under the Science Fiction genre. Given the limited data for this viewer, the recommendation system can easily rank the other movies they might enjoy: Predator first, followed by Commando.
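The ranking described above can be sketched in a few lines of Python. This is only an illustration, not a production recommender: the graph is stored as plain (subject, relation, object) triples, the relation names are made up, and the lead actor's name and Commando's genre are filled in as assumptions because the text does not state them.

```python
from collections import defaultdict

# Tiny knowledge graph as (movie, relation, value) triples.
# Assumptions: the relation names, the actor's name, and Commando's genre.
TRIPLES = [
    ("Terminator 2: Judgment Day", "has_lead_actor", "Arnold Schwarzenegger"),
    ("Terminator 2: Judgment Day", "has_genre", "Science Fiction"),
    ("Predator", "has_lead_actor", "Arnold Schwarzenegger"),
    ("Predator", "has_genre", "Science Fiction"),
    ("Commando", "has_lead_actor", "Arnold Schwarzenegger"),
    ("Commando", "has_genre", "Action"),
]


def build_index(triples):
    """Index edges in both directions so a query follows links instead of scanning."""
    attributes = defaultdict(set)      # movie -> {(relation, value)}
    movies_by_attr = defaultdict(set)  # (relation, value) -> {movie}
    for movie, relation, value in triples:
        attributes[movie].add((relation, value))
        movies_by_attr[(relation, value)].add(movie)
    return attributes, movies_by_attr


def recommend(watched, triples):
    """Rank other movies by how many attribute nodes they share with `watched`."""
    attributes, movies_by_attr = build_index(triples)
    scores = defaultdict(int)
    for attr in attributes[watched]:
        for movie in movies_by_attr[attr]:
            if movie != watched:
                scores[movie] += 1
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda m: (-scores[m], m))


print(recommend("Terminator 2: Judgment Day", TRIPLES))
# ['Predator', 'Commando']
```

Predator shares two attribute nodes with the watched movie (lead actor and genre) and Commando shares one, which produces the ordering described above.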

This is a basic example to illustrate the concept. In practice, you will generally have many nodes and many edges interconnecting them. The benefit of this graph structure is that the cost of a query depends mainly on the nodes and edges it actually visits rather than on the total size of the data, as it does when scanning sequential data. At each step of the traversal, we skip large amounts of irrelevant data, which makes information retrieval quick and efficient.

Knowledge graphs are used in many knowledge domains to serve real-world applications such as:

  • Search engines, to provide more appropriate content to the user.
  • Recommendation systems for e-commerce, entertainment, health, etc.
  • Targeted advertisement systems.
  • Offers and promotions for e-commerce.
  • User behavior analysis.

Advantages that knowledge graphs bring include:

  • Semantic data representation.
  • Explicit knowledge representation.
  • Insights about data.
  • Continuously updated data.
  • Self-learning and adapting system.
  • Integration with machine learning and artificial intelligence.

Now let's take a look at how a knowledge graph can help us address vulnerabilities in software dependencies.

Managing data when assessing package vulnerabilities

Building a knowledge graph for package vulnerabilities starts with understanding the depth and breadth of the data involved in this domain. If an application has just a couple of dependencies, most security questions can be answered by scanning source repositories and vulnerability databases. To find a dependency without known vulnerabilities, you need to match dependencies against non-vulnerable versions. However, the problem gets much more complicated when you have many dependencies, each in turn with its own dependencies (transitive dependencies). Modern software solutions have many direct dependencies on other packages or modules produced by other developers, and these dependencies in turn are built on other packages.
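To make the idea of transitive dependencies concrete, here is a minimal sketch that walks a dependency map breadth-first and returns the packages an application pulls in only indirectly. The package names and the `DIRECT_DEPS` map are hypothetical; a real tool would read this information from lock files or a package registry.

```python
from collections import deque

# Hypothetical map: package -> packages it depends on directly.
DIRECT_DEPS = {
    "my-app": ["web-framework", "http-client"],
    "web-framework": ["template-engine", "http-client"],
    "http-client": ["tls-lib"],
    "template-engine": [],
    "tls-lib": [],
}


def transitive_dependencies(root, direct_deps):
    """Return packages reachable from `root` that are not its direct dependencies."""
    direct = set(direct_deps.get(root, []))
    seen = set(direct)
    queue = deque(direct)
    while queue:
        pkg = queue.popleft()
        for dep in direct_deps.get(pkg, []):
            # Skip packages already visited; this also guards against cycles.
            if dep not in seen and dep != root:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen - direct)


print(transitive_dependencies("my-app", DIRECT_DEPS))
# ['template-engine', 'tls-lib']
```

Even in this toy map, half of the packages that end up in `my-app` were never chosen by its developers, which is why transitive dependencies deserve the same scrutiny as direct ones.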

Many packages have 100 or more versions, with many CVEs filed against those versions. Security and vulnerability data offered by private vendors add even more data requiring analysis. The possible combinations of these disparate sources result in a huge amount of data that needs to be scanned to identify the right combination of versions that have minimal or no vulnerabilities.

The burden grows because the stored data is not static and changes over time. New CVEs may be discovered for older versions, fixes may be released for existing CVEs, and so on. Thus, administrators need a considerable amount of time to keep their systems up to date with the latest and most accurate information.

Vulnerability assessment is too often ignored or looked down on as trivial work by software developers. The assessments require going through lots of data in order to find each vulnerability. When vulnerability assessment is not omitted altogether, it is often pushed off until the release date, when packages get caught and blocked by security scanning software or the security team. Then the developers have to scramble to find the right dependencies and update their release.

Moving from data to knowledge

The solution proposed in this article is to organize the large volume of data about package versions and vulnerabilities from various sources into a knowledge base that can be queried as needed. The knowledge base can provide up-to-date and accurate information immediately. Other advantages include:

  • A simpler, faster way for a developer to check for vulnerabilities.
  • More informed and accurate decisions about dependencies.
  • Security and vulnerability checks that occur much earlier in the development cycle of the software stack.
  • Continuous vulnerability scanning, such as by nightly jobs, or on every code commit to the repository.
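As a rough sketch of what continuous scanning could look like, the class below keeps a reverse index from package versions to the repositories that use them. When a nightly job ingests a newly reported CVE, it can immediately list the affected repositories. The class name, method names, repository names, and CVE IDs are all hypothetical placeholders.

```python
from collections import defaultdict


class VulnerabilityIndex:
    """Reverse index from a package version to the repos and CVEs linked to it."""

    def __init__(self):
        self.repos_by_version = defaultdict(set)
        self.cves_by_version = defaultdict(set)

    def add_dependency(self, repo, version):
        """Record that `repo` depends on `version` (directly or transitively)."""
        self.repos_by_version[version].add(repo)

    def add_cve(self, cve, version):
        """Record a CVE against `version` and return the affected repos, sorted."""
        self.cves_by_version[version].add(cve)
        return sorted(self.repos_by_version.get(version, ()))


index = VulnerabilityIndex()
index.add_dependency("repo:billing", "libfoo@1.2.0")
index.add_dependency("repo:shop-api", "libfoo@1.2.0")
index.add_dependency("repo:reports", "libbar@0.9.1")

# A nightly job ingests a newly reported CVE (placeholder ID).
print(index.add_cve("CVE-0000-0001", "libfoo@1.2.0"))
# ['repo:billing', 'repo:shop-api']
```

The same lookup could run on every commit, so a vulnerable dependency is flagged long before the release date.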

Elements of a knowledge graph for software vulnerabilities

A knowledge graph arranges the entities and relationships into a formal structure. The entities for package dependencies and vulnerabilities are:

  • Source code repository (repo): An entity that hosts one or more packages.
  • Package: A module or submodule that provides a reusable piece of software.
  • Version: A tag associated with a repo or package that identifies the unique instance or release of the source.
  • CVE: A security-related flaw associated with versions of a repo or package.

The relationships between the entities are as follows:

  • Has version: A one-to-many relationship between package and version.
  • Has CVE: A one-to-many relationship between version and CVE.
  • Has fixed version: A one-to-one relationship between CVE and fixed version (if any).
  • Has dependency: A one-to-many relationship between repo/package and version.
  • Has transitive dependency: A one-to-many relationship between repo and version.
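One way to capture these entities and relationships in code is a small typed graph that rejects edges the model does not allow. This is a sketch under assumptions: the `Node` and `KnowledgeGraph` names, the snake_case relation labels, and the sample data are invented for illustration, and a real system would more likely use a graph database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # "repo", "package", "version", or "cve"
    name: str


# Allowed (source kind, relation, target kind) combinations from the list above.
SCHEMA = {
    ("package", "has_version", "version"),
    ("version", "has_cve", "cve"),
    ("cve", "has_fixed_version", "version"),
    ("repo", "has_dependency", "version"),
    ("package", "has_dependency", "version"),
    ("repo", "has_transitive_dependency", "version"),
}


class KnowledgeGraph:
    def __init__(self):
        self._out = defaultdict(set)  # (source, relation) -> {target}

    def add_edge(self, source, relation, target):
        """Add an edge, refusing any combination the schema does not define."""
        if (source.kind, relation, target.kind) not in SCHEMA:
            raise ValueError(f"{source.kind} -{relation}-> {target.kind} is not allowed")
        self._out[(source, relation)].add(target)

    def neighbors(self, source, relation):
        """Return the targets of `relation` edges leaving `source`, sorted by name."""
        return sorted(self._out.get((source, relation), ()), key=lambda n: n.name)


graph = KnowledgeGraph()
package = Node("package", "libfoo")
version = Node("version", "libfoo@1.2.0")
graph.add_edge(package, "has_version", version)
print(graph.neighbors(package, "has_version"))
```

Validating edges against the schema keeps malformed data, such as a CVE that "has a version", out of the graph at ingestion time.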

Figure 2 shows the entities and their relationships.

Figure 2: A knowledge graph showing the entities for package dependencies and vulnerabilities.

This exercise in data modeling has shown that vulnerability information for software packages can be articulated using a graph structure. Carrying out the analysis on real package data can form a knowledge graph you can use to protect your own software. Many graph databases, including open source ones, are available to represent your knowledge graph.

Starting from the repo node, the system can find all the package versions (direct and transitive) that are vulnerable. It can build the list of CVEs associated with these package versions. Further, for each CVE, the graph can also provide the fixed version of the package (if any) to address the vulnerability. These data and much more can be retrieved from a simple knowledge graph of common vulnerability data.
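The traversal just described can be sketched as follows. The edge list, relation labels, package names, and CVE IDs are hypothetical placeholders chosen for illustration; the point is only the path the query follows: repo, then versions, then CVEs, then fixed versions.

```python
from collections import defaultdict

# (source, relation, target) triples; every name and CVE ID is a made-up placeholder.
EDGES = [
    ("repo:shop-api", "has_dependency", "libfoo@1.2.0"),
    ("repo:shop-api", "has_dependency", "libqux@3.1.0"),
    ("repo:shop-api", "has_transitive_dependency", "libbar@0.9.1"),
    ("libfoo@1.2.0", "has_cve", "CVE-0000-0001"),
    ("libbar@0.9.1", "has_cve", "CVE-0000-0002"),
    ("CVE-0000-0001", "has_fixed_version", "libfoo@1.2.5"),
]


def build_index(edges):
    """Group targets by (source, relation) so each hop is a dictionary lookup."""
    out = defaultdict(list)
    for source, relation, target in edges:
        out[(source, relation)].append(target)
    return out


def vulnerability_report(repo, edges):
    """Walk repo -> versions -> CVEs -> fixed versions and collect the findings."""
    out = build_index(edges)
    report = []
    for relation in ("has_dependency", "has_transitive_dependency"):
        for version in out.get((repo, relation), []):
            for cve in out.get((version, "has_cve"), []):
                fixed = out.get((cve, "has_fixed_version"), [])
                report.append({
                    "version": version,
                    "via": relation,
                    "cve": cve,
                    "fixed_version": fixed[0] if fixed else None,
                })
    return report


for finding in vulnerability_report("repo:shop-api", EDGES):
    print(finding)
```

In this sample data, `libqux@3.1.0` has no CVE edges and therefore does not appear in the report, while the finding for `libbar@0.9.1` carries `None` as its fixed version because no fix has been recorded.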

Conclusion

The concept of a knowledge graph can be applied to any domain that contains many entities and relationships between them. You can curate your data into a compact knowledge graph that provides the right information accurately, quickly, and efficiently. Because the graph can be updated continuously, it also stays fresh and relevant. Choosing the appropriate database engine according to your business needs will make this solution more robust and allow it to scale over time.

Last updated: February 5, 2024