Kafka Connect: Build and Run Data Pipelines
Data is often spread across many different systems, which can make it hard for organizations to use it to gain insights and provide innovative services to their customers.
Kafka Connect is a framework for integrating other systems with Apache Kafka, allowing data to be easily moved, reused, combined, or processed. You can use Kafka Connect to stream changes out of a database and into Kafka, enabling other services to easily react in real time.
With this practical guide, authors Mickael Maison and Kate Stanley show data engineers, site reliability engineers, and application developers how to build data pipelines between Kafka clusters and a variety of data sources and sinks. You will learn how to use connectors, configure the Kafka Connect framework, monitor and operate it in production, and even develop your own source and sink connectors.
- Get an introduction to Kafka Connect’s capabilities and use cases
- Design data and event streaming pipelines that use Kafka Connect
- Configure and operate Kafka Connect environments at scale
- Deploy secured and highly available Kafka Connect clusters
- Build sink and source connectors, single message transforms, and converters
About the authors
Mickael Maison is a principal software engineer at Red Hat, and chair of the project management committee for Apache Kafka. He writes a monthly Kafka digest for Red Hat Developer.
Kate Stanley is a principal software engineer at Red Hat, a technical speaker, and a Java Champion.
Declarative Pipeline Definition
Kafka Connect allows you to declaratively define your pipelines. This means that by combining connector plug-ins, you can build powerful data pipelines without writing any code. Pipelines are defined using JSON (or, in standalone mode, properties files) that describes the plug-ins to use and their configurations. This lets data engineers focus on their use cases and abstracts away the intricacies of the systems they are interacting with.
To define and operate pipelines, Kafka Connect exposes a REST API. This means you can easily start, stop, configure, and track the health and status of all your data pipelines.
Once a pipeline is created via the REST API, Kafka Connect automatically instantiates the necessary plug-ins on the available workers in the Connect cluster.
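As a sketch of what this looks like, the following JSON payload could be sent to the REST API's `POST /connectors` endpoint to create a pipeline using the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic name are hypothetical placeholders:

```json
{
  "name": "file-source-example",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "example-topic"
  }
}
```

No code is required: the `connector.class` field names the plug-in to instantiate, and the remaining fields configure it.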
Part of Apache Kafka
Kafka Connect is part of the Apache Kafka project and is tailor-made to work with Kafka. Apache Kafka is an open source project, which means Kafka Connect benefits from a large and active community. As mentioned, there are hundreds of available plug-ins for Kafka Connect that have been created by the community. Kafka Connect receives improvements and new features with each Kafka release. These changes range from usability updates to alterations that allow Kafka Connect to take advantage of the latest Kafka features.
For developers and administrators who already use and know Kafka, Kafka Connect provides an integration option that doesn’t require a new system and reuses many of the Kafka concepts and practices. Internally, Kafka Connect uses regular Kafka clients, so it has a lot of similar configuration settings and operation procedures.
Although it’s recommended to always run the latest version of Kafka and Kafka Connect, you aren’t required to do so. The Kafka community works hard to make sure that older clients are supported for as long as possible. This means you are always able to upgrade your Kafka and Kafka Connect clusters independently. Similarly, the Kafka Connect APIs are developed with backward compatibility in mind. This means you can use plug-ins that were developed against an older or newer version of the Kafka Connect API than the one you are running.
When Kafka Connect is run in distributed mode, it needs somewhere to store its configuration and status. Rather than requiring a separate storage system, Kafka Connect stores everything it needs in Kafka.
Now that you understand what Kafka Connect is, let’s go over some of the use cases where it excels.
Kafka Connect can be used for a wide range of use cases that involve getting data into or out of Kafka. In this section we explore Kafka Connect’s most common use cases and explain the benefits they provide for managing and processing data.
The use cases are:
- Capturing database changes
- Mirroring Kafka clusters
- Building data lakes
- Aggregating logs
- Modernizing legacy systems
Capturing Database Changes
A common requirement for data pipelines is for applications to track changes in a database in real time. This use case is called change data capture (CDC).
There are a number of connectors for Kafka Connect that can stream changes out of databases in real time. This means that instead of having many applications querying the database, you only have one: Kafka Connect. This reduces the load on the database and makes it much easier to evolve the schema of your tables over time.
Kafka Connect can also transform the data by imposing a schema, validating data, or removing sensitive data before it is sent to Kafka. This gives you better control over other applications’ views of the data.
There is a subset of connector plug-ins that remove the need to query the database at all. Instead of querying the database, they access the change log file that keeps a record of updates, which is a more reliable and less resource-intensive way to track changes.
The Debezium project provides connector plug-ins for many popular databases that use the change log file to generate events. In Chapter 5, we demonstrate two different ways to capture changes from a MySQL database: using a Debezium connector, and using a JDBC connector that performs query-based CDC.
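To give a flavor of log-based CDC, here is a minimal sketch of a Debezium MySQL connector configuration; the hostname, credentials, and database name are placeholders, and the exact property names can vary between Debezium versions:

```json
{
  "name": "mysql-cdc-example",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "cdc-user",
    "database.password": "cdc-password",
    "database.include.list": "inventory",
    "topic.prefix": "example"
  }
}
```

With a configuration along these lines, each committed change to the `inventory` database is read from MySQL's change log and produced as an event to a Kafka topic, without the connector ever querying the tables directly.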
Mirroring Kafka Clusters
Another popular use case of Kafka Connect is to copy data from one Kafka cluster to another. This is called mirroring and is a key requirement in many scenarios, such as building disaster recovery environments, migrating clusters, or doing geo-replication. Although Kafka has built-in resiliency, in production-critical deployments it can be necessary to have a recovery plan in case your infrastructure is affected by a major outage. Mirroring allows you to synchronize multiple clusters to minimize the impact of failures.
You might also want your data available in different clusters for other reasons. For example, you might want to make it available to applications running in a different data center or region, or to have a copy with the sensitive information removed.
The Kafka project provides MirrorMaker to mirror data and metadata between clusters. MirrorMaker is a set of connectors that can be used in various combinations to fulfill your mirroring requirements. We cover how to correctly deploy and manage these in Chapter 6.
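As an illustration, MirrorSourceConnector, one of the MirrorMaker connectors, can be configured roughly as follows to copy selected topics from one cluster to another; the cluster aliases, bootstrap servers, and topic pattern are placeholders:

```json
{
  "name": "mirror-source-example",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "primary",
    "target.cluster.alias": "backup",
    "source.cluster.bootstrap.servers": "primary-kafka:9092",
    "target.cluster.bootstrap.servers": "backup-kafka:9092",
    "topics": "orders.*"
  }
}
```

The cluster aliases identify each cluster, so records mirrored from `primary` can be distinguished from records produced directly on the target cluster.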
Building Data Lakes
You can use Kafka Connect to copy data into a purpose-built data lake or archive it to cost-effective storage like Amazon Simple Storage Service (Amazon S3). This is especially useful if you need to keep large amounts of data or retain data for a long time (e.g., for auditing purposes). If the data is needed again in the future, you can always import it back with Kafka Connect.
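As a rough sketch, a community-provided S3 sink connector (here, Confluent's) can archive a topic to a bucket; the bucket, region, and topic names are placeholders, and the property names depend on which connector you choose:

```json
{
  "name": "s3-archive-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders",
    "s3.bucket.name": "example-archive-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

Here `flush.size` controls how many records are accumulated before a file is written to the bucket, trading off object count against latency.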