Consistent access and delivery with Data Integration

Data integration patterns help create a unified, accurate, and consistent view of enterprise data within an organization. This data is often dissimilar, living in different locations and being stored in a variety of formats.

The approaches used to achieve data integration goals will depend largely on the Quality-of-Service (QoS) and usage characteristics surrounding different sets of data. A data integration strategy helps to logically—and perhaps also physically—combine different sets of enterprise data sources to expose the data services needed by your organization.

Understanding data integration patterns and applying them well can help organizations create an effective data integration strategy. The sections that follow detail these patterns.

Legacy data gateways for microservices

Application architecture evolution has fragmented the backend implementation into independent microservices and functions. However, this evolution has left a gap in how data is handled, because it tends to avoid dealing with static environments.

At the same time, microservices encourage developers to create new polyglot data persistence layers that then need to be composed to deliver business value. How can we apply the knowledge from API gateways to these new data stores?

In this discussion of legacy data, Hugo Guerrero talks about the behavior of data gateways and API gateways, the different data gateway types and their architectures, and the extended data-proxy for hybrid cloud deployments.

Pattern 1. Data consolidation

Data consolidation involves designing and implementing a data integration process that feeds a datastore with complete, enriched data. This approach allows for data restructuring, a reconciliation process, thorough cleansing, and additional steps for aggregation and further enrichment.

Extract, transform, load (ETL)

ETL offloads the transformation of raw data into usable data from the target datastore. This transformation process can become a bottleneck. In cloud computing, ETL brings no added benefit, such as a reduction in target server load.
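
As a rough illustration of the ETL flow, the sketch below cleans records in application code before anything reaches the target. The sample records, the `orders` table, and the validation rules are assumptions made for this example, and an in-memory SQLite database stands in for the target datastore.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": "1", "customer": " Alice ", "amount": "19.99"},
    {"id": "2", "customer": "BOB", "amount": "5.00"},
    {"id": "3", "customer": "", "amount": "not-a-number"},  # fails validation
]


def extract():
    """Extract: read the raw records from the source."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: clean and type-convert rows before they reach the target."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        if not name:
            continue  # drop rows without a customer name
        cleaned.append((int(row["id"]), name, amount))
    return cleaned


def load(conn, rows):
    """Load: write only the already-transformed rows into the target."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


target = sqlite3.connect(":memory:")
load(target, transform(extract()))
etl_rows = target.execute(
    "SELECT id, customer, amount FROM orders ORDER BY id"
).fetchall()
print(etl_rows)
```

Because `transform` runs outside the target, its throughput limits the whole pipeline, which is the bottleneck mentioned above.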

Extract, load, transform (ELT)

ELT is highly scalable: you can store as much data as you need in its raw form and get it to the target quickly. No specialized infrastructure for transformation processes is necessary before the data lands at its destination.
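
A minimal ELT sketch, for contrast: the raw rows are loaded untouched, and the transformation is expressed as SQL that runs inside the target. The table names and sample rows are assumptions for illustration, and SQLite again stands in for the target datastore.

```python
import sqlite3

# Hypothetical raw rows, loaded exactly as received (all text).
RAW_ORDERS = [("1", " Alice ", "19.99"), ("2", "BOB", "5.00"), ("3", "", "oops")]

target = sqlite3.connect(":memory:")

# Load: land the raw data in the target first, with no cleaning.
target.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

# Transform: let the target engine derive the usable table from the raw one.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER)   AS id,
           LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_orders
    WHERE TRIM(customer) <> ''
    """
)
target.commit()

raw_count = target.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
elt_rows = target.execute(
    "SELECT id, customer, amount FROM orders ORDER BY id"
).fetchall()
print(raw_count, elt_rows)
```

The raw table remains available after the transformation, so new derived tables can be built later without extracting the data again.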

Pattern 2. Data federation

Data federation uses a pull approach where data is retrieved from the underlying source systems on-demand. This pattern provides real-time access to data. Data federation creates a virtualized view of the data with no data replication or moving of the source system data.

Composite service

A composite service implements the aggregator pattern. It combines the data from different, distinct services in a meaningful way and serves this response to the consuming application.
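
The aggregator idea can be sketched as follows. The two `fetch_*` functions are hypothetical stand-ins for calls to independent backend services, and the shape of the combined response is an assumption made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_customer(customer_id):
    """Hypothetical stand-in for a call to a customer service."""
    return {"id": customer_id, "name": "Alice"}


def fetch_orders(customer_id):
    """Hypothetical stand-in for a call to an order service."""
    return [{"order_id": 101, "total": 19.99}, {"order_id": 102, "total": 5.00}]


def customer_summary(customer_id):
    """Aggregator: call each service, then merge the results into one response."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The calls are independent, so they can run concurrently.
        customer_future = pool.submit(fetch_customer, customer_id)
        orders_future = pool.submit(fetch_orders, customer_id)
        customer = customer_future.result()
        orders = orders_future.result()
    return {
        "customer": customer["name"],
        "order_count": len(orders),
        "total_spent": round(sum(order["total"] for order in orders), 2),
    }


summary = customer_summary(42)
print(summary)
```

The consuming application makes one call and receives one response; it does not need to know how many services sit behind it.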

Data virtualization or Enterprise Information Integration (EII)

Data virtualization (EII) combines large sets of diverse data sources in a way that makes them appear to a data consumer as a single, uniform data source. It uses data abstraction to provide a common data access layer.
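
A toy version of a common data access layer might look like the sketch below. The CSV text, the `clients` table, and the `VirtualCustomers` class are all hypothetical; real data virtualization products do far more (query pushdown, optimization, security), so treat this only as an illustration of the abstraction.

```python
import csv
import io
import sqlite3

# Source 1: CSV text, a hypothetical export from one system.
CSV_SOURCE = "id,name\n1,Alice\n2,Bob\n"

# Source 2: a relational table in another system, with different column names.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (client_id INTEGER, full_name TEXT)")
db.executemany("INSERT INTO clients VALUES (?, ?)", [(3, "Carol")])


def csv_customers():
    """Adapter: map CSV rows onto the common {id, name} shape."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {"id": int(row["id"]), "name": row["name"]}


def sql_customers():
    """Adapter: map SQL rows onto the common {id, name} shape."""
    query = "SELECT client_id, full_name FROM clients"
    for client_id, full_name in db.execute(query):
        yield {"id": client_id, "name": full_name}


class VirtualCustomers:
    """Common access layer: consumers see a single 'customers' source."""

    def __init__(self, *adapters):
        self._adapters = adapters

    def query(self, predicate=lambda row: True):
        # Rows are read from the sources at query time; nothing is copied
        # into a separate store.
        for adapter in self._adapters:
            for row in adapter():
                if predicate(row):
                    yield row


customers = VirtualCustomers(csv_customers, sql_customers)
all_names = [row["name"] for row in customers.query()]
high_ids = [row["id"] for row in customers.query(lambda row: row["id"] >= 2)]
print(all_names, high_ids)
```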

Pattern 3. Data propagation

Data propagation involves the promotion of data updates on two levels. At the application level, an event in the source application triggers processing in one or more target applications. At the datastore level, an event in the source system triggers updates in the source datastore. These change events are then replicated in near real-time to one or more target datastores.

Enterprise Application Integration (EAI)

EAI is distributed, lightweight, and scalable for elastic operating environments—the integration itself may be deployed as a containerized application.

Enterprise Data Replication (EDR)

In distributed and microservices architectures, replication allows applications to be more reliable.

The data a service needs may be replicated and colocated with that service, and stored in a form that is more usable by it. This reduces overhead and latency.

Red Hat build of Apache Camel

Apache Camel is an open source integration framework that implements Enterprise Integration Patterns (EIPs) with mature and robust ready-to-use building blocks, enabling developers to rapidly create data flows and easily test and maintain them.

Data integration common practices

Change data capture

Change data capture (CDC) detects data change events in a source datastore and triggers an update process in another datastore or system. CDC is usually implemented as trigger-based or log-based. In the trigger-based approach, transaction events are logged in a separate shadow table that can be replayed to copy those events to the target system on a regular basis. Log-based CDC, also known as transaction log tailing, identifies data change events by scanning transaction logs. This approach is often used because it applies to many data change scenarios and, thanks to its minimal overhead, can support systems with extremely high transaction volumes.
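
The trigger-based approach can be sketched with SQLite triggers that write each change to a shadow table, plus a replay step that copies those events to a target. The table names, the `op` codes, and the `replay` function are assumptions for this example; log-based CDC is not shown because it depends on the transaction log format of a specific database.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

    -- Shadow table that records every change event in order.
    CREATE TABLE customers_changes (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, email TEXT
    );

    CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
        INSERT INTO customers_changes (op, id, email)
        VALUES ('I', NEW.id, NEW.email);
    END;
    CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
        INSERT INTO customers_changes (op, id, email)
        VALUES ('U', NEW.id, NEW.email);
    END;
    CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
        INSERT INTO customers_changes (op, id, email)
        VALUES ('D', OLD.id, OLD.email);
    END;
    """
)

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")


def replay(last_seq):
    """Apply change events newer than last_seq to the target.

    Returns the sequence number of the last event applied, so the next
    run can resume from there.
    """
    events = source.execute(
        "SELECT seq, op, id, email FROM customers_changes "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, row_id, email in events:
        if op == "D":
            target.execute("DELETE FROM customers WHERE id = ?", (row_id,))
        else:  # inserts and updates are both applied as an upsert
            target.execute(
                "INSERT OR REPLACE INTO customers (id, email) VALUES (?, ?)",
                (row_id, email),
            )
        last_seq = seq
    target.commit()
    return last_seq


# Ordinary writes on the source; the triggers capture them automatically.
source.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
source.execute("INSERT INTO customers VALUES (2, 'b@example.com')")
source.execute("UPDATE customers SET email = 'a2@example.com' WHERE id = 1")
source.execute("DELETE FROM customers WHERE id = 2")

position = replay(0)
replicated = target.execute(
    "SELECT id, email FROM customers ORDER BY id"
).fetchall()
print(position, replicated)
```

Each captured write costs an extra insert on the source, which is one reason the log-based approach is often preferred for high transaction volumes.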

Event sourcing

Event sourcing is a pattern that makes sure that all changes to an application’s state are stored as a sequence of events. These events can then be used for temporal queries allowing for the reconstruction of past states and activity replay. This pattern is useful for creating audit logs, debugging, and use cases that require the reconstruction of the state at a specific point.
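
A minimal sketch of the idea, using a hypothetical account: state is never stored directly, only the events, and any past balance is rebuilt by replaying the log up to a given point. The `Event` fields and the logical timestamps are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # logical time, assumed to increase with each event
    kind: str       # "deposited" or "withdrawn"
    amount: int


# Append-only log: the single source of truth for the account's state.
event_log = []


def record(event):
    """Store a state change as an event instead of overwriting state."""
    event_log.append(event)


def balance_at(timestamp):
    """Temporal query: rebuild the balance by replaying events up to timestamp."""
    balance = 0
    for event in event_log:
        if event.timestamp > timestamp:
            break  # valid because the log is ordered by timestamp
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
    return balance


record(Event(1, "deposited", 100))
record(Event(2, "withdrawn", 30))
record(Event(3, "deposited", 50))

print(balance_at(3), balance_at(2))
```

The log itself doubles as an audit trail, since every change that led to the current state is still present.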

Streaming data and event stream processing

Event stream processing (ESP) involves taking action on a series of data points that originate from a system that continuously creates data. In this context, an event is a data point in the system and the stream is the continuous delivery of those events. This series of events is also referred to as streaming data. The types of actions that are taken as a result of these events include aggregations, analytics, transformations, enrichment, and ingestion into another datastore.
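
One of the actions listed above, aggregation, can be sketched as a tumbling-window average over a stream. The `sensor_events` generator is a hypothetical source, and the sketch assumes events arrive ordered by timestamp; production stream processors also handle late and out-of-order events, which this example does not.

```python
def sensor_events():
    """Hypothetical source that emits events one at a time."""
    readings = [(0, 20.0), (1, 22.0), (2, 21.0), (5, 30.0), (6, 32.0), (11, 25.0)]
    for timestamp, value in readings:
        yield {"ts": timestamp, "value": value}


def tumbling_average(events, window_size):
    """Yield (window_start, average) each time a fixed-size window closes."""
    current_window = None
    total = 0.0
    count = 0
    for event in events:
        window = event["ts"] // window_size * window_size
        if current_window is not None and window != current_window:
            # The event belongs to a later window, so emit the finished one.
            yield current_window, total / count
            total, count = 0.0, 0
        current_window = window
        total += event["value"]
        count += 1
    if count:
        # Flush the last window when the stream ends.
        yield current_window, total / count


averages = list(tumbling_average(sensor_events(), window_size=5))
print(averages)
```

Because both functions are generators, each event is processed as it arrives and only the running total for the open window is held in memory.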

Distributed caching and in-memory data grids

The concept of caching is to provide storage capacity for data on a system that's used to serve future requests more quickly. Data that's stored in cache is placed there because it's frequently accessed or contains duplicated copies of data stored in another datastore. The overarching goal of caching is to improve performance.
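
The basic mechanism can be sketched with a small read-through cache in front of a slower lookup. This is a single-process illustration only: the class, the backing store, and the counters are assumptions for this example, and a distributed cache or in-memory data grid would additionally spread entries across nodes and deal with expiry and consistency.

```python
class ReadThroughCache:
    """Serve repeated reads from memory; go to the store only on a miss."""

    def __init__(self, load_from_store):
        self._load = load_from_store
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)      # slow path: read from the datastore
        self._entries[key] = value   # keep a copy for future requests
        return value

    def invalidate(self, key):
        """Drop a cached copy, for example after the source data changes."""
        self._entries.pop(key, None)


# Hypothetical backing datastore, with a list that records each read of it.
BACKING_STORE = {"sku-1": "Widget", "sku-2": "Gadget"}
store_reads = []


def slow_lookup(key):
    store_reads.append(key)
    return BACKING_STORE[key]


cache = ReadThroughCache(slow_lookup)
first = cache.get("sku-1")   # miss: read from the store
second = cache.get("sku-1")  # hit: served from memory
print(first, second, store_reads, cache.hits, cache.misses)
```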

Data replication

CDC can be used for data replication to multiple databases, data lakes, or data warehouses, to ensure each resource has the latest version of the data. In this way, CDC can provide multiple distributed and even siloed teams with access to the same up-to-date data.

Auditing

Given today's strict data compliance requirements and heavy penalties for noncompliance, it is essential to keep a history of changes made to your data. CDC can be used to save data changes for auditing or archiving requirements.

Microservice data exchange

CDC can be used to sync microservices with monolithic applications, enabling the seamless transfer of data changes from legacy systems to microservices-based applications.

Mono-to-micro Strangler Pattern

Through an incremental approach, you can take scoped components and move them to a new microservices architecture. Use CDC to stream changes from the monolithic database over to the microservices database and the other way around.

Battle of the in-memory data stores

Have you ever wondered what the relative differences are between two of the more popular open source in-memory data stores and caches? A cache is a smaller, faster memory component inserted between the CPU and the main memory that stores its data on disks for retrieval, while in-memory data stores depend on machine memory to store retrievable data.

In this DevNation Tech Talk, the DevNation team describes those differences and, more importantly, provides live demonstrations of the key capabilities that could have a major impact on your architectural

Data stores vs caches