Working with big spatial data workflows (or, what would John Snow do?)

May 5, 2020
Maria Arias de Reyna Dominguez
Related topics: Java, Artificial intelligence, Microservices


With the rise of social networks, and with people having more free time due to isolation, maps and graphs have become a popular sight. These are made using big spatial data to explain how COVID-19 is spreading, why it spreads faster in some countries, and how we can stop it.

Some of these maps and graphs are made by inexperienced amateurs who have access to huge amounts of raw and processed big spatial data but are not sure how to handle it. A few unaware amateurs mix different sources without homogenizing the data first. Others mix old data with new. And most forget to add relevant variables, because there is too much data to handle manually.

    How would a professional handle all of this?

    The cholera outbreak

In situations where we have to handle big spatial data, I can't help but wonder: What would John Snow do? I'm not talking about that warrior in the cold north fighting zombies. I'm talking about the original John Snow, an English doctor from the 19th century who used spatial data to study a cholera outbreak.

Let's go back to 1854, London, where a cholera outbreak was causing heavy casualties. Most doctors at the time, unaware of germs, thought it was caused by miasma: a kind of bad air that polluted people, making them ill.

    John Snow data analysis

But John was not convinced by that theory. He had a hypothesis about the real cause, suspecting the water supply. He collected data on where the infected people lived and where they got their water, then ran some spatial data analysis to prove those ideas. Figure 1 shows one of his original maps.

Figure 1: Original map by John Snow showing the clusters of cholera cases (indicated by stacked rectangles) in the London epidemic of 1854.

With that accurate data, he was able to generate a cluster map showing the spread of the disease. This work helped him prove his theory of cholera's waterborne origin. He had only a few sources of data, but they were all homogeneous. Plus, he was able to collect data directly in the field, making sure it was accurate and met his needs.

It's important to note that because he used the right data, he arrived at the right conclusions. He studied the outliers, like those people drinking water from a source other than the one closest to their homes. Thus he was able to conflate the data with the proper sources, curating it. Homogenizing and conflating the data sources is a crucial step on the way to the right conclusions.

John Snow had to conflate and analyze all of his data manually, and that was a good choice: the amount of data he handled was fit for working with pen and paper. But in our case, when we try to conflate all the sources available worldwide, what we are really facing is big spatial data, which is impossible to handle manually.

    Big spatial data

Not only do we have the directly related data, but we also have data about isolation and social distancing norms, health care, personal savings, access to clean water, diet, population density, population age, and preexisting health issues. The amount of related data available is huge.

    Remember, if your data fits into a hard disk, that's hardly big data. We are talking here about the amount of data that calls for unending data storage on server farms. No analyst can update, conflate, and analyze all that data manually. We need tools, good tools, to be able to deliver reliable results.

Consider that different data collectors update their data in near real time but at different rates, and each country has its own statistics and its own way of measuring each variable. So you need to transform and homogenize the data before conflating those sources.
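
To make the idea concrete, here is a minimal Java sketch of a homogenization step. Everything in it, the record shape, the field names, and the input formats of the two imaginary country feeds, is an assumption for illustration; the point is that every source gets mapped onto one shared shape (ISO dates, cases per 100,000 inhabitants) before any conflation happens:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// One shared shape for every source: region name, ISO date, and cases
// per 100,000 inhabitants. Field names and input formats are invented
// for illustration.
public record CaseReport(String region, LocalDate reported, double casesPer100k) {

    // Imaginary country A publishes dd/MM/yyyy dates and absolute counts.
    static CaseReport fromCountryA(String region, String date, long cases, long population) {
        LocalDate reported = LocalDate.parse(date, DateTimeFormatter.ofPattern("dd/MM/yyyy"));
        return new CaseReport(region, reported, cases * 100_000.0 / population);
    }

    // Imaginary country B already publishes ISO dates and a per-100k rate.
    static CaseReport fromCountryB(String region, String isoDate, double per100k) {
        return new CaseReport(region, LocalDate.parse(isoDate), per100k);
    }
}
```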

How can we keep up to date without going crazy? Before you can finish even half of the workflow shown in Figure 2, there is fresh data already waiting for you.

Update → Homogenize → Conflate → Analyze
Figure 2: We need to run this workflow continuously to always use the newest big spatial data available.

What would John Snow do? Well, I'm quite sure he would want all of us to use the proper tools for the job. That's why it's called Location Intelligence.

    Middleware to the rescue

Of those four steps, three can be automated: update, homogenize, and conflate. All of them are tedious and repetitive tasks that push developers to quickly script rough code. And we know what happens when we write supporting code in a hurry: we tend to make the same mistakes that others have already fixed.

Well, here we are lucky. We have several free and open source libraries and frameworks that can help us through these tasks, and they can be found in the Red Hat Fuse integration platform.

    Apache Camel

Our first option should always be Apache Camel, which helps us create complex data workflows. With this framework, we can periodically extract the latest data from different sources, then transform and conflate it automatically. We can even use Camel K and leave it running in a Kubernetes cluster while we focus on the non-automatable steps of our work.

Defining workflows in Camel is easy. You can use common languages such as Java, JavaScript, or Groovy, or Camel's own domain-specific language (DSL). With Camel's hundreds of components, you can feed your workflow with almost any source of data, process the data, and output it in the format your analysis requires.
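
As a taste of the Java DSL, here is a minimal sketch of the update step: a timer periodically pulls a CSV source and stores it as JSON for the later steps. The source URL, schedule, and output path are assumptions for illustration, and the route relies on the camel-http, camel-csv, and camel-jackson components being on the classpath:

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class UpdateRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Every hour, fetch the (hypothetical) upstream CSV, parse it,
        // and store it as JSON for the homogenize/conflate steps.
        from("timer:update?period=3600000")
            .to("https://example.org/covid/cases.csv")
            .unmarshal().csv()
            .marshal().json()
            .to("file:data/raw?fileName=cases-${date:now:yyyyMMddHHmm}.json");
    }

    public static void main(String[] args) throws Exception {
        // Run the route standalone with Camel Main.
        Main main = new Main();
        main.configure().addRoutesBuilder(new UpdateRoute());
        main.run(args);
    }
}
```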

    Syndesis

For data analysts who are less tech-savvy and feel that writing Camel routes is too complex, we also have Syndesis. With Syndesis, you can define data workflows in a more visual way, as you can see in Figure 3.

Figure 3: We can define several processes in Syndesis, each running based on a different trigger.

    This means you can update that big spatial data without having to write a single line of code. Or, maybe you just want to speed up the workflow creation process to jump directly into the analysis.

We can either create one single workflow or break it down into several workflows, as shown in Figure 4. For example, the first process could be triggered by a timer to download different data sources and send the raw data to a Kafka broker. A second process could then listen to that broker, transform and homogenize the downloaded data, and store it in some common data storage. Finally, a third process could take several sources from that common storage of homogenized data, conflate them, and prepare the data for further analysis or publication.

Figure 4: We can easily add steps to the workflow using the plus button.

    Note that each step can filter, transform, and use data from different sources, allowing us to create complex workflows in a simple and visual way. We can run the data through different APIs, XSLT transformations, data mapping, and filters to make sure we end up with data ready for analysis.
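
Since Syndesis runs Camel integrations under the hood, the three-stage split described above can also be expressed directly in Camel. The following is a rough sketch under assumed names: a local Kafka broker, a raw-data topic, and plain directories standing in for the common data storage:

```java
import org.apache.camel.builder.AggregationStrategies;
import org.apache.camel.builder.RouteBuilder;

public class PipelineRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Stage 1: a timer triggers the download of a (hypothetical)
        // source; the raw payload is published to a Kafka topic.
        from("timer:download?period=3600000")
            .to("https://example.org/stats/regional.csv")
            .to("kafka:raw-data?brokers=localhost:9092");

        // Stage 2: listen on the topic, homogenize the data (here simply
        // CSV to a common JSON shape), and store it in shared storage.
        from("kafka:raw-data?brokers=localhost:9092")
            .unmarshal().csv()
            .marshal().json()
            .to("file:data/homogenized");

        // Stage 3: conflate the homogenized sources and leave the result
        // ready for analysis. Waiting for three files stands in for
        // "all sources collected".
        from("file:data/homogenized")
            .aggregate(constant(true), AggregationStrategies.groupedBody())
                .completionSize(3)
            .marshal().json()
            .to("file:data/conflated?fileName=combined.json");
    }
}
```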

    The final touch

Now that we have our data updated, homogenized, transformed, and conflated, we can start the analysis. Because both Camel and Syndesis can provide output in different formats, we can connect them to whatever software we need for this analysis. From databases like PostgreSQL to XML-based data formats like KML, we can feed our analysis tools the way we need.
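
For the database case, a minimal sketch might look like the route below. It assumes (hypothetically) that the conflated records are JSON maps with region, reported, and casesPer100k keys, that a matching cases table already exists, and that a javax.sql.DataSource named spatialDb is bound in the Camel registry for the camel-sql component:

```java
import org.apache.camel.builder.RouteBuilder;

public class LoadRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Read the conflated JSON, split it into individual records
        // (each a Map), and insert them into PostgreSQL. The :#name
        // parameters are resolved from each record's keys.
        from("file:data/conflated")
            .unmarshal().json()
            .split(body())
            .to("sql:INSERT INTO cases (region, reported, cases_per_100k) "
                + "VALUES (:#region, :#reported, :#casesPer100k)"
                + "?dataSource=#spatialDb");
    }
}
```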

For example, we can use QGIS, an advanced desktop application for geospatial data analysis. You can add all those already transformed and conflated big spatial data sources to QGIS to create beautiful graphs and maps as outputs. After that, you can publish your maps with OpenLayers or Leaflet.

    Make John Snow proud! And do it using free and open source software.

    Last updated: June 26, 2020
