Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Working with big spatial data workflows (or, what would John Snow do?)

May 5, 2020
Maria Arias de Reyna Dominguez
Related topics:
JavaArtificial intelligenceMicroservices

    With the rise of social networks and people having more free time due to isolation, it has become popular to see lots of maps and graphs. These are made using big spatial data to explain how COVID-19 is expanding, why it is faster in some countries, and how we can stop it.

    Some of these maps and graphs are made by inexperienced amateurs that have access to huge amounts of raw and processed big spatial data. But most of them are not sure how to handle that data. A few unaware amateurs mix different sources without caring about homogenizing the data first. Some others mix old data with new. And finally, most forget to add relevant variables because this is too much data to handle manually.

    How would a professional handle all of this?

    The cholera outbreak

    In situations where we have to handle big spatial data, I can't help but wonder: What would John Snow do? I'm not talking about that warrior in the cold north fighting zombies. I'm talking about the original John Snow, an English doctor from the XIX century that used spatial data to study a cholera outbreak.

    Let's go back to 1854, London, where a cholera outbreak was taking heavy casualties. Most doctors at the time, unaware of germs, thought it was caused by miasma, a kind of bad air that polluted people, making them ill.

    John Snow data analysis

    But John was not convinced by that theory. He had a hypothesis on what the real cause could be, suspecting water-related issues. He collected data on where the people infected lived and where they got their water from and ran some spatial data analysis to prove those ideas. Figure 1 shows one of his original maps.

    Original map by John Snow showing the clusters of cholera cases (indicated by stacked rectangles) in the London epidemic of 1854
    Original map by John Snow showing the clusters of cholera cases in the London epidemic of 1854
    Figure 1: Original map by John Snow showing the clusters of cholera cases in the London epidemic of 1854.">

    With that accurate data, he was able to generate a cluster map showing the spread of the disease. This work helped him prove his theories on cholera's water origin. He had only a few sources of data, but they were all homogeneous. Plus, he was able to collect data directly in the field, making sure it was accurate and met his needs.

    It's important to notice that because he used the right data, he arrived at the right conclusions. He studied the outliers, like those people drinking water from a different source than what should have been the closest to their homes. Thus he was able to conflate the data with the proper sources, curating it. Homogenizing and conflating the sources of data is a relevant step to arrive at the right conclusions.

    John Snow had to manually conflate and analyze all of the data and it was a good choice. The amount of data he handled was fit for working with pen and paper. But in our case, when we try to conflate all the sources available worldwide, what we are really facing is big spatial data, which is impossible to handle manually.

    Big spatial data

    Not only do we have the specific related data, but we also have data about different isolation or social distancing norms, health care, personal savings, access to clean water, diet, population density, population age, and previous health care issues. The amount of related data available is huge.

    Remember, if your data fits into a hard disk, that's hardly big data. We are talking here about the amount of data that calls for unending data storage on server farms. No analyst can update, conflate, and analyze all that data manually. We need tools, good tools, to be able to deliver reliable results.

    Consider that different data collectors update their data in almost real-time but at different rates, and each country has its own statistics and its own way to measure each variable. So, you need to transform and homogenize before conflating those sources.

    How can we keep up-to-date without going crazy? Before you can finish even half of the workflow shown in Figure 2, there's freshly new data waiting for you.

    Update → Homogenize → Conflate → Analyze
    We need to run this workflow continuously
    Figure 2: We need to run this workflow continuously to always use the newest big spatial data available.">

    What would John Snow do? Well, I'm quite sure he would like all of us to use the proper tools for the work. That's why it's called Location Intelligence.

    Middleware to the rescue

    Regarding those four steps, there are three that can be automated: update, homogenize, and conflate. All of those are tedious and repetitive tasks that make developers quickly jump into scripting rough code. And we know what happens when we write quickly supporting code: We tend to make the same mistakes that others already fixed.

    Well, here we are lucky. We have several free and open source software libraries and frameworks that can help us through these tasks. These tools can be found in the Red Hat Fuse Integration Platform.

    Apache Camel

    Our first option should always be using Apache Camel to help us create complex data workflows. With this framework, we can periodically extract the latest data from different sources, transform, and conflate automatically. We can even use Camel K and leave it running on some Kubernetes container while we focus on the non-automatable steps of our work.

    Defining workflows in Camel is easy. You can use different common languages such as Java, Javascript, Groovy, or a specific domain-specific language (DSL). With Camel's hundreds of components, you can feed your workflow with almost any source of data, process the data, and output the processed data in the format your analysis requires.

    Syndesis

    For those data analysts that are less tech-savvy and feel that writing Camel scripts is too complex, we also have Syndesis. With Syndesis you can define data workflows in a more visual way, as you can see in Figure 3.

    Syndesis Integrations Overview
    We can define several processes on Syndesis, each running based on a different trigger.
    Figure 3: We can define several processes on Syndesis, each running based on a different trigger.">

    This means you can update that big spatial data without having to write a single line of code. Or, maybe you just want to speed up the workflow creation process to jump directly into the analysis.

    We can either create one single workflow or break it down to several workflows, as shown in Figure 4. For example, the first process could be triggered by a timer to download different data sources and send that raw data to a Kafka broker. Then, a second process could listen to that broker, transform and homogenize the data previously downloaded, and store it on some common data storage. Finally, a third process can take several sources of data from that common storage with homogenized data, conflate those sources, and prepare the data for further analysis or exposition.

    Syndesis Integration Creation View
    We can easily add steps to the workflow using that plus button.
    Figure 4: We can easily add steps to the workflow using that plus button.">

    Note that each step can filter, transform, and use data from different sources, allowing us to create complex workflows in a simple and visual way. We can run the data through different APIs, XSLT transformations, data mapping, and filters to make sure we end up with data ready for analysis.

    The final touch

    Now that we have our data updated, homogenized, transformed, and conflated, we can start the analysis. As both Camel and Syndesis can provide the output in different formats, we can connect it to any software we need to do this analysis. From databases like PostgreSQL to XML-based data formats like KML, we could feed our analysis tools the way we need.

    For example, we can use QGIS, which is an advanced desktop application for data analysis. You can add all those already transformed and conflated big spatial data sources to QGIS to create beautiful graphs and maps as outputs. After that, you can publish your maps with OpenLayers or Leaflet.

    Make John Snow proud! And do it using free and open source software.

    Last updated: June 26, 2020

    Recent Posts

    • Testing infrastructure red teaming with abliterated models

    • Build an enterprise RAG system with OGX

    • Solutions for SELinux MCS challenges with GitLab runners

    • MCP servers vs. skills: Choosing the right context for your AI

    • How to route external and local LLMs with Models-as-a-Service

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.