How incident detection simplifies OpenShift observability

October 3, 2024
Ivan Necas
Related topics:
Kubernetes, Observability
Related products:
Red Hat OpenShift, Red Hat OpenShift Container Platform


    As the volume of observability signals increases, there is an urgent need to reduce noise in the data and help Red Hat OpenShift users deal with the resulting complexity. Identifying the information that matters can quickly become costly and time-consuming. To address this problem, the Red Hat observability group has launched and is investing in its own troubleshooting journey initiative.

    The observability troubleshooting journey aims to give OpenShift users a systematic approach to identifying and resolving cluster issues, reducing the number of manual steps and the cognitive load these tasks usually require. In short, the initiative consists of a series of analytical tools that aim to reduce the overall mean time to detection (MTTD) and mean time to resolution (MTTR).

    Currently, two troubleshooting features have been released as part of this journey, both as developer previews for OpenShift users: incident detection, which we describe in this article, and observability signal correlation for Red Hat OpenShift (an enhanced developer preview of which was released with version 0.3.0 of the cluster observability operator).

    Incident detection

    From day one, OpenShift 4 has been built with observability in mind. In the distributed world of the Kubernetes ecosystem, this has huge benefits when it comes to troubleshooting and understanding the behavior of the system. Not only do the components provide enough raw observability data, they also define alerts and other indicators of their health.

    However, the distributed nature brings some challenges as well. Especially when an initial issue affects the health of multiple components, one can get overwhelmed by the number of alerts popping up. Figure 1 shows 43 alerts, but there are not really 43 individual issues. Instead, just a few issues eventually led to this alert burst.

    Figure 1: 43 alerts.

    Having seen several of these situations, we at Red Hat started looking into ways to help you get on top of the alerts and direct your attention to the right place. That’s where the idea of incident detection came from.

    The main idea behind incident detection is the observation that if multiple alerts start firing around the same time, they are very likely connected. Of course, while this holds most of the time in the real world, in specific cases it can still be a coincidence, and we try to take this into account as well. We observe the alerts as they arrive and, through various heuristics, assign them to groups that we call incidents.
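
    To make the time-proximity idea concrete, here is a minimal sketch. It is not the heuristic the backend actually uses, which the article does not describe. It assumes a hypothetical Alert record with only a name and a start time, and a hypothetical 5-minute window: an alert joins the current incident if it starts within the window of the previous alert, otherwise it opens a new incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical, simplified alert record (illustration only)."""
    name: str
    starts_at: datetime


def group_by_start_time(alerts, window=timedelta(minutes=5)):
    """Group alerts into incidents by start-time proximity.

    Alerts are sorted by start time. An alert is appended to the current
    incident when it starts no later than `window` after the previous alert;
    otherwise it starts a new incident. The 5-minute default is an assumption.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.starts_at):
        if incidents and alert.starts_at - incidents[-1][-1].starts_at <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


# Example: five alerts that collapse into two incidents.
t0 = datetime(2024, 10, 3, 12, 0)
sample = [
    Alert("KubeNodeNotReady", t0),
    Alert("KubeNodeUnreachable", t0 + timedelta(minutes=2)),
    Alert("TargetDown", t0 + timedelta(minutes=4)),
    Alert("etcdMembersDown", t0 + timedelta(minutes=30)),
    Alert("etcdNoLeader", t0 + timedelta(minutes=31)),
]
for number, incident in enumerate(group_by_start_time(sample), start=1):
    print(number, [a.name for a in incident])
```

    A pure time window like this would also merge unrelated alerts that happen to coincide, which is why the real implementation combines several heuristics.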

    This grouping data is available as Prometheus metrics. As part of the developer preview, we provide a prototype UI to better visualize the concepts (more about installation later). As a result, instead of 43 individual alerts, you get a timeline of four incidents that have affected the cluster. See Figure 2.

    Figure 2: Incidents timeline.
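
    Because the grouping data is exposed as Prometheus metrics, it can also be consumed without the UI. The sketch below parses a response in the standard Prometheus HTTP API instant-query format and collects alert names per incident. The label names group_id and src_alertname, as well as the sample payload, are assumptions made for illustration; check the metrics exposed by your deployment for the actual names.

```python
from collections import defaultdict


def alerts_by_incident(response, incident_label="group_id",
                       alert_label="src_alertname"):
    """Collect alert names per incident from a Prometheus instant-query response.

    `response` is the decoded JSON body returned by /api/v1/query.
    Both label names are hypothetical defaults, not confirmed by the article.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    incidents = defaultdict(set)
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        incident = labels.get(incident_label, "unknown")
        incidents[incident].add(labels.get(alert_label, "unknown"))
    return {incident: sorted(names) for incident, names in incidents.items()}


# Made-up payload in the shape Prometheus returns for an instant vector.
example_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"group_id": "a1", "src_alertname": "KubeNodeNotReady"},
             "value": [1727956800, "1"]},
            {"metric": {"group_id": "a1", "src_alertname": "TargetDown"},
             "value": [1727956800, "2"]},
            {"metric": {"group_id": "b2", "src_alertname": "etcdMembersDown"},
             "value": [1727956800, "2"]},
        ],
    },
}
print(alerts_by_incident(example_response))
```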

    The color coding of the lines in the Figure 2 graph corresponds to severity: in this case, you see the top incident starting as a warning and evolving into a critical state over time.

    By clicking the incident line, you can see the timeline of individual alerts that are part of the incident (Figure 3).

    Figure 3: Incidents timeline of individual alerts.

    The functionality doesn’t stop there. We had numerous discussions to better understand the troubleshooting process, both with subject matter experts in Red Hat (including the SRE team responsible for Red Hat managed OpenShift offerings) and with our customers. One thing these discussions revealed was how crucial it is to know which components were affected during the incident. Not all components are created equal. For example, an affected etcd can have a big impact on many other parts of the system, so it makes sense to focus on it before any other reported issues. We’ve therefore included a mechanism that assigns individual alerts to their corresponding components and ranks those components.
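
    The mapping and ranking step can be sketched as follows. The alert-to-component table and the rank values are purely illustrative assumptions, not the mapping shipped with the backend; the point is only that alerts are bucketed by component and the buckets are ordered so that the most fundamental component comes first.

```python
from collections import defaultdict

# Hypothetical mapping and ranking (lower rank = look at this first).
ALERT_COMPONENT = {
    "etcdMembersDown": "etcd",
    "KubeNodeNotReady": "compute",
    "KubeNodeUnreachable": "compute",
    "TargetDown": "monitoring",
}
COMPONENT_RANK = {"etcd": 0, "compute": 1, "monitoring": 2}


def rank_components(alert_names, alert_component=ALERT_COMPONENT,
                    component_rank=COMPONENT_RANK):
    """Bucket alert names by component and order buckets by component rank.

    Alerts without a known component go to "other"; components without a
    rank sort after all ranked ones, with ties broken alphabetically.
    """
    by_component = defaultdict(list)
    for name in alert_names:
        by_component[alert_component.get(name, "other")].append(name)
    unranked = len(component_rank)
    return sorted(
        by_component.items(),
        key=lambda item: (component_rank.get(item[0], unranked), item[0]),
    )


incident = ["TargetDown", "KubeNodeNotReady", "KubeNodeUnreachable"]
for component, alerts in rank_components(incident):
    print(component, alerts)
```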

    As a result, you get a list of alerts belonging to the incident categorized by the components, with the most important component at the top. See Figure 4.

    Figure 4: List of alerts.

    We see that, although there is a critical alert in the monitoring component, an issue in the compute layer is ranked at the top, and it’s suggested to check it before the rest. Expanding the compute component reveals information about the alerts related to the underlying node infrastructure, which is the root cause of the rest of the alerts in the incident. You can then follow the link to the alert details to continue troubleshooting the issue, as shown in Figure 5.

    Figure 5: Information about the alerts.

    Installation

    Disclaimer

    This feature is a developer preview only. Consult the developer preview support statement to learn more.

    Currently, the incident detection functionality is available as a developer preview. We provide the source code, as well as pre-built container images, with deployment manifests to get everything up and running in your environment.

    1. Clone the Git repository with the back-end code that also includes the deployment manifests:

      git clone https://github.com/openshift/cluster-health-analyzer.git -b dev-preview
      cd cluster-health-analyzer
    2. Apply the manifests against your cluster:

      oc apply -f manifests/backend -f manifests/frontend
    3. Once the deployment is complete, the incidents UI prototype should be available here (update according to your domain). Alternatively, you can go to the cluster-health-analyzer project in the OpenShift console, find the corresponding Route, and click the link in the Location row to open up the UI.

      The incident data should start to populate after the installation. We advise keeping it running for a few hours so that there is enough data to explore.

    The deployment scripts will: 

    • Deploy the backend, which is responsible for both incident detection and component mapping.
    • Configure the monitoring stack to scrape the data from the backend to make it available via Prometheus.
      • Note that as of the current version, this requires user-workload monitoring to be enabled, which is done as part of the provided manifests. Future versions are expected to rely solely on the in-cluster platform monitoring stack.
    • Deploy the frontend UI prototype and make it available (behind an authentication screen).
      • The user needs to have access to in-cluster monitoring (e.g., via binding to cluster-monitoring-view).
      • In the future, the experience is expected to be provided directly in the OpenShift web console. For now, we’re providing the functionality as a standalone web application. 

    What’s next?

    The main purpose of releasing a developer preview of incident detection is to give you early access to the functionality and to gather feedback that we can incorporate into further development.

    As for the functionality itself, besides working on the productization efforts, we would like to look into making similar functionality available in the multi-cluster scenario with Red Hat Advanced Cluster Management for Kubernetes, integrating it more with the rest of the ecosystem, and providing more troubleshooting capabilities for the incidents by leveraging the underlying observability stack.

    Do you want to engage with us and provide feedback? Do so in our feedback form.
