5 steps to build a self-healing server with Alertmanager

October 4, 2023
Nagesh Rathod
Related topics:
Containers, Event-Driven, Security
Related products:
Red Hat Ansible Automation Platform

    In today's fast-paced world, server downtime can have severe consequences for businesses. Ensuring high availability and rapid recovery is essential for maintaining uninterrupted services. In this article, we will explore how to create a self-healing server using the event-driven architecture of Red Hat Ansible Automation Platform and integrate it with Alertmanager for efficient monitoring and alerting.

    Prerequisites

    • Ansible Automation Platform and ansible-rulebook installed.
    • Podman and podman-compose installed.
    • The following ports open on the server: 22 (SSH), 5000 (Event-Driven Ansible webhook), 9090 (Prometheus), and 9093 (Alertmanager).

    The concepts of event-driven and self-healing

    The event-driven architecture of Ansible Automation Platform enables servers to respond to events and take predefined actions automatically. It utilizes event-driven automation and monitoring to detect and remediate issues in real time, leading to a self-healing infrastructure.

    To learn more about the concept of Event-Driven Ansible, please read my previous article. You can pull the code from our GitHub repository.

    1. Install Prometheus and Alertmanager

    Install Prometheus and Alertmanager by launching them as containers with podman-compose, using the following podman-compose.yaml file.

    version: '3'
    
    services:
      prometheus:
        image: prom/prometheus:v2.30.3
        ports:
          - 9090:9090
        volumes:
          - ./prometheus:/etc/prometheus
          - prometheus-data:/prometheus
        command: --web.enable-lifecycle  --config.file=/etc/prometheus/prometheus.yml
    
      alertmanager:
        image: prom/alertmanager:v0.23.0
        restart: unless-stopped
        ports:
          - "9093:9093"
        volumes:
          - "./alertmanager:/config"
          - alertmanager-data:/data
        command: --config.file=/config/alertmanager.yml --log.level=debug
    
    volumes:
      alertmanager-data:
    
      prometheus-data:

    Configure Alertmanager to forward alerts to Event-Driven Ansible in the receivers section of the alertmanager.yml file. In the webhook_configs section, set the URL to the IP address of the server where the rulebook runs. For instance, if you are running rulebooks on your local machine, the URL should be http://192.168.1.65:5000/alerts; if you are using a remote server, use that server's public IP instead, in the form http://<public-IP>:5000/alerts.

    alertmanager.yml:

    route:
      group_by: [ alertname ]
      receiver: 'EDA' # default receiver
      repeat_interval: 24h
      routes:
    
    receivers:
      - name: 'EDA'
        webhook_configs:
          - url: 'http://172.123.170.87:5000/alerts'

    To start the containers, use the following command:

    podman-compose up -d

    Check that Alertmanager and Prometheus are running:

    podman ps
    CONTAINER ID   IMAGE                       COMMAND                  CREATED          STATUS          PORTS                                       NAMES
    254000d2a108   prom/alertmanager:v0.23.0   "/bin/alertmanager -..."   15 seconds ago   Up 13 seconds   0.0.0.0:9093->9093/tcp, :::9093->9093/tcp   self-healing-server_alertmanager_1
    277f1c6da0cd   prom/prometheus:v2.30.3     "/bin/prometheus --w..."   15 seconds ago   Up 14 seconds   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp   self-healing-server_prometheus_1

    Open http://192.168.1.22:9090 in a browser and check that Prometheus is up and running. Then open the Alertmanager dashboard at http://192.168.1.22:9093.
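
    The two checks above can also be scripted. The following is a minimal sketch, assuming the example address 192.168.1.22 used in this article and the /-/healthy endpoint that both Prometheus and Alertmanager expose. The function names health_url and is_healthy are illustrative, not part of either tool.

```python
import urllib.error
import urllib.request

# Host taken from the article's example; replace it with your server's address.
HOST = "192.168.1.22"

# Ports published by the podman-compose file.
SERVICES = {"prometheus": 9090, "alertmanager": 9093}


def health_url(host: str, port: int) -> str:
    """Build the health-check URL for a service listening on host:port."""
    return f"http://{host}:{port}/-/healthy"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx answer.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        url = health_url(HOST, port)
        state = "up" if is_healthy(url) else "DOWN"
        print(f"{name:<13} {url}  {state}")
```

    Run it from any machine that can reach the server. A service reported as DOWN usually means the container is not running or the port is blocked.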

    2. Write the rulebook

    Every rulebook follows the same basic pattern: source, rule, action. The following rulebook is no exception. For the self-healing use case, each rule has a condition that triggers a remediation playbook when a specific alert arrives.

    ---
    - name: Automatic Remediation of a webserver
      hosts: localhost
      sources:
        - name: listen for alerts
          ansible.eda.alertmanager:
            host: 0.0.0.0
            port: 5000
      rules:
        - name: server down
          condition: event.alert.labels.job == "server" and event.alert.status == "firing"
          action:
            run_playbook:
              name: remediation-playbooks/server-playbook.yml
    
        - name: Storage full on server
          condition: event.alert.labels.job == "storage" and event.alert.status == "firing"
          action:
            run_playbook:
              name: remediation-playbooks/storage-playbook.yml
    
        - name: memory full on server
          condition: event.alert.labels.job == "memory" and event.alert.status == "firing"
          action:
            run_playbook:
              name: remediation-playbooks/memory-playbook.yml
    
        - name: ssh server down
          condition: event.alert.labels.job == "ssh" and event.alert.status == "firing"
          action:
            run_playbook:
              name: remediation-playbooks/ssh-playbook.yml
    
        - name: CPU full on server
          condition: event.alert.labels.job == "cpu" and event.alert.status == "firing"
          action:
            run_playbook:
              name: remediation-playbooks/cpu-playbook.yml

    For a self-healing server, we must list every scenario in which we anticipate the server running into trouble, such as full storage or exhausted memory.

    For each scenario, we then work out the remediation and create an Ansible Playbook that resolves the issue without manual intervention whenever the corresponding alert fires.

    Create the inventory file with localhost as host:

    localhost 

    3. Run Ansible Rulebook

    Use the ansible-rulebook command to run the rulebook:

    ansible-rulebook --rulebook ansible-rulebook.yaml -i inventory -v
    05:13:46,294 - ansible_rulebook.app - INFO - Starting sources
    05:13:46,294 - ansible_rulebook.app - INFO - Starting rules
    05:13:46,294 - ansible_rulebook.engine - INFO - run_ruleset
    05:13:47 496 [main] INFO org.drools.ansible.rulebook.integration.api.rulesengine.AbstractRulesEvaluator - Start automatic pseudo clock with a tick every 100 milliseconds
    05:13:48,402 - ansible_rulebook.engine - INFO - load source filters
    05:13:48,403 - ansible_rulebook.engine - INFO - loading eda.builtin.insert_meta_info
    05:13:48,887 - ansible_rulebook.engine - INFO - Calling main in ansible.eda.alertmanager
    05:13:48,890 - ansible_rulebook.engine - INFO - Waiting for all ruleset tasks to end
    05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for actions on events from Automatic Remediation of a webserver
    05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for events, ruleset: Automatic Remediation of a webserver
    05:13:48 891 [drools-async-evaluator-thread] INFO org.drools.ansible.rulebook.integration.api.io.RuleExecutorChannel - Async channel connected

    The rulebook now waits for Alertmanager to send an alert. A rule triggers only when the alert's status is firing. The conditions matter here: each one joins two checks with the and operator. The first check matches the job label, and the second matches the firing status.

    We can assign different job labels to different applications, so that each alert triggers the rule, and therefore the playbook, that belongs to it.
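
    To make the matching logic concrete, here is a simplified Python model of it. This is a sketch, not the implementation of ansible-rulebook or the ansible.eda.alertmanager plugin. It assumes that each entry in the alerts list of the Alertmanager webhook payload reaches the rules as event.alert, which is what the conditions in the rulebook rely on. The sample payload and the function names are made up for illustration.

```python
from typing import Iterator, Optional

# Job label -> remediation playbook, copied from the rulebook in step 2.
PLAYBOOKS = {
    "server": "remediation-playbooks/server-playbook.yml",
    "storage": "remediation-playbooks/storage-playbook.yml",
    "memory": "remediation-playbooks/memory-playbook.yml",
    "ssh": "remediation-playbooks/ssh-playbook.yml",
    "cpu": "remediation-playbooks/cpu-playbook.yml",
}

# Hypothetical, trimmed-down webhook payload with two alerts.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "DiskFull", "job": "storage"}},
        {"status": "resolved", "labels": {"alertname": "HighCpu", "job": "cpu"}},
    ],
}


def events_from_payload(payload: dict) -> Iterator[dict]:
    """Yield one event per alert, with the alert stored under the "alert" key."""
    for alert in payload.get("alerts", []):
        yield {"alert": alert}


def matching_playbook(event: dict) -> Optional[str]:
    """Return the playbook a rule would run for this event, or None."""
    alert = event["alert"]
    job = alert.get("labels", {}).get("job")
    # Both checks must hold, mirroring
    # `event.alert.labels.job == "..." and event.alert.status == "firing"`.
    if job in PLAYBOOKS and alert.get("status") == "firing":
        return PLAYBOOKS[job]
    return None


if __name__ == "__main__":
    for event in events_from_payload(SAMPLE_PAYLOAD):
        name = event["alert"]["labels"]["alertname"]
        print(name, "->", matching_playbook(event))
```

    In this model the resolved alert matches no rule even though its job label is known, because the status check fails. That is why the real rulebook stays idle until an alert is actually firing.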

    4. Trigger the alerts of Alertmanager

    When an alert condition is met, the status of the alert changes to firing, as shown in Figure 1. By changing the value of the job label, we can target a different application and trigger its specific remediation playbook.

    Figure 1: The Alertmanager rules dashboard showing triggered rules.
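
    If you do not want to wait for a real failure, one option is to push a synthetic alert into Alertmanager through its v2 HTTP API and let it forward the alert to the rulebook. The sketch below assumes the example Alertmanager address 192.168.1.22:9093 from step 1. The alert name SyntheticTest and the helper functions are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Alertmanager address from the article's example; adjust it for your setup.
ALERTMANAGER_URL = "http://192.168.1.22:9093/api/v2/alerts"


def build_alerts(job: str, alertname: str = "SyntheticTest") -> list:
    """Build a one-element alert list in the shape the v2 API accepts."""
    return [
        {
            # The job label is what the rulebook conditions match on.
            "labels": {"alertname": alertname, "job": job},
            "annotations": {"summary": f"Synthetic alert for job {job}"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


def post_alerts(alerts: list, url: str = ALERTMANAGER_URL) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=3) as response:
        return response.status


if __name__ == "__main__":
    try:
        status = post_alerts(build_alerts("storage"))
        print("Alertmanager answered with HTTP", status)
    except OSError as exc:
        print("Could not reach Alertmanager:", exc)
```

    Alertmanager does not forward the alert immediately. It first waits for its grouping delay, so expect a short pause before the POST /alerts request shows up in the rulebook logs.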

    Go back to the Ansible Rulebook CLI terminal where the rulebook was run. The logs show that the rule was triggered and the remediation playbook was run.

    2023-06-20 05:26:46,907 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"
    2023-06-20 05:26:46,935 - ansible_rulebook.rule_generator - INFO - calling restart web server
    2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - call_action run_playbook
    2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - substitute_variables
    2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - action args: {'name': 'say-what.yml'}
    2023-06-20 05:26:46,938 - ansible_rulebook.builtin - INFO - running Ansible playbook: say-what.yml
    2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - ruleset: Automatic Remediation of a webserver, rule: restart web server
    2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - Calling Ansible runner
    2023-06-20 05:26:46,943 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"

    5. Containerize Event-Driven Ansible functionality

    You can also run Event-Driven Ansible in a container, using the following Containerfile:

    FROM registry.access.redhat.com/ubi9-minimal
    # Java is required by the Drools rules engine that ansible-rulebook uses.
    RUN microdnf install java-17 python3 gcc python3-devel -y && microdnf clean all && python3 -m ensurepip --upgrade && pip3 install ansible ansible-rulebook asyncio aiokafka aiohttp aiosignal
    ENV JAVA_HOME="/usr/lib/jvm/jre-17"
    RUN mkdir /eda-ansible
    RUN ansible-galaxy collection install ansible.eda
    WORKDIR /eda-ansible
    # Copy the rulebook, inventory, and playbooks into the image.
    COPY . /eda-ansible
    CMD ansible-rulebook -i inventory --rulebook ansible-rulebook.yaml  --verbose

    Add an Event-Driven Ansible service to the podman-compose file. podman-compose will build the image from the Containerfile, create the container, and run it.

    version: '3'
    services:
      event-driven:
        build: .
        ports:
          - "5000:5000"
        depends_on:
          - prometheus
          - alertmanager

    To run Prometheus, Alertmanager, and the event-driven rulebook, use the following:

    podman-compose up -d

    You will get the same results.

    Continue your automation journey with Ansible Automation Platform

    Get started with Ansible Automation Platform by exploring interactive hands-on labs. Download Ansible Automation Platform at no cost and begin your automation journey.
