5 steps to build a self-healing server with Alertmanager

In today's fast-paced world, server downtime can have severe consequences for businesses. Ensuring high availability and rapid recovery is essential for maintaining uninterrupted services. In this article, we will explore how to create a self-healing server using the event-driven architecture of Red Hat Ansible Automation Platform and integrate it with Alertmanager for efficient monitoring and alerting.

Prerequisites

Ansible Automation Platform and ansible-rulebook installed.
Podman and podman-compose installed.
The following ports open on the server: 5000 (rulebook webhook listener), 9090 (Prometheus), 9093 (Alertmanager), and 22 (SSH).

The concepts of event driven and self healing

The event-driven architecture of Ansible Automation Platform enables servers to respond to events and take predefined actions automatically. It utilizes event-driven automation and monitoring to detect and remediate issues in real time, leading to a self-healing infrastructure.

To learn more about the concept of Event-Driven Ansible, please read my previous article. You can pull the code from our GitHub repository.

1. Install Prometheus and Alertmanager

Use podman-compose to run Prometheus and Alertmanager as containers, with the following podman-compose.yaml file:

version: '3'

services:
  prometheus:
    image: prom/prometheus:v2.30.3
    ports:
      - 9090:9090
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command: --web.enable-lifecycle  --config.file=/etc/prometheus/prometheus.yml

  alertmanager:
    image: prom/alertmanager:v0.23.0
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - "./alertmanager:/config"
      - alertmanager-data:/data
    command: --config.file=/config/alertmanager.yml --log.level=debug

volumes:
  alertmanager-data:

  prometheus-data:

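The compose file mounts ./prometheus into the Prometheus container and points --config.file at /etc/prometheus/prometheus.yml, but the contents of that file are not shown here. A minimal sketch might look like the following. The scrape target (a node_exporter on port 9100 of 192.168.1.22), the intervals, and the alert-rules.yml file name are assumptions for illustration, not taken from the repository. The alertmanager:9093 target assumes that the containers can resolve each other by compose service name.

```yaml
# ./prometheus/prometheus.yml (illustrative sketch; adjust targets to your environment)
global:
  scrape_interval: 15s       # how often to scrape targets (assumed value)
  evaluation_interval: 15s   # how often to evaluate alerting rules (assumed value)

# Alerting rules are loaded from this path inside the container.
# The file name is hypothetical; create it under ./prometheus on the host.
rule_files:
  - /etc/prometheus/alert-rules.yml

# Send fired alerts to the Alertmanager service defined in the compose file.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # The job name becomes the "job" label on alerts, which the rulebook
  # in step 2 matches on (for example, job == "server").
  - job_name: server
    static_configs:
      - targets: ['192.168.1.22:9100']   # assumed node_exporter endpoint
```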

Configure Alertmanager to forward alerts to Event-Driven Ansible in the receivers section of the alertmanager.yml file. Under webhook_configs, set the URL to the IP address of the server where the rulebook runs. For instance, if you run rulebooks on your local machine, the URL would be http://192.168.1.65:5000/alerts; if you use a remote server, use the public IP of that server, for example http://123.345.9.56:5000/alerts.

alertmanager.yml:

route:
  group_by: [ alertname ]
  receiver: 'EDA' # default receiver
  repeat_interval: 24h
  routes:

receivers:
  - name: 'EDA'
    webhook_configs:
      - url: 'http://172.123.170.87:5000/alerts'


To start the containers, use the following command:

podman-compose up -d


Check that Alertmanager and Prometheus are running:

podman ps


CONTAINER ID   IMAGE                       COMMAND                  CREATED          STATUS          PORTS                                       NAMES
254000d2a108   prom/alertmanager:v0.23.0   "/bin/alertmanager -..."   15 seconds ago   Up 13 seconds   0.0.0.0:9093->9093/tcp, :::9093->9093/tcp   self-healing-server_alertmanager_1
277f1c6da0cd   prom/prometheus:v2.30.3     "/bin/prometheus --w..."   15 seconds ago   Up 14 seconds   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp   self-healing-server_prometheus_1


Open http://192.168.1.22:9090 in a browser to confirm that Prometheus is up and running, then check the Alertmanager dashboard at http://192.168.1.22:9093.

2. Write the rulebook

Every rulebook follows the same structure: source, rule, action. The following rulebook listens for Alertmanager webhooks on port 5000 (the source), and each rule pairs a condition with the remediation playbook to run when that condition matches (the action).

---
- name: Automatic Remediation of a webserver
  hosts: localhost
  sources:
    - name: listen for alerts
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: server down
      condition: event.alert.labels.job == "server" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/server-playbook.yml

    - name: Storage full on server
      condition: event.alert.labels.job == "storage" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/storage-playbook.yml

    - name: memory full on server
      condition: event.alert.labels.job == "memory" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/memory-playbook.yml

    - name: ssh server down
      condition: event.alert.labels.job == "ssh" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/ssh-playbook.yml

    - name: CPU full on server
      condition: event.alert.labels.job == "cpu" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/cpu-playbook.yml


To build a self-healing server, first list the conditions under which you expect the server to run into trouble, such as full storage or exhausted memory.

Then, for each condition, work out the remediation and write an Ansible Playbook that resolves the issue without manual intervention once the alert fires.
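
The remediation playbooks themselves are not listed in this article. As a rough illustration, the playbook behind the server down rule could restart a web server service and then check that it responds. The httpd service name, the health-check URL, and the retry values are assumptions; substitute whatever fix applies to your application.

```yaml
---
# remediation-playbooks/server-playbook.yml (illustrative sketch)
- name: Remediate a web server that is down
  hosts: localhost          # the rulebook is run against a localhost inventory
  connection: local
  gather_facts: false
  become: true              # restarting a system service usually needs root
  tasks:
    - name: Restart the web server service
      ansible.builtin.service:
        name: httpd         # assumed service name
        state: restarted

    - name: Confirm that the web server answers again
      ansible.builtin.uri:
        url: http://localhost:80/   # assumed health-check URL
        status_code: 200
      register: health_check
      retries: 5                    # assumed retry budget
      delay: 3
      until: health_check.status == 200
```

Keeping each playbook small and idempotent makes it safer to run repeatedly if the same alert is delivered more than once.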

Create the inventory file with localhost as host:

localhost


3. Run Ansible Rulebook

Use the ansible-rulebook command to run the rulebook:

ansible-rulebook --rulebook ansible-rulebook.yaml -i inventory -v


05:13:46,294 - ansible_rulebook.app - INFO - Starting sources
05:13:46,294 - ansible_rulebook.app - INFO - Starting rules
05:13:46,294 - ansible_rulebook.engine - INFO - run_ruleset
05:13:47 496 [main] INFO org.drools.ansible.rulebook.integration.api.rulesengine.AbstractRulesEvaluator - Start automatic pseudo clock with a tick every 100 milliseconds
05:13:48,402 - ansible_rulebook.engine - INFO - load source filters
05:13:48,403 - ansible_rulebook.engine - INFO - loading eda.builtin.insert_meta_info
05:13:48,887 - ansible_rulebook.engine - INFO - Calling main in ansible.eda.alertmanager
05:13:48,890 - ansible_rulebook.engine - INFO - Waiting for all ruleset tasks to end
05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for actions on events from Automatic Remediation of a webserver
05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for events, ruleset: Automatic Remediation of a webserver
05:13:48 891 [drools-async-evaluator-thread] INFO org.drools.ansible.rulebook.integration.api.io.RuleExecutorChannel - Async channel connected


The rulebook now waits for Alertmanager to send an alert. A rule triggers only when both parts of its condition are true: the alert status is firing, and the job label matches. The two checks are combined with the and operator.

By assigning a different job label to each application, you can trigger a different remediation playbook for each alert.
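
On the Prometheus side, the job label typically comes from the scrape job name, and an alerting rule decides when the alert fires. The sketch below assumes a scrape job named server and uses the built-in up metric. The file name, the alert name, the 1m hold time, and the severity label are hypothetical choices, not part of the original setup.

```yaml
# ./prometheus/alert-rules.yml (illustrative sketch; must be listed under rule_files in prometheus.yml)
groups:
  - name: self-healing
    rules:
      - alert: ServerDown
        # up is 0 when Prometheus cannot scrape the target.
        # The result keeps the job="server" label, so the alert that reaches
        # the rulebook satisfies event.alert.labels.job == "server".
        expr: up{job="server"} == 0
        for: 1m                     # assumed: wait one minute before firing
        labels:
          severity: critical        # assumed extra label
        annotations:
          summary: "Target {{ $labels.instance }} is unreachable"
```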

4. Trigger the alerts of Alertmanager

When an alert condition is met, the status of the alert changes to firing, as shown in Figure 1. Changing the value of the job label selects which application's remediation playbook is triggered.

Figure 1: The Alertmanager rules dashboard showing triggered rules.

Go back to the Ansible Rulebook CLI terminal where the rulebook was run. The logs show that the rule was triggered and the remediation playbook was run.

2023-06-20 05:26:46,907 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"
2023-06-20 05:26:46,935 - ansible_rulebook.rule_generator - INFO - calling restart web server
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - call_action run_playbook
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - substitute_variables
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - action args: {'name': 'say-what.yml'}
2023-06-20 05:26:46,938 - ansible_rulebook.builtin - INFO - running Ansible playbook: say-what.yml
2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - ruleset: Automatic Remediation of a webserver, rule: restart web server
2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - Calling Ansible runner
2023-06-20 05:26:46,943 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"


5. Containerize Event-Driven Ansible functionality

You can also run Event-Driven Ansible in a container, built from the following Containerfile:

FROM registry.access.redhat.com/ubi9-minimal
RUN microdnf install java-17 python3 gcc python3-devel -y && microdnf clean all && python3 -m ensurepip --upgrade && pip3 install ansible ansible-rulebook asyncio aiokafka aiohttp aiosignal
ENV JAVA_HOME="/usr/lib/jvm/jre-17"
RUN mkdir /eda-ansible
RUN ansible-galaxy collection install ansible.eda
WORKDIR /eda-ansible
COPY . /eda-ansible
CMD ansible-rulebook -i inventory --rulebook ansible-rulebook.yaml  --verbose


Add Event-Driven Ansible as a service in the podman-compose file so that compose builds the image and runs the container alongside Prometheus and Alertmanager.

version: '3'
services:
  event-driven:
    build: .
    ports:
      - 5000:5000
    depends_on:
      - prometheus
      - alertmanager


To run Prometheus, Alertmanager, and the event-driven rulebook, use the following:

podman-compose up -d


You will get the same results.

Continue your automation journey with Ansible Automation Platform

Get started with Ansible Automation Platform by exploring interactive hands-on labs. Download Ansible Automation Platform at no cost and begin your automation journey.
