In today's fast-paced world, server downtime can have severe consequences for businesses. Ensuring high availability and rapid recovery is essential for maintaining uninterrupted services. In this article, we will explore how to create a self-healing server using the event-driven architecture of Red Hat Ansible Automation Platform and integrate it with Alertmanager for efficient monitoring and alerting.
Prerequisites
- Ansible Automation Platform and ansible-rulebook installed.
- Podman and podman-compose installed.
- The following ports open on the server: 5000 (Event-Driven Ansible webhook), 9090 (Prometheus), 9093 (Alertmanager), and 22 (SSH).
The concepts of event-driven automation and self-healing
The event-driven architecture of Ansible Automation Platform enables servers to respond to events and take predefined actions automatically. It utilizes event-driven automation and monitoring to detect and remediate issues in real time, leading to a self-healing infrastructure.
To learn more about the concept of Event-Driven Ansible, please read my previous article. You can pull the code from our GitHub repository.
1. Install Prometheus and Alertmanager
We will run Prometheus and Alertmanager as containers with podman-compose, using the following podman-compose.yaml file.
version: '3'
services:
  prometheus:
    image: prom/prometheus:v2.30.3
    ports:
      - 9090:9090
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command: --web.enable-lifecycle --config.file=/etc/prometheus/prometheus.yml
  alertmanager:
    image: prom/alertmanager:v0.23.0
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - "./alertmanager:/config"
      - alertmanager-data:/data
    command: --config.file=/config/alertmanager.yml --log.level=debug
volumes:
  alertmanager-data:
  prometheus-data:
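The compose file mounts ./prometheus into /etc/prometheus and starts Prometheus with /etc/prometheus/prometheus.yml, but that file is not shown in this article. The following is only a minimal sketch of what it might contain. The scrape target, the node_exporter port 9100, the rules directory, and the assumption that the alertmanager service name resolves between containers are all illustrative and not taken from the original setup.

```yaml
# prometheus/prometheus.yml (hypothetical sketch, adjust to your environment)
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often alert rules are evaluated

# Where Prometheus sends firing alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Assumes the compose service name resolves; otherwise use the host IP
            - alertmanager:9093

# Alert rule files (hypothetical directory inside the mounted volume)
rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # The job name becomes the "job" label that the rulebook conditions match on
  - job_name: server
    static_configs:
      - targets:
          - 192.168.1.22:9100   # hypothetical node_exporter endpoint
```

Because the compose file passes --web.enable-lifecycle, Prometheus can reload this file after edits through its /-/reload HTTP endpoint instead of a container restart.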
Configure Event-Driven Ansible as a receiver in the receivers section of the alertmanager.yml file. In the webhook_configs section, set the URL to the IP address of the machine where the rulebook runs. For instance, if you run the rulebook on your local machine, the URL would be http://192.168.1.65:5000/alerts; if it runs on a remote server, use that server's public IP instead, for example http://123.345.9.56:5000/alerts.
alertmanager.yml:
route:
  group_by: [ alertname ]
  receiver: 'EDA' # default receiver
  repeat_interval: 24h
  routes:
receivers:
  - name: 'EDA'
    webhook_configs:
      - url: 'http://172.123.170.87:5000/alerts'
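The routes key above is left empty, so every alert goes to the default receiver. As a hedged illustration of how it could be filled in, the sketch below adds one child route for the job labels that the rulebook handles and shortens group_wait (Alertmanager's default is 30s) so remediation starts sooner. The job values come from the rulebook conditions later in this article; the group_wait value is an assumption, not a recommendation from the original setup.

```yaml
# alertmanager/alertmanager.yml (sketch with one child route)
route:
  group_by: [ alertname ]
  receiver: 'EDA'             # default receiver
  repeat_interval: 24h
  routes:
    # Child route for the jobs the rulebook can remediate
    - matchers:
        - 'job =~ "server|storage|memory|ssh|cpu"'
      receiver: 'EDA'
      group_wait: 10s         # hypothetical value; the default is 30s
receivers:
  - name: 'EDA'
    webhook_configs:
      - url: 'http://172.123.170.87:5000/alerts'
        send_resolved: true   # also notify the rulebook when an alert clears
```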
To start the containers defined in the compose file, run the following command:
podman-compose up -d
Check that Alertmanager and Prometheus are running:
podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
254000d2a108 prom/alertmanager:v0.23.0 "/bin/alertmanager -..." 15 seconds ago Up 13 seconds 0.0.0.0:9093->9093/tcp, :::9093->9093/tcp self-healing-server_alertmanager_1
277f1c6da0cd prom/prometheus:v2.30.3 "/bin/prometheus --w..." 15 seconds ago Up 14 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp self-healing-server_prometheus_1
Open http://192.168.1.22:9090 in a browser and check that Prometheus is up and running. Then open the Alertmanager dashboard at http://192.168.1.22:9093.
2. Write the rulebook
Every rulebook follows the same basic structure: source, rule, action. The following rulebook has the same three parts. For the self-healing use case, each rule has a condition that, when matched, triggers a specific remediation playbook.
---
- name: Automatic Remediation of a webserver
  hosts: localhost
  sources:
    - name: listen for alerts
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: server down
      condition: event.alert.labels.job == "server" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/server-playbook.yml
    - name: Storage full on server
      condition: event.alert.labels.job == "storage" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/storage-playbook.yml
    - name: memory full on server
      condition: event.alert.labels.job == "memory" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/memory-playbook.yml
    - name: ssh server down
      condition: event.alert.labels.job == "ssh" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/ssh-playbook.yml
    - name: CPU full on server
      condition: event.alert.labels.job == "cpu" and event.alert.status == "firing"
      action:
        run_playbook:
          name: remediation-playbooks/cpu-playbook.yml
For a self-healing server, we must list every condition or scenario in which we can anticipate the server running into trouble, such as full storage or exhausted memory.
For each of these issues, we then work out the remediation and create an Ansible Playbook that resolves the issue as soon as it is detected, without manual intervention.
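The remediation playbooks themselves are not listed in this article. As a rough idea of what remediation-playbooks/server-playbook.yml could look like, here is a minimal sketch that restarts a web server service. The service name httpd is a placeholder, and the debug task assumes that ansible-rulebook passes the matched event to the playbook as the ansible_eda.event variable, which is how recent ansible-rulebook releases behave as far as I know.

```yaml
---
# remediation-playbooks/server-playbook.yml (hypothetical contents)
- name: Restart the web server
  hosts: localhost
  connection: local
  gather_facts: false
  become: true
  tasks:
    # Assumes the playbook is launched by the rulebook, which supplies ansible_eda.event
    - name: Show which alert triggered this run
      ansible.builtin.debug:
        msg: "Remediating alert for job {{ ansible_eda.event.alert.labels.job }}"

    - name: Restart the web server service
      ansible.builtin.service:
        name: httpd          # hypothetical service name
        state: restarted
```

The other playbooks (storage, memory, ssh, cpu) would follow the same shape, each with tasks suited to its own issue.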
Create the inventory file with localhost as host:
localhost
3. Run Ansible Rulebook
Use the ansible-rulebook command to run the rulebook:
ansible-rulebook --rulebook ansible-rulebook.yaml -i inventory -v
05:13:46,294 - ansible_rulebook.app - INFO - Starting sources
05:13:46,294 - ansible_rulebook.app - INFO - Starting rules
05:13:46,294 - ansible_rulebook.engine - INFO - run_ruleset
05:13:47 496 [main] INFO org.drools.ansible.rulebook.integration.api.rulesengine.AbstractRulesEvaluator - Start automatic pseudo clock with a tick every 100 milliseconds
05:13:48,402 - ansible_rulebook.engine - INFO - load source filters
05:13:48,403 - ansible_rulebook.engine - INFO - loading eda.builtin.insert_meta_info
05:13:48,887 - ansible_rulebook.engine - INFO - Calling main in ansible.eda.alertmanager
05:13:48,890 - ansible_rulebook.engine - INFO - Waiting for all ruleset tasks to end
05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for actions on events from Automatic Remediation of a webserver
05:13:48,890 - ansible_rulebook.rule_set_runner - INFO - Waiting for events, ruleset: Automatic Remediation of a webserver
05:13:48 891 [drools-async-evaluator-thread] INFO org.drools.ansible.rulebook.integration.api.io.RuleExecutorChannel - Async channel connected
The rulebook now waits for Alertmanager to send an alert, and a rule triggers only once the alert's status changes to firing. The conditions do the real work here: each one joins two checks with the and operator. The first check matches the job label, and the second matches the firing status.
We can assign different job labels to different applications and trigger a different remediation playbook for each one based on the alert status.
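The same condition syntax can match other status values. As a small sketch, the rule below could be appended under rules: in the rulebook to log alerts that Alertmanager reports as resolved, which is a simple way to confirm that a remediation worked. The rule name is made up for illustration, and it relies on Alertmanager sending resolved notifications to the webhook.

```yaml
    # Hypothetical extra rule, indented to sit under the existing "rules:" key
    - name: alert resolved
      condition: event.alert.status == "resolved"
      action:
        # Print the full event so the cleared alert is visible in the rulebook log
        print_event:
          pretty: true
```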
4. Trigger the alerts of Alertmanager
When an alert is triggered, its status changes from stable to firing, as shown in Figure 1. By changing the value of the job label on the alert, we can choose which application's remediation playbook is triggered.
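The alert definitions are not included in this article. Assuming Prometheus loads a rule file through rule_files in prometheus.yml, alerts that carry the expected job label could be sketched as follows. The file path, alert names, thresholds, and the use of node_exporter metrics are assumptions for illustration only.

```yaml
# /etc/prometheus/rules/self-healing.yml (hypothetical rule file)
groups:
  - name: self-healing
    rules:
      # The "up" series already carries job="server" from the scrape config,
      # so the alert matches the "server down" rule in the rulebook
      - alert: ServerDown
        expr: up{job="server"} == 0
        for: 1m
        annotations:
          summary: "Target {{ $labels.instance }} is down"

      # Labels set on a rule override labels from the expression, so this
      # alert reaches the rulebook with job="storage"
      - alert: StorageFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels:
          job: storage
        annotations:
          summary: "Less than 10% free space on / at {{ $labels.instance }}"
```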
Go back to the Ansible Rulebook CLI terminal where the rulebook was run. The logs show that the rule was triggered and the remediation playbook was run.
2023-06-20 05:26:46,907 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"
2023-06-20 05:26:46,935 - ansible_rulebook.rule_generator - INFO - calling restart web server
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - call_action run_playbook
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - substitute_variables
2023-06-20 05:26:46,938 - ansible_rulebook.rule_set_runner - INFO - action args: {'name': 'say-what.yml'}
2023-06-20 05:26:46,938 - ansible_rulebook.builtin - INFO - running Ansible playbook: say-what.yml
2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - ruleset: Automatic Remediation of a webserver, rule: restart web server
2023-06-20 05:26:46,942 - ansible_rulebook.builtin - INFO - Calling Ansible runner
2023-06-20 05:26:46,943 - aiohttp.access - INFO - 4.246.213.96 [20/Jun/2023:05:26:46 +0000] "POST /alerts HTTP/1.1" 202 164 "-" "Alertmanager/0.23.0"
5. Containerize Event-Driven Ansible functionality
You can also run Event-Driven Ansible in a container, built from the following Containerfile:
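FROM registry.access.redhat.com/ubi9-minimal
# Install Java (needed by the Drools rule engine), Python, and build tools, then the Python packages
RUN microdnf install java-17 python3 gcc python3-devel -y && microdnf clean all && python3 -m ensurepip --upgrade && pip3 install ansible ansible-rulebook asyncio aiokafka aiohttp aiosignal
ENV JAVA_HOME="/usr/lib/jvm/jre-17"
RUN mkdir /eda-ansible
# Install the collection that provides the ansible.eda.alertmanager source
RUN ansible-galaxy collection install ansible.eda
WORKDIR /eda-ansible
# Copy the rulebook, inventory, and remediation playbooks into the image
COPY . /eda-ansible
CMD ansible-rulebook -i inventory --rulebook ansible-rulebook.yaml --verbose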
Add Event-Driven Ansible as a service in the podman-compose file. The compose file then builds the image from the Containerfile and runs the container alongside Prometheus and Alertmanager.
version: '3'
services:
  event-driven:
    build: .
    ports:
      - 5000:5000
    depends_on:
      - prometheus
      - alertmanager
To run Prometheus, Alertmanager, and the event-driven rulebook, use the following:
podman-compose up -d
You will get the same results.
Continue your automation journey with Ansible Automation Platform
Get started with Ansible Automation Platform by exploring interactive hands-on labs. Download Ansible Automation Platform at no cost and begin your automation journey.