
After installing a fresh Red Hat OpenShift cluster, go to Monitoring -> Alerting. There, you will find a Watchdog alert, which fires continuously to let you know that Alertmanager is not only still running, but is also able to emit notifications for other alerts you might be interested in. You can hook into the Watchdog alert with an external monitoring system, which in turn can tell you that alerting in your OpenShift cluster is working.

"You need a check to check if your check checks out."

How do you do this? Before we can configure Alertmanager to send out Watchdog alerts, we need something on the receiving side, which in our case is Nagios. Follow me on this journey to get Alertmanager's Watchdog alerting against Nagios with a passive check.

Set up Nagios

OpenShift is probably not the first infrastructure element you run under your supervision. That is why we start by capturing a message from OpenShift with a self-made Python HTTP receiving server (actually taken from the Python 3 website and adjusted), just to learn how to configure Alertmanager and to see how we might modify the received alert message.

Also, you probably already have Nagios, Checkmk, Zabbix, or something else for external monitoring and alerting. For this journey, I chose Nagios because it is a simple, pre-packaged option that you can set up via yum install nagios. Nagios normally only does active checks, meaning that Nagios itself initiates a check that you have configured. To know whether the OpenShift Alertmanager is working, though, we need a passive check in Nagios, where an external system submits the check result to Nagios.
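
A passive check result reaches Nagios as an external command written to its command file, one line in the format [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<service_description>;<return_code>;<plugin_output>. We will generate exactly such a line later from the webhook helper; here is what one looks like, with a placeholder timestamp and the host and service names used throughout this article:

[1585215450] PROCESS_SERVICE_CHECK_RESULT;box.example.com;OCPALERTMANAGER;0;OK - Alertmanager Watchdog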

So, let's go and let our already existing monitoring system receive something from Alertmanager. Start by installing Nagios and the needed plugins:

$ yum -y install nagios nagios-plugins-ping nagios-plugins-ssh nagios-plugins-http nagios-plugins-swap nagios-plugins-users nagios-plugins-load nagios-plugins-disk nagios-plugins-procs nagios-plugins-dummy

Let's be more secure and change the provided default password for the Nagios administrator, using htpasswd:

$ htpasswd -b /etc/nagios/passwd nagiosadmin <very_secret_password_you_created>

Note: If you also want to change the admin's username nagiosadmin to something else, don't forget to change it also in /etc/nagios/cgi.cfg.

Now, we can enable and start Nagios for the first time:

$ systemctl enable nagios
$ systemctl start nagios

Do not forget that every time you modify your configuration files, you should run a sanity check on them. It is important to do this before you (re)start Nagios Core since it will not start if your configuration contains errors. Use the following to check your Nagios configuration:

$ /sbin/nagios -v /etc/nagios/nagios.cfg
$ systemctl reload nagios
$ systemctl status -l nagios

Dump HTTP POST content to a file

Before we start configuring, we first need an HTTP POST receiver program to accept messages from the Alertmanager via a webhook configuration. Alertmanager sends out a JSON message to an HTTP endpoint. To handle that, I created a very basic Python program that dumps all data received via POST into a file:

#!/usr/bin/env python3

from http.server import HTTPServer, BaseHTTPRequestHandler
from io import BytesIO

class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello, world!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        response.write(b'This is POST request. ')
        response.write(b'Received: ')
        response.write(body)
        self.wfile.write(response.getvalue())
        dump_json = open('/tmp/content.json', 'w')
        dump_json.write(body.decode('utf-8'))
        dump_json.close()

httpd = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
httpd.serve_forever()
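
To try the receiver locally before wiring anything else up, run it and send it a test POST (the script name here is just an example); the body should end up in /tmp/content.json:

$ python3 http_dump.py &
$ curl -X POST -d '{"test":"ping"}' http://localhost:8000/
$ cat /tmp/content.json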

The above program definitely needs some rework. Both the location and the format of the output have to be changed for Nagios.

Configure Nagios for a passive check

Now that this rudimentary receiver program is in place, let's configure the passive checks in Nagios. I added a dummy command to the file /etc/nagios/objects/commands.cfg. That is what I understood from the Nagios documentation, although it is not entirely clear to me whether that is the right place or the right information. In the end, this process worked for me, so keep following: the goal is to have the Alertmanager show up in Nagios.

Add the following to the end of the commands.cfg file:

define command {
    command_name    check_dummy
    command_line    $USER1$/check_dummy $ARG1$ $ARG2$
}

Then add this to the server's service object .cfg file:

define service {
    use                     generic-service
    host_name               box.example.com
    service_description     OCPALERTMANAGER
    notifications_enabled   0
    passive_checks_enabled  1
    check_interval          15 ; 1.5 times watchdog alerting time
    check_freshness         1
    check_command           check_dummy!2 "Alertmanager FAIL"
}
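
When check_freshness kicks in because no passive result arrived in time, Nagios runs the check_command above and the service goes critical. You can see the effect by calling the plugin by hand; I assume here that $USER1$ points to /usr/lib64/nagios/plugins, the default location on a yum-installed Nagios, so adjust the path if needed:

$ /usr/lib64/nagios/plugins/check_dummy 2 "Alertmanager FAIL"
$ echo $?
2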

It would be nice if we could check that this is working via curl, but first, we have to change the sample Python program. It writes to a file by default, and for this example, it must write to a Nagios command_file.

This is the adjusted Python program to write to the command_file with the right service_description:

#!/usr/bin/env python3

from http.server import HTTPServer, BaseHTTPRequestHandler
from io import BytesIO
import time

class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello, world!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        response.write(b'This is POST request. ')
        response.write(b'Received: ')
        response.write(body)
        self.wfile.write(response.getvalue())
        # Submit a passive check result to Nagios via its external command file
        msg_string = "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}"
        datetime = int(time.time())  # Nagios expects an epoch timestamp
        hostname = "box.example.com"
        servicedesc = "OCPALERTMANAGER"
        severity = 0
        comment = "OK - Alertmanager Watchdog\n"
        cmdFile = open('/var/spool/nagios/cmd/nagios.cmd', 'w')
        cmdFile.write(msg_string.format(datetime, hostname, servicedesc, severity, comment))
        cmdFile.close()

# Listen on all interfaces so the Alertmanager can reach this host from the cluster
httpd = HTTPServer(('', 8000), SimpleHTTPRequestHandler)
httpd.serve_forever()

With a little curl, we can check that the Python program can write to the command_file and that Nagios picks up the result:

$ curl localhost:8000 -d OK -X POST
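
Assuming log_external_commands is enabled in nagios.cfg, you can also confirm on the Nagios side that the passive result was consumed by watching the Nagios log (the path below is the usual one for a yum-installed Nagios):

$ tail -f /var/log/nagios/nagios.log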

Now we only have to trigger the POST action. All of the information sent to Nagios is hard-coded in this Python program, which is really not best practice, but it got me going for now. At this point, we have an endpoint (SimpleHTTPRequestHandler) that we can connect to the Alertmanager via a webhook, linking it to an external monitoring system, in this case Nagios with an HTTP helper program.

Configure the webhook in Alertmanager

To configure the Alertmanager's Watchdog, we have to adjust the alertmanager.yaml stored in the secret alertmanager-main. Get that file out of OpenShift with the following command, then adjust it so that the Watchdog alert is routed to a webhook receiver, as shown in the configuration below:

$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      alertname: 'Watchdog'
    repeat_interval: 5m
    receiver: 'watchdog'
receivers:
- name: 'default'
- name: 'watchdog'
  webhook_configs:
  - url: 'http://nagios.example.com:8000/'

Note: The Prometheus Alertmanager documentation lists the possible receiver configurations. As I found out with webhook_config, the key has to be written in plural form (webhook_configs) in alertmanager.yaml. Also, check out the example provided on the Prometheus GitHub.

To get our freshly adjusted configuration back into OpenShift, execute the following command:

$ oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run -o=yaml | oc -n openshift-monitoring replace secret --filename=-
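
To verify that the secret was actually replaced, you can dump it again with the same command we used earlier and compare it with your adjusted file:

$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d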

In the end, you should see something similar to the following arrive on the receiving side. This is the message the Watchdog sends, via webhook_configs, to Nagios:

{"receiver":"watchdog",
"status":"firing",
"alerts":[
{"status":"firing",
"labels":
{"alertname":"Watchdog",
"prometheus":"openshift-monitoring/k8s",
"severity":"none"},
"annotations":
{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},
"startsAt":"2020-03-26T10:57:30.163677339Z",
"endsAt":"0001-01-01T00:00:00Z",
"generatorURL":"https://prometheus-k8s-openshift-monitoring.apps.box.example.com/graph?g0.expr=vector%281%29\u0026g0.tab=1",
"fingerprint":"e25963d69425c836"}],
"groupLabels":{},
"commonLabels":
{"alertname":"Watchdog",
"prometheus":"openshift-monitoring/k8s",
"severity":"none"},
"commonAnnotations":
{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},
"externalURL":"https://alertmanager-main-openshift-monitoring.apps.box.example.com",
"version":"4",
"groupKey":"{}/{alertname=\"Watchdog\"}:{}"}

In the end, if all went well, you will see a nice green OCPALERTMANAGER service in the Nagios services overview.

If you want to learn more about Nagios passive checks, read more at Nagios Core Passive Checks.

Thanks for joining me on this journey!

 
