Debug Ansible errors faster with an AI monitoring agent

Debug Ansible errors faster with an agent that monitors logs and suggests fixes in real time

February 10, 2026
Itay Katav
Related topics: Artificial intelligence, Automation and management, Observability
Related products: Red Hat AI, Red Hat Ansible Automation Platform

    As the number of Ansible playbooks grows and each playbook runs many individual tasks, both the overall log volume and the number of errors rise significantly.

    To resolve these errors, analysts identify the failed playbook run and route the issue to an authorized specialist. That specialist then determines the best resolution based on business criteria.

    This manual process is slow, error-prone, and fundamentally does not scale as automation grows.

    This blog explores Ansible log monitoring, an AI quickstart designed to speed up the resolution process. Red Hat AI quickstarts are demo applications that show real-world AI use cases.

    How this solution automates the Ansible error resolution workflow

    This solution provides an automated pipeline that eliminates the manual steps described earlier. The system provides the following capabilities:

    • Continuous log ingestion: Ingests and parses Ansible playbook logs in real time.
    • Error detection and alerting: Detects failed tasks and generates alerts.
    • Role-based error routing: Routes errors to the correct specialist based on their authorization level.
    • Automated solution generation: Generates step-by-step solutions using cluster logs and a private knowledge base.
    • Continuous improvement and observability: An evaluation system and observability tools to understand and improve the entire process.
    • User interface: Displays the suggested solution and other relevant metrics in an intuitive way.
    Figure 1: High-level architecture. Logs are ingested from Ansible Automation Platform servers into Loki; Grafana alerts trigger an agentic workflow that generates solutions using cluster logs and a private knowledge base.

    Ansible log ingestion

    To ingest the logs, you need the following items:

    • A location to store the executed Ansible playbook logs
    • A service that ingests the logs and transforms them into a specific format
    • A log database

    This post uses Red Hat Ansible Automation Platform to store playbook executions. Ansible Automation Platform helps define, manage, and execute Ansible automation.

    To ingest these logs in real time, we use Grafana Alloy, a log ingestion and aggregation tool. You can point Alloy at the Ansible Automation Platform service to ingest each job's stdout.

    During ingestion, regular expressions (regex) define where each log entry begins and ends and extract its labels.

    The following example shows log output from an Ansible playbook execution:

    TASK [<task_name> : <task_description>]
    Monday 04 August 2025  07:52:22 +0000 (0:00:01.388)       1:12:32.151 ********* 
    ok: [host_name]
    TASK [<task_name> : <task_description>]
    Monday 04 August 2025  07:52:22 +0000 (0:00:00.019)       1:12:32.170 ********* 
    failed: [host_name] <error message>

    First, split each playbook log into separate task entries. You can do this by splitting on the regex \n\n.

    Next, assign metadata labels to each log entry. For example, you can label tasks based on their status (ok, failed, or fatal) using a regular expression like this:

    "(?P<status>ok|failed|fatal):\\s+\\[(?P<host>[^\\]]+)\\]"

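    To make this concrete, here is a minimal Python sketch of both steps: splitting on blank lines and extracting the status and host labels with the regex above. The actual pipeline does this inside the Alloy configuration; this standalone version is purely illustrative.

    import re

    # The labeling regex from above, matching lines such as
    # "failed: [host_name] <error message>"
    LABEL_RE = re.compile(r"(?P<status>ok|failed|fatal):\s+\[(?P<host>[^\]]+)\]")

    def parse_playbook_log(raw_log: str) -> list[dict]:
        """Split a playbook log into TASK entries and attach labels."""
        entries = []
        for chunk in raw_log.split("\n\n"):  # one chunk per TASK entry
            match = LABEL_RE.search(chunk)
            if match:
                entries.append({"entry": chunk, **match.groupdict()})
        return entries
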
    After ingesting and labeling the logs, the next step is to store them and make them queryable. For this, we use Loki: a database designed for storing logs and querying them by labels. For example, you can filter with the status = 'failed' label using LogQL, or apply regex in the query as an additional search mechanism.
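
    As an illustration of the query side, this Python sketch pulls failed-task entries from Loki's HTTP API. The status label matches the regex above, while the endpoint URL and the job label are assumptions.

    import requests

    LOKI_URL = "http://loki:3100"  # assumed Loki endpoint

    def fetch_failed_tasks(limit: int = 100) -> list[str]:
        """Query Loki for entries labeled status="failed" using LogQL."""
        resp = requests.get(
            f"{LOKI_URL}/loki/api/v1/query_range",
            params={"query": '{job="ansible", status="failed"}', "limit": limit},
            timeout=30,
        )
        resp.raise_for_status()
        streams = resp.json()["data"]["result"]
        return [line for stream in streams for _, line in stream["values"]]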

    After parsing the log file into individual TASK entries, labeling them, and storing them, let’s look at how to handle those error logs.

    Agentic workflow

    Once the error logs are stored in the Loki database, the agentic workflow can process them. Let's look at Figure 2 and dive in step by step.

    Figure 2: Agentic workflow solution. Error logs pass through a sentence encoder and clustering, then a router that uses an agent and retrieval-augmented generation (RAG) to produce step-by-step solutions stored in PostgreSQL.

    Step 1: Define the error template

    Many logs are generated from the same log template. To group them, each log is embedded using a pre-trained sentence encoder, and the resulting embeddings are clustered with a clustering model trained from scratch. Each cluster represents a log template.

    For example, consider the following three logs:

    1. error: user id 10 already exists.
    2. error: user id 15 already exists.
    3. error: password of user itayk is wrong.

    Looking at the logs above, logs 1 and 2 are generated from the log template:

    error: user id <user_id> already exists.

    Grouping by template prevents the system from generating duplicate solutions for the same error.
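
    A minimal sketch of this grouping step, assuming the sentence-transformers and scikit-learn libraries (the quickstart's concrete encoder and clustering algorithm may differ):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    logs = [
        "error: user id 10 already exists.",
        "error: user id 15 already exists.",
        "error: password of user itayk is wrong.",
    ]

    # Embed each log with a pre-trained sentence encoder.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(logs)

    # Cluster the embeddings; each cluster stands in for one log template.
    template_ids = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
    print(template_ids)  # e.g., [0, 0, 1]: logs 1 and 2 share a template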

    Step 2: Summary and expert classification

    The user interface includes a summary and classification by authorization for each log template. Users can filter errors based on their authorization level. For example, an analyst with AWS authorization can filter to view only the error summaries related to AWS.

    Step 3: Creating a step-by-step solution

    We use a router to determine whether the error log alone is enough to generate the step-by-step solution or whether more context is needed. If the system requires more context, it spins up an agent that collects context using the following tools (a routing skeleton is sketched after the tool descriptions below):

    Loki Model Context Protocol (MCP)

    The Loki MCP server enriches log analysis by querying the Loki log database to fetch additional log context. The available MCP tools include:

    • get_play_recap: Returns the play recap and profile tasks for the executed job, providing a high-level summary of the TASKs executed by the playbook.

      PLAY RECAP *********************************************************************
      <host_name>                : ok=<ok_count>   changed=<changed_count>   unreachable=<unreachable_count>   failed=<failed_count>   skipped=<skipped_count>   rescued=<rescued_count>   ignored=<ignored_count>   
      =============================================================================== 
      <role_or_task_group> : <task_description> ------------------------------- <duration>
      <role_or_task_group> : <task_description> ------------------------------- <duration>
      ...
    • get_lines_above: Retrieves a set number of lines that occurred before a specific log line.
    • get_logs_by_file_name: Retrieves logs from a particular file with time ranges relative to a reference timestamp.
    • search_logs_by_text: Searches for logs containing specific text within a file.

      This project uses an enhanced fork with additional query capabilities.

    Knowledge base retrieval

    To ensure the agent uses historical expertise when generating solutions, we maintain a knowledge base of recurring problems and their solutions. Analysts create and annotate these, and the agent can retrieve them using retrieval-augmented generation (RAG) as needed.

    Analysts can add recurring problems and solutions over time. This solution includes a basic knowledge base you can easily extend.
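
    The Phoenix trace in Figure 3 shows LangGraph nodes named classify_log_node and get_more_context_node, so the routing step promised above might be wired up roughly as follows. This is a minimal sketch: the node bodies and the routing heuristic are placeholders, not the quickstart's actual implementation.

    from typing import TypedDict

    from langgraph.graph import END, StateGraph

    class LogState(TypedDict):
        log: str
        context: str
        solution: str

    def classify_log_node(state: LogState) -> LogState:
        # Placeholder: summarize the log and classify it by expert class.
        return state

    def get_more_context_node(state: LogState) -> LogState:
        # Placeholder: gather context via the Loki MCP tools
        # (get_play_recap, get_lines_above, ...) and RAG retrieval
        # from the knowledge base.
        return state

    def generate_solution_node(state: LogState) -> LogState:
        # Placeholder: prompt the model for a step-by-step solution.
        return state

    def route(state: LogState) -> str:
        # Router: is the error log alone enough, or is more context needed?
        needs_context = "fatal" in state["log"]  # stand-in heuristic
        return "get_more_context" if needs_context else "generate_solution"

    graph = StateGraph(LogState)
    graph.add_node("classify_log", classify_log_node)
    graph.add_node("get_more_context", get_more_context_node)
    graph.add_node("generate_solution", generate_solution_node)
    graph.set_entry_point("classify_log")
    graph.add_conditional_edges("classify_log", route)
    graph.add_edge("get_more_context", "generate_solution")
    graph.add_edge("generate_solution", END)
    app = graph.compile()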

    Step 4: Store

    Store the generated results for each log in a PostgreSQL database as a payload.
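
    A minimal sketch of this step with psycopg2; the connection string, table, and columns here are assumptions for illustration:

    import json

    import psycopg2

    conn = psycopg2.connect("dbname=ansible_agent user=postgres")  # assumed DSN

    def store_solution(template_id: int, log: str, solution: dict) -> None:
        """Persist the generated solution payload for one log entry."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO solutions (template_id, log, payload) "
                "VALUES (%s, %s, %s)",
                (template_id, log, json.dumps(solution)),
            )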

    Training and inference stages

    Our algorithm uses a batch training approach, retraining the clustering model to determine the templates at set intervals, such as every night.

    At inference time, the system uses the latest trained clustering algorithm.
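
    A hedged sketch of that train/inference split, assuming joblib for model persistence; the scheduling itself (for example, a nightly cron job) is out of scope here:

    import joblib
    from sklearn.cluster import KMeans

    MODEL_PATH = "clustering_model.joblib"  # assumed artifact location

    def nightly_retrain(embeddings) -> None:
        """Batch job: refit the clustering model on the latest embeddings."""
        model = KMeans(n_clusters=16, n_init="auto").fit(embeddings)
        joblib.dump(model, MODEL_PATH)

    def assign_template(embedding) -> int:
        """Inference: map a new log embedding to its current template."""
        model = joblib.load(MODEL_PATH)
        return int(model.predict([embedding])[0])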

    Observability and user interfaces

    So far, we have generated suggestions for solving errors, but how can we tell when the agent is failing? Identifying where the agent fails helps you adjust its behavior to meet expectations.

    Phoenix tracing observability

    To view each agentic step, we can use a tracing mechanism that shows all the agent's intermediate inputs and outputs. We chose Phoenix, an open source solution for monitoring and tracing (Figure 3).

    Figure 3: View of Phoenix tracing. A hierarchical tree displays a LangGraph trace with nested nodes such as classify_log_node and get_more_context_node; the output pane shows a JSON log entry with a fatal status and a connection-refusal error.

    This way, you can see the input and output of each step and understand exactly where the agent is failing in a specific case.
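
    A minimal sketch of wiring this up, assuming the arize-phoenix and openinference-instrumentation-langchain packages; the quickstart may register tracing differently:

    import phoenix as px
    from openinference.instrumentation.langchain import LangChainInstrumentor
    from phoenix.otel import register

    # Start a local Phoenix server to collect and display traces.
    px.launch_app()

    # Point an OpenTelemetry tracer provider at Phoenix, then
    # auto-instrument LangChain/LangGraph so every node's inputs and
    # outputs show up as spans.
    tracer_provider = register(project_name="ansible-log-monitoring")
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)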

    Annotation interface: Annotation and evaluation

    The annotation interface provides a dedicated workspace for reviewing and assessing the system's log-processing results. It presents the original input log entry, the system's generated outputs, and the evaluation results side by side. This layout lets domain experts thoroughly evaluate the system's behavior.

    Through this interface, specialists can:

    • Define the correct (golden) expected output for each error log
    • Provide detailed feedback on the system's generated result
    • Assess whether the agent needed additional context
    • Indicate whether important context was missing, and suggest more appropriate context that should have been retrieved

    See Figure 4 for an example of the human interface in action.

    Figure 4: Annotation interface feedback view, showing a golden solution for an Ansible playbook failure with a root cause analysis, a two-step CLI recovery process, and the associated log labels in JSON.

    After defining the expected output, run an evaluation to see how the agentic output aligns with the specialist's golden output (Figure 5).

    Figure 5: Evaluation and generated output for an entry, comparing the AI-generated output against the golden solution: a root cause analysis for a CloudFormation error, step-by-step CLI remediation, and automated evaluation results marked as passed.
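
    One simple way to score that alignment, shown here only as a stand-in for the quickstart's actual evaluation mechanism, is embedding similarity between the generated solution and the golden output:

    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def evaluate(generated: str, golden: str, threshold: float = 0.8) -> bool:
        """Pass when the generated solution is semantically close to the golden one."""
        emb = encoder.encode([generated, golden])
        return float(util.cos_sim(emb[0], emb[1])) >= threshold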

    This process lets you define expectations for the system, run evaluations, identify failures, and improve the workflow to get better results.

    Analyst interface

    Analysts no longer need to check Ansible job logs manually. By connecting this agent to the Ansible Automation Platform cluster, they can use the Analyst interface instead. In the interface, the analyst selects a specialty, as shown in Figure 6.

    Figure 6: Expert and label selection in the UI. The Expert Class dropdown lists technical personas such as Cloud Infrastructure Engineers, Kubernetes Admins, and DevOps Engineers.

    After selecting a specialty, the analyst can view error summaries they are authorized to resolve (Figure 7).

    Figure 7: Example of error templates for the AWS specialist, including variable errors in environment fields and EC2 instance capacity failures.

    When you click a summary, the UI shows the generated step-by-step solution, timestamps, and labels (Figure 8).

    Figure 8: Generated step-by-step solution for an Ansible error, with the full log message, a root cause analysis for an undefined AWS access key, and a two-step CLI remediation process.

    Wrap up

    This article explored the Ansible log monitoring agent AI quickstart, which helps you resolve Ansible execution errors faster.

    We covered how the agentic system creates contextual, step-by-step solutions for Ansible errors. We also explained how to use an annotation interface to define expected (golden) outputs and an evaluation mechanism to measure agent performance against those standards.

    Try it yourself: Accelerate Ansible troubleshooting with intelligent log analysis
