Semantic anomaly detection in log files with Cordon

Detect unusual patterns in logs with embeddings

December 9, 2025
Caleb Evans
Related topics:
Artificial intelligence, Automation and management, Data Science, Developer Productivity, Open source
Related products:
Red Hat AI

Production log files are often long and cluttered, filled with repetitive INFO entries, health checks, and routine operational messages. In a failure scenario, most of these lines tell you nothing about what went wrong. Verbose logging in production applications is valuable, but we often don't know what we are looking for in these large log files. I created Cordon to help human and AI operators use semantic anomaly detection to identify what is truly unusual.

In this post, I explain how Cordon finds semantically unique events in log files while filtering out the noise.

Repetition is boring, uniqueness is interesting

My experience debugging unfamiliar logs often follows this cycle:

  1. Search logs for errors.
  2. Find one or more errors/stack traces.
  3. Ask an engineer familiar with the system whether these errors are the problem.
  4. "Maybe, but we see these errors all the time." Or, "No, this is a known problem. It's something else."

This might not apply to every application, but it holds true for many I have encountered. If an engineer who is unfamiliar with which errors are considered "normal" has a difficult time pinpointing an issue, then, context windows aside, how are large language models (LLMs) supposed to determine which errors are significant in large log files? They can't.

It might seem counterintuitive, but often a repetitive error can be considered "normal." If the same ERROR appears 10,000 times, it is often just considered background noise or a symptom of a larger problem. The real anomalies in a log file are semantically unique events. They only appear once or twice and are contextually different from everything else. Rather than searching for keywords or counting error frequencies, semantic anomaly detection looks for uniqueness in a transformer model's embedding space. Logs that are semantically similar cluster together; logs that are unusual stand alone.

Finding semantic meaning in logs

My approach to determining which part of the log file is semantically unique starts with embedding. The log file is chunked into "windows" defined by a set number of lines that you provide. Each window should roughly match the max token size of the chosen embedding model. I chose all-MiniLM-L6-v2 as the default model. It has a max token limit of 256, so I try to choose a window size large enough to process information efficiently without surpassing that limit. A window of four lines works well for most tests.
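As a rough illustration of the chunking step (a minimal sketch, not Cordon's actual implementation; the function name and log path are assumptions):

```python
def chunk_log(lines, window_size=4):
    """Group consecutive log lines into fixed-size windows.

    Each window is later embedded as a single unit of text.
    """
    return [
        "\n".join(lines[i:i + window_size])
        for i in range(0, len(lines), window_size)
    ]

with open("app.log") as f:  # placeholder path
    windows = chunk_log(f.read().splitlines())
```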

After chunking the logs, the system converts each window to a vector that captures its semantic meaning. Without getting too far into the weeds, semantically similar text tends to produce nearby vectors, even if the exact wording differs. This means the uniqueness of a log entry can be measured by its distance from other log entries in this high-dimensional space. For example (demonstrated in the sketch after this list):

  • Connection timeout after 30 seconds and Request timed out: no response produce similar embeddings.
  • OutOfMemoryError: Java heap space produces a very different embedding from connection errors.
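A minimal sketch with the sentence-transformers library shows this in practice (assuming the package and the all-MiniLM-L6-v2 model are available):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

logs = [
    "Connection timeout after 30 seconds",
    "Request timed out: no response",
    "OutOfMemoryError: Java heap space",
]
# Each string becomes a 384-dimensional vector.
embeddings = model.encode(logs, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
print(embeddings[0] @ embeddings[1])  # high: both describe timeouts
print(embeddings[0] @ embeddings[2])  # lower: a different failure mode
```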

Measuring semantic distance: cosine similarity and distance

To quantify how different two log entries are, this approach uses cosine similarity, a measure of the angle between two vectors in embedding space. Imagine two arrows pointing from the origin in vector space. If the arrows point in nearly the same direction (small angle between them), the texts are semantically similar. If they point in very different directions (large angle), the texts have different meanings.

Cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. The result ranges from -1 (opposite) to 1 (identical), though text embeddings typically fall between 0 and 1. A value of 0.95 means the texts are very similar; a value of 0.3 means they're quite different.

For anomaly detection, cosine distance is used instead (1 - cosine_similarity). This flips the scale so that higher values indicate a greater difference.
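The definitions above translate directly into code (a from-scratch sketch using NumPy):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector magnitudes.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - similarity: higher values mean greater semantic difference.
    return 1.0 - cosine_similarity(a, b)
```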

The detection methodology: k-NN density scoring

For each window embedding, the algorithm finds its k nearest neighbors (default k=5) and calculates the average distance to those neighbors. This is the anomaly score.

  • Low score (small average distance) = Dense cluster = Normal/repetitive logs
  • High score (large average distance) = Isolated point = Anomalous logs

Mathematically, the anomaly score S for a window x is:

S(x) = (1/k) × Σ distance(x, neighbor_i)  for i = 1 to k

In plain terms, that means: if k=5, the algorithm finds the five windows most similar to the current one, measures how far away each of them is, and averages those five distances. That average is the anomaly score. A window surrounded by very similar neighbors (small distances) gets a low score. A window with no close matches (large distances to even its nearest neighbors) gets a high score.
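A compact sketch of this scoring with scikit-learn (illustrative, not Cordon's exact code; it assumes an embeddings array like the one produced earlier):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Average cosine distance from each window to its k nearest neighbors."""
    # Ask for k + 1 neighbors because each point is its own nearest
    # neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine")
    nn.fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    # Drop the self-match in column 0, then average the remaining k.
    return distances[:, 1:].mean(axis=1)
```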

Benchmark

To evaluate the approach, I conducted benchmarks using the HDFS v1 dataset from Loghub. This log file contains 11.1 million lines of Hadoop Distributed File System production logs with 575,000 sessions and a 2.93% anomaly rate. This dataset has pre-labeled anomalies and 29 unique event templates (stored in separate files) that are used as a grading rubric.

Rather than using line-level metrics (precision, recall, F1), the evaluation focused on template-based metrics that measure the diversity of anomaly types detected. This aligns with the tool's design goal: finding semantically unique patterns, not counting every instance of repetitive errors. The metrics I wanted to measure were (sketched in code after this list):

  • Template recall: Fraction of unique anomaly types detected
  • Rare template recall: Detection rate of templates appearing fewer than 100 times
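Both metrics reduce to simple set arithmetic over event-template IDs; here is a hedged sketch (the names are illustrative, not the benchmark harness):

```python
def template_recall(detected: set, anomaly_templates: set) -> float:
    """Fraction of unique anomaly templates that were detected."""
    return len(detected & anomaly_templates) / len(anomaly_templates)

def rare_template_recall(detected: set, anomaly_templates: set,
                         counts: dict, max_count: int = 100) -> float:
    """Same recall, restricted to templates seen fewer than max_count times."""
    rare = {t for t in anomaly_templates if counts[t] < max_count}
    return len(detected & rare) / len(rare)
```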

Results and findings

The dominant factor in detection accuracy is sample size. But why?

The HDFS dataset has a specific characteristic that amplifies this effect: it contains only 29 unique event templates across 11 million lines. This low semantic diversity means that small random samples might miss entire template types or capture non-representative distributions. When sampling 50,000 lines from 11 million, which specific portion gets sampled matters enormously—a template that appears only 5 times in the full dataset might not appear at all in a small sample.

This suggests more data genuinely helps because it provides better coverage of the semantic landscape. Also, this variance might be specific to highly structured, repetitive logs like HDFS. Application logs with greater semantic diversity per line, where each line is more likely to be unique, can show better stability at smaller sample sizes because each sample naturally captures more variety.

Testing with 10 runs per configuration (see a more detailed analysis of the results in the project repo):

Sample size      | Template recall | Rare template recall | Coefficient of variation
50,000 lines     | 58.3% ± 19.4%   | 40.9% ± 31.5%        | 33.2%
100,000 lines    | 66.8% ± 13.2%   | 46.3% ± 31.5%        | 19.7%
250,000 lines    | 76.7% ± 16.3%   | 63.7% ± 31.1%        | 21.3%
500,000 lines    | 84.0% ± 12.9%   | 64.6% ± 33.4%        | 15.4%
1,000,000 lines  | 93.7% ± 5.2%    | 84.4% ± 14.8%        | 5.6%
5,000,000 lines  | 96.6%           | 90.0%                | N/A (single run)

With the 2% threshold (anomaly_percentile=0.02), the tool achieves a 98% reduction in log volume while maintaining high template recall on large samples:

  • 1 million lines → ~20K lines (98% reduction)
  • 5 million lines → ~100K lines (98% reduction)

This makes it practical to reduce massive log files to a size that fits within LLM context windows while preserving the semantically unique content.
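The thresholding itself is a percentile cut over the anomaly scores. A minimal sketch in plain NumPy (the anomaly_percentile name comes from the tool; the rest is illustrative):

```python
import numpy as np

def select_anomalous_windows(windows, scores, anomaly_percentile=0.02):
    """Keep only the highest-scoring fraction of windows."""
    cutoff = np.quantile(scores, 1.0 - anomaly_percentile)
    return [w for w, s in zip(windows, scores) if s >= cutoff]
```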

Practical applications

This approach is useful in specific scenarios:

  • LLM pre-processing: Reduce massive logs to just the anomalous sections before sending to an LLM. Instead of hitting context limits, send kilobytes of high-signal content.
  • Initial triage: When investigating unfamiliar logs, surface what's semantically unusual without knowing what to search for.
  • Exploratory analysis: Discover unexpected patterns like rare errors, unusual state transitions, and one-off events that would otherwise be buried in noise.

Limitations and trade-offs

This approach is intentionally lossy:

  • Repetitive errors are filtered: If the same critical error appears 500 times, it will score low (normal) because it clusters with itself. For counting error frequencies, use traditional tools.
  • Relative, not absolute: The percentile threshold/range is relative to each log file. What's "anomalous" in one log might not be in another.
  • Sample size matters: On highly structured logs like HDFS, small samples (< 250,000 lines) show high variance. Larger samples produce more stable results.
  • Not for compliance: Since this filters aggressively, don't use it for compliance logging or when complete audit trails are required.
  • Not for known issues: If you know what error you're looking for, grep is faster and more precise.

Conclusion

Transformer embeddings, cosine distance, and k-NN density scoring enable semantic anomaly detection that understands meaning, not just keywords. The technique automatically adapts to any log format and surfaces the semantically unique content without the need for pre-defined patterns or configuration.

Cordon is an open source tool and Python library. See the following links for more information:

  • GitHub repo
  • PyPI
  • In-depth architecture
  • Benchmark methods, results, etc.
