Production log files are often long and cluttered, filled with repetitive INFO entries, health checks, and routine operational messages. In a failure scenario, most of these lines tell you nothing about what went wrong. Verbose logging in production applications is valuable, but we often don't know what we're looking for in these large log files. I created Cordon to help human and AI operators use semantic anomaly detection to identify what is truly unusual.
In this post, I explain how Cordon finds semantically unique events in log files while filtering out the noise.
Repetition is boring, uniqueness is interesting
My experience debugging unfamiliar logs often follows this cycle:
- Search logs for errors.
- Find one or more errors/stack traces.
- Ask an engineer familiar with the system whether these errors are the problem.
- "Maybe, but we see these errors all the time." Or, "No, this is a known problem. It's something else."
This might not apply to every application, but it holds true for many I have encountered. If an engineer who is unfamiliar with which errors are considered "normal" has a hard time pinpointing an issue, then, context windows aside, how are large language models (LLMs) supposed to determine which errors are significant in large log files? They can't.
It might seem counterintuitive, but often a repetitive error can be considered "normal." If the same ERROR appears 10,000 times, it is often just considered background noise or a symptom of a larger problem. The real anomalies in a log file are semantically unique events. They only appear once or twice and are contextually different from everything else. Rather than searching for keywords or counting error frequencies, semantic anomaly detection looks for uniqueness in a transformer model's embedding space. Logs that are semantically similar cluster together; logs that are unusual stand alone.
Finding semantic meaning in logs
My approach to determining which parts of a log file are semantically unique starts with embedding. The log file is chunked into "windows" of a set number of lines that you provide. Each window should roughly fill, but not exceed, the max token limit of the chosen embedding model. I chose all-MiniLM-L6-v2 as the default model. It has a max token limit of 256, so I pick a window size that packs in as much information as possible without surpassing that limit. A window size of four lines works well for most tests.
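To make that step concrete, here is a minimal sketch of non-overlapping four-line windowing and embedding with the sentence-transformers library. The function and variable names are illustrative, not Cordon's actual implementation.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

WINDOW_SIZE = 4  # lines per window, chosen to stay under the model's 256-token limit


def chunk_into_windows(log_path: str, window_size: int = WINDOW_SIZE) -> list[str]:
    """Split a log file into fixed-size, non-overlapping windows of lines."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        lines = [line.rstrip("\n") for line in f]
    return [
        "\n".join(lines[i:i + window_size])
        for i in range(0, len(lines), window_size)
    ]


model = SentenceTransformer("all-MiniLM-L6-v2")
windows = chunk_into_windows("app.log")  # hypothetical input file
# One embedding vector per window (384 dimensions for this model).
embeddings = model.encode(windows, batch_size=64, show_progress_bar=False)
```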
After chunking the logs, the system converts each window to a vector that captures its semantic meaning. Without getting too far into the weeds, semantically similar text tends to produce a nearby vector, even if the exact wording differs. This means the uniqueness of a log entry can be measured by its distance from other log entries in this high-dimensional space. For example:
- `Connection timeout after 30 seconds` and `Request timed out: no response` produce similar embeddings.
- `OutOfMemoryError: Java heap space` produces a very different embedding from the connection errors.
Measuring semantic distance: cosine similarity and distance
To quantify how different two log entries are, this approach uses cosine similarity, a measure of the angle between two vectors in embedding space. Imagine two arrows pointing from the origin in vector space. If the arrows point in nearly the same direction (small angle between them), the texts are semantically similar. If they point in very different directions (large angle), the texts have different meanings.
Cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. The result ranges from -1 (opposite) to 1 (identical), though text embeddings typically fall between 0 and 1. A value of 0.95 means the texts are very similar; a value of 0.3 means they're quite different.
For anomaly detection, cosine distance is used instead (1 - cosine_similarity). This flips the scale so that higher values indicate a greater difference.
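As an illustration, here is a small sketch that computes cosine similarity and distance for the example messages above, using the same formula (dot product divided by the product of magnitudes). The exact values you see will depend on the model version.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of two vectors divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


timeout_1, timeout_2, oom = model.encode([
    "Connection timeout after 30 seconds",
    "Request timed out: no response",
    "OutOfMemoryError: Java heap space",
])

# Similar meanings -> high similarity, low cosine distance.
sim = cosine_similarity(timeout_1, timeout_2)
print(f"timeout vs timeout: similarity={sim:.2f}, distance={1 - sim:.2f}")

# Different meanings -> lower similarity, higher cosine distance.
sim = cosine_similarity(timeout_1, oom)
print(f"timeout vs OOM:     similarity={sim:.2f}, distance={1 - sim:.2f}")
```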
The detection methodology: k-NN density scoring
For each window embedding, the algorithm finds its k nearest neighbors (default k=5) and calculates the average distance to those neighbors. This is the anomaly score.
- Low score (small average distance) = Dense cluster = Normal/repetitive logs
- High score (large average distance) = Isolated point = Anomalous logs
Mathematically, the anomaly score S for a window x is:
S(x) = (1/k) × Σ distance(x, neighbor_i), for i = 1 to k

In plain terms, that means: if k=5, the algorithm finds the five windows most similar to the current one, measures how far away each of them is, and averages those five distances. That average is the anomaly score. A window surrounded by very similar neighbors (small distances) gets a low score. A window with no close matches (large distances to even its nearest neighbors) gets a high score.
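A minimal sketch of that scoring step, assuming the window embeddings from the earlier sketch and using scikit-learn's NearestNeighbors with a cosine metric (Cordon's own internals may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_anomaly_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Average cosine distance from each window to its k nearest neighbors."""
    # Request k + 1 neighbors because each point is its own nearest neighbor
    # (distance 0), which we discard.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    return distances[:, 1:].mean(axis=1)


# embeddings: the window vectors produced earlier.
scores = knn_anomaly_scores(embeddings, k=5)
# The highest-scoring windows are the most semantically isolated candidates.
most_anomalous = np.argsort(scores)[::-1][:10]
```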
Benchmark
To evaluate the approach, I conducted benchmarks using the HDFS v1 dataset from Loghub. This log file contains 11.1 million lines of Hadoop Distributed File System production logs with 575,000 sessions and a 2.93% anomaly rate. This dataset has pre-labeled anomalies and 29 unique event templates (stored in separate files) that are used as a grading rubric.
Rather than using line-level metrics (Precision, Recall, F1), the evaluation focused on template-based metrics that measure the diversity of anomaly types detected. This aligns with the tool's design goal: finding semantically unique patterns, not counting every instance of repetitive errors. The metrics I wanted to measure were:
- Template recall: Fraction of unique anomaly types detected
- Rare template recall: Detection rate of templates appearing fewer than 100 times
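To make the metrics concrete, here is a hedged sketch of how they could be computed, assuming each detected window has been mapped back to its event template and that per-template occurrence counts are available (the actual benchmark harness lives in the project repo):

```python
def template_recall(detected_templates: set[str],
                    anomaly_templates: set[str],
                    template_counts: dict[str, int],
                    rare_threshold: int = 100) -> tuple[float, float]:
    """Fraction of unique anomaly templates detected, overall and for rare ones."""
    found = detected_templates & anomaly_templates
    recall = len(found) / len(anomaly_templates)

    # Rare templates: those appearing fewer than `rare_threshold` times overall.
    rare = {t for t in anomaly_templates if template_counts[t] < rare_threshold}
    rare_recall = len(found & rare) / len(rare) if rare else 1.0
    return recall, rare_recall
```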
Results and findings
The dominant factor in detection accuracy is sample size. But why?
The HDFS dataset has a specific characteristic that amplifies this effect: it contains only 29 unique event templates across 11 million lines. This low semantic diversity means that small random samples might miss entire template types or capture non-representative distributions. When sampling 50,000 lines from 11 million, which specific portion gets sampled matters enormously—a template that appears only 5 times in the full dataset might not appear at all in a small sample.
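A rough back-of-the-envelope calculation illustrates the effect, assuming a uniform random sample of individual lines (a simplification of the real sampling scheme):

```python
# Chance that a template appearing only 5 times in 11.1M lines shows up at
# least once in a uniform random sample of 50,000 lines.
sample_fraction = 50_000 / 11_100_000      # ~0.0045
occurrences = 5
p_seen = 1 - (1 - sample_fraction) ** occurrences
print(f"{p_seen:.1%}")  # roughly 2%, so the template is almost always missed
```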
This suggests more data genuinely helps because it provides better coverage of the semantic landscape. Also, this variance might be specific to highly structured, repetitive logs like HDFS. Application logs with greater semantic diversity per line, where each line is more likely to be unique, can show better stability at smaller sample sizes because each sample naturally captures more variety.
Testing with 10 runs per configuration (see a more detailed analysis of the results in the project repo):
| Sample size | Template recall | Rare template recall | Coefficient of variation |
|---|---|---|---|
| 50,000 lines | 58.3% ± 19.4% | 40.9% ± 31.5% | 33.2% |
| 100,000 lines | 66.8% ± 13.2% | 46.3% ± 31.5% | 19.7% |
| 250,000 lines | 76.7% ± 16.3% | 63.7% ± 31.1% | 21.3% |
| 500,000 lines | 84.0% ± 12.9% | 64.6% ± 33.4% | 15.4% |
| 1,000,000 lines | 93.7% ± 5.2% | 84.4% ± 14.8% | 5.6% |
| 5,000,000 lines | 96.6% | 90.0% | N/A (single run) |
With the 2% threshold (anomaly_percentile=0.02), the tool achieves a 98% reduction in output size while maintaining high template recall on large samples:
- 1 million lines → ~20K lines (98% reduction)
- 5 million lines → ~100K lines (98% reduction)
This makes it practical to reduce massive log files to a size that fits within LLM context windows while preserving the semantically unique content.
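As an illustration of the reduction step, here is a minimal sketch that keeps only the windows whose anomaly scores fall in the top 2% (mirroring the anomaly_percentile=0.02 setting above; the names and the reuse of `windows` and `scores` from the earlier sketches are assumptions):

```python
import numpy as np


def keep_top_percentile(windows: list[str],
                        scores: np.ndarray,
                        anomaly_percentile: float = 0.02) -> list[str]:
    """Keep the most anomalous fraction of windows, preserving original order."""
    cutoff = np.quantile(scores, 1.0 - anomaly_percentile)
    return [w for w, s in zip(windows, scores) if s >= cutoff]


reduced = keep_top_percentile(windows, scores)
# e.g. 1M lines in 4-line windows = 250K windows; the top 2% is ~5K windows,
# or roughly 20K lines of high-signal content for an LLM.
```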
Practical applications
This approach is useful in specific scenarios:
- LLM pre-processing: Reduce massive logs to just the anomalous sections before sending to an LLM. Instead of hitting context limits, send kilobytes of high-signal content.
- Initial triage: When investigating unfamiliar logs, surface what's semantically unusual without knowing what to search for.
- Exploratory analysis: Discover unexpected patterns like rare errors, unusual state transitions, and one-off events that would otherwise be buried in noise.
Limitations and trade-offs
This approach is intentionally lossy:
- Repetitive errors are filtered: If the same critical error appears 500 times, it will score low (normal) because it clusters with itself. For counting error frequencies, use traditional tools.
- Relative, not absolute: The percentile threshold/range is relative to each log file. What's "anomalous" in one log might not be in another.
- Sample size matters: On highly structured logs like HDFS, small samples (< 250,000 lines) show high variance. Larger samples produce more stable results.
- Not for compliance: Since this filters aggressively, don't use it for compliance logging or when complete audit trails are required.
- Not for known issues: If you know what error you're looking for, `grep` is faster and more precise.
Conclusion
Transformer embeddings, cosine distance, and k-NN density scoring enable semantic anomaly detection that understands meaning, not just keywords. The technique automatically adapts to any log format and surfaces the semantically unique content without the need for pre-defined patterns or configuration.
Cordon is an open source tool and Python library. See the following links for more information: