Production log files are often long and cluttered, filled with repetitive INFO entries, health checks, and routine operational messages. In a failure scenario, most of these lines tell you nothing about what went wrong. Verbose logging in production applications is valuable, but we often don't know what we're looking for in these large log files. I created Cordon to help human and AI operators use semantic anomaly detection to identify what is truly unusual.
In this post, I explain how Cordon finds semantically unique events in log files while filtering out the noise.
Repetition is boring, uniqueness is interesting
My experience debugging unfamiliar logs often follows this cycle:
- Search logs for errors.
- Find one or more errors/stack traces.
- Ask an engineer familiar with the system whether these errors are the problem.
- "Maybe, but we see these errors all the time." Or, "No, this is a known problem. It's something else."
This might not apply to every application, but it holds true for many I have encountered. If an engineer who is unfamiliar with which errors are considered "normal" has a hard time pinpointing an issue, then, context windows aside, how are large language models (LLMs) supposed to determine which errors are significant in large log files? They can't.
It might seem counterintuitive, but often a repetitive error can be considered "normal." If the same ERROR appears 10,000 times, it is often just considered background noise or a symptom of a larger problem. The real anomalies in a log file are semantically unique events. They only appear once or twice and are contextually different from everything else. Rather than searching for keywords or counting error frequencies, semantic anomaly detection looks for uniqueness in a transformer model's embedding space. Logs that are semantically similar cluster together; logs that are unusual stand alone.
Finding semantic meaning in logs
My approach to determining which parts of a log file are semantically unique starts with embedding. The log file is chunked into "windows" of a set number of lines that you provide. Each window should roughly fill, but not exceed, the max token limit of the chosen embedding model. I chose all-MiniLM-L6-v2 as the default model. It has a max token limit of 256, so I pick a window size that packs in as much information as possible without surpassing that limit. A window size of four lines works well for most tests.
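To make that step concrete, here is a minimal sketch of non-overlapping four-line windowing and embedding with the sentence-transformers library. The function and variable names are illustrative, not Cordon's actual implementation.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

WINDOW_SIZE = 4  # lines per window, chosen to stay under the model's 256-token limit


def chunk_into_windows(log_path: str, window_size: int = WINDOW_SIZE) -> list[str]:
    """Split a log file into fixed-size, non-overlapping windows of lines."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        lines = [line.rstrip("\n") for line in f]
    return [
        "\n".join(lines[i:i + window_size])
        for i in range(0, len(lines), window_size)
    ]


model = SentenceTransformer("all-MiniLM-L6-v2")
windows = chunk_into_windows("app.log")  # hypothetical input file
# One embedding vector per window (384 dimensions for this model).
embeddings = model.encode(windows, batch_size=64, show_progress_bar=False)
```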
After chunking the logs, the system converts each window to a vector that captures its semantic meaning. Without getting too far into the weeds, semantically similar text tends to produce a nearby vector, even if the exact wording differs. This means the uniqueness of a log entry can be measured by its distance from other log entries in this high-dimensional space. For example:
- `Connection timeout after 30 seconds` and `Request timed out: no response` produce similar embeddings.
- `OutOfMemoryError: Java heap space` produces a very different embedding from the connection errors.
Measuring semantic distance: cosine similarity and distance
To quantify how different two log entries are, this approach uses cosine similarity, a measure of the angle between two vectors in embedding space. Imagine two arrows pointing from the origin in vector space. If the arrows point in nearly the same direction (small angle between them), the texts are semantically similar. If they point in very different directions (large angle), the texts have different meanings.
Cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. The result ranges from -1 (opposite) to 1 (identical), though text embeddings typically fall between 0 and 1. A value of 0.95 means the texts are very similar; a value of 0.3 means they're quite different.
For anomaly detection, cosine distance is used instead (1 - cosine_similarity). This flips the scale so that higher values indicate a greater difference.
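As an illustration, here is a small sketch that computes cosine similarity and distance for the example messages above, using the same formula (dot product divided by the product of magnitudes). The exact values you see will depend on the model version.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of two vectors divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


timeout_1, timeout_2, oom = model.encode([
    "Connection timeout after 30 seconds",
    "Request timed out: no response",
    "OutOfMemoryError: Java heap space",
])

# Similar meanings -> high similarity, low cosine distance.
sim = cosine_similarity(timeout_1, timeout_2)
print(f"timeout vs timeout: similarity={sim:.2f}, distance={1 - sim:.2f}")

# Different meanings -> lower similarity, higher cosine distance.
sim = cosine_similarity(timeout_1, oom)
print(f"timeout vs OOM:     similarity={sim:.2f}, distance={1 - sim:.2f}")
```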
The detection methodology: k-NN density scoring
For each window embedding, the algorithm finds its k nearest neighbors (default k=5) and calculates the average distance to those neighbors. This is the anomaly score.
- Low score (small average distance) = Dense cluster = Normal/repetitive logs
- High score (large average distance) = Isolated point = Anomalous logs
Mathematically, the anomaly score S for a window x is:
S(x) = (1/k) × Σ distance(x, neighbor_i), for i = 1 to k

In plain terms, that means: if k=5, the algorithm finds the five windows most similar to the current one, measures how far away each of them is, and averages those five distances. That average is the anomaly score. A window surrounded by very similar neighbors (small distances) gets a low score. A window with no close matches (large distances to even its nearest neighbors) gets a high score.
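A minimal sketch of that scoring step, assuming the window embeddings from the earlier sketch and using scikit-learn's NearestNeighbors with a cosine metric (Cordon's own internals may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_anomaly_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Average cosine distance from each window to its k nearest neighbors."""
    # Request k + 1 neighbors because each point is its own nearest neighbor
    # (distance 0), which we discard.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    return distances[:, 1:].mean(axis=1)


# embeddings: the window vectors produced earlier.
scores = knn_anomaly_scores(embeddings, k=5)
# The highest-scoring windows are the most semantically isolated candidates.
most_anomalous = np.argsort(scores)[::-1][:10]
```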
Benchmark
To evaluate the approach, I conducted benchmarks using the HDFS v1 dataset from Loghub. This log file contains 11.1 million lines of Hadoop Distributed File System production logs with 575,000 sessions and a 2.93% anomaly rate. This dataset has pre-labeled anomalies and 29 unique event templates (stored in separate files) that are used as a grading rubric.
Rather than using line-level metrics (Precision, Recall, F1), the evaluation focused on template-based metrics that measure the diversity of anomaly types detected. This aligns with the tool's design goal: finding semantically unique patterns, not counting every instance of repetitive errors. The metrics I wanted to measure were:
- Template recall: Fraction of unique anomaly types detected
- Rare template recall: Detection rate of templates appearing fewer than 100 times
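To make the metrics concrete, here is a hedged sketch of how they could be computed, assuming each detected window has been mapped back to its event template and that per-template occurrence counts are available (the actual benchmark harness lives in the project repo):

```python
def template_recall(detected_templates: set[str],
                    anomaly_templates: set[str],
                    template_counts: dict[str, int],
                    rare_threshold: int = 100) -> tuple[float, float]:
    """Fraction of unique anomaly templates detected, overall and for rare ones."""
    found = detected_templates & anomaly_templates
    recall = len(found) / len(anomaly_templates)

    # Rare templates: those appearing fewer than `rare_threshold` times overall.
    rare = {t for t in anomaly_templates if template_counts[t] < rare_threshold}
    rare_recall = len(found & rare) / len(rare) if rare else 1.0
    return recall, rare_recall
```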
Results and findings
The dominant factor in detection accuracy is sample size. But why?
The HDFS dataset has a specific characteristic that amplifies this effect: it contains only 29 unique event templates across 11 million lines. This low semantic diversity means that small random samples might miss entire template types or capture non-representative distributions. When sampling 50,000 lines from 11 million, which specific portion gets sampled matters enormously—a template that appears only 5 times in the full dataset might not appear at all in a small sample.
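A rough back-of-the-envelope calculation illustrates the effect, assuming a uniform random sample of individual lines (a simplification of the real sampling scheme):

```python
# Chance that a template appearing only 5 times in 11.1M lines shows up at
# least once in a uniform random sample of 50,000 lines.
sample_fraction = 50_000 / 11_100_000      # ~0.0045
occurrences = 5
p_seen = 1 - (1 - sample_fraction) ** occurrences
print(f"{p_seen:.1%}")  # roughly 2%, so the template is almost always missed
```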
This suggests more data genuinely helps because it provides better coverage of the semantic landscape. Also, this variance might be specific to highly structured, repetitive logs like HDFS. Application logs with greater semantic diversity per line, where each line is more likely to be unique, can show better stability at smaller sample sizes because each sample naturally captures more variety.
Testing with 10 runs per configuration (see a more detailed analysis of the results in the project repo):
| Sample size | Template recall | Rare template recall | Coefficient of variation |
|---|---|---|---|
| 50,000 lines | 58.3% ± 19.4% | 40.9% ± 31.5% | 33.2% |
| 100,000 lines | 66.8% ± 13.2% | 46.3% ± 31.5% | 19.7% |
| 250,000 lines | 76.7% ± 16.3% | 63.7% ± 31.1% | 21.3% |
| 500,000 lines | 84.0% ± 12.9% | 64.6% ± 33.4% | 15.4% |
| 1,000,000 lines | 93.7% ± 5.2% | 84.4% ± 14.8% | 5.6% |
| 5,000,000 lines | 96.6% | 90.0% | N/A (single run) |
With the 2% threshold (anomaly_percentile=0.02), the tool achieves a 98% reduction in output size while maintaining high template recall on large samples:
- 1 million lines → ~20K lines (98% reduction)
- 5 million lines → ~100K lines (98% reduction)
This makes it practical to reduce massive log files to a size that fits within LLM context windows while preserving the semantically unique content.
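As an illustration of the reduction step, here is a minimal sketch that keeps only the windows whose anomaly scores fall in the top 2% (mirroring the anomaly_percentile=0.02 setting above; the names and the reuse of `windows` and `scores` from the earlier sketches are assumptions):

```python
import numpy as np


def keep_top_percentile(windows: list[str],
                        scores: np.ndarray,
                        anomaly_percentile: float = 0.02) -> list[str]:
    """Keep the most anomalous fraction of windows, preserving original order."""
    cutoff = np.quantile(scores, 1.0 - anomaly_percentile)
    return [w for w, s in zip(windows, scores) if s >= cutoff]


reduced = keep_top_percentile(windows, scores)
# e.g. 1M lines in 4-line windows = 250K windows; the top 2% is ~5K windows,
# or roughly 20K lines of high-signal content for an LLM.
```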
Practical applications
This approach is useful in specific scenarios:
- LLM pre-processing: Reduce massive logs to just the anomalous sections before sending to an LLM. Instead of hitting context limits, send kilobytes of high-signal content.
- Initial triage: When investigating unfamiliar logs, surface what's semantically unusual without knowing what to search for.
- Exploratory analysis: Discover unexpected patterns like rare errors, unusual state transitions, and one-off events that would otherwise be buried in noise.
Limitations and trade-offs
This approach is intentionally lossy:
- Repetitive errors are filtered: If the same critical error appears 500 times, it will score low (normal) because it clusters with itself. For counting error frequencies, use traditional tools.
- Relative, not absolute: The percentile threshold/range is relative to each log file. What's "anomalous" in one log might not be in another.
- Sample size matters: On highly structured logs like HDFS, small samples (< 250,000 lines) show high variance. Larger samples produce more stable results.
- Not for compliance: Since this filters aggressively, don't use it for compliance logging or when complete audit trails are required.
- Not for known issues: If you know what error you're looking for, `grep` is faster and more precise.
Conclusion
Transformer embeddings, cosine distance, and k-NN density scoring enable semantic anomaly detection that understands meaning, not just keywords. The technique automatically adapts to any log format and surfaces the semantically unique content without the need for pre-defined patterns or configuration.
Cordon is an open source tool and Python library. See the following links for more information: