Benchmarks
The same datasets the academic log-parsing literature uses (Hadoop, HDFS, and Spark, via LogHub-2.0), compared head-to-head with LibreLog (a Llama-3-8B baseline). All numbers are reproducible from the open eval harness.
What we measure

Compression: Every line you don't have to send to an LLM is money saved and context preserved. Higher is better. gzip ≈ 6×, severity filtering ≈ 3×, codag > 40× (lossy).

Evidence recall: On labeled incidents, did the trigger and root-cause lines actually survive? Compression without recall is just deletion. Measured per role: trigger, root_cause, evidence.

Warm latency: Wall-clock time per incident, end to end. Includes network round-trip for hosted baselines and raw compute for local ones. Reported as p50 and p95.
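Concretely, the two headline scores reduce to a few lines of Python. This is a minimal sketch assuming a hypothetical label format (`LabeledIncident` and its fields are illustrative; only the role names come from the list above), not the harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LabeledIncident:
    # Hypothetical label format; the harness's real schema may differ.
    lines: list[str]        # raw log window
    roles: dict[int, str]   # line index -> "trigger" | "root_cause" | "evidence"

def compression_ratio(raw: list[str], kept: list[str]) -> float:
    """Bytes in over bytes out; 6x means you ship 1/6 of the raw window."""
    return sum(len(l) for l in raw) / max(1, sum(len(l) for l in kept))

def recall_per_role(incident: LabeledIncident, kept_idx: set[int]) -> dict[str, float]:
    """Fraction of labeled lines of each role that survived compression."""
    recall: dict[str, float] = {}
    for role in ("trigger", "root_cause", "evidence"):
        labeled = [i for i, r in incident.roles.items() if r == role]
        if labeled:
            recall[role] = sum(i in kept_idx for i in labeled) / len(labeled)
    return recall
```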
Codag vs. LibreLog (Llama-3-8B)
5 wins · 1 tie · 2 losses · a 5× smaller model, trained in 5h on a Mac. GA and PA score individual messages (grouping and parsed template, respectively); FGA and FTA are the corresponding group-level F1 scores (standard LogHub-2.0 metrics). The table lists the five wins and the tie.
| Dataset | Metric | LibreLog | Codag | Δ |
|---|---|---|---|---|
| Hadoop | FTA | 0.702 | 0.753 | +5.1pp |
| Hadoop | FGA | 0.901 | 0.938 | +3.7pp |
| HDFS | PA | 0.918 | 0.988 | +7.0pp |
| HDFS | FTA | 0.777 | 0.800 | +2.3pp |
| HDFS | GA | 1.000 | 1.000 | tie |
| Spark | FGA | 0.936 | 0.978 | +4.2pp |
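A minimal sketch of the two grouping metrics, following the standard LogHub-2.0 definitions (illustrative only, not code from either harness):

```python
from collections import defaultdict

def _groups(labels: list[str]) -> dict[str, set[int]]:
    """Map each template label to the set of message indices it covers."""
    g: dict[str, set[int]] = defaultdict(set)
    for i, lab in enumerate(labels):
        g[lab].add(i)
    return g

def grouping_accuracy(truth: list[str], pred: list[str]) -> float:
    """GA: a message is correct iff its predicted cluster contains exactly
    the same set of messages as its ground-truth cluster."""
    t, p = _groups(truth), _groups(pred)
    return sum(t[a] == p[b] for a, b in zip(truth, pred)) / len(truth)

def f1_grouping_accuracy(truth: list[str], pred: list[str]) -> float:
    """FGA: the same exact-match test, scored as F1 over groups
    instead of over messages."""
    t = {frozenset(s) for s in _groups(truth).values()}
    p = {frozenset(s) for s in _groups(pred).values()}
    tp = len(t & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(t)
    return 2 * prec * rec / (prec + rec)
```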
Evidence recall vs. prior art
On labeled incident windows. Higher = more diagnostic lines preserved.
| Approach | Evidence recall | Trigger recall | Compression |
|---|---|---|---|
| Severity filter | 0.30 | 0.12 | 3.4× |
| Drain3 templating | 0.29 | 0.18 | 9.9× |
| TF-IDF anomaly | 0.15 | 0.08 | 14.2× |
| Codag | 0.619 | 0.667 | 7.4× (45.6× compact) |
Reproducible from the open repo: Drain3, gzip / lz4, severity filter, Claude Opus 4.6, and codag (with an API key). The LibreLog comparison requires their published model weights and is not bundled — see codag/codag-log-bench for the harness.
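For reference, a templating baseline in the style of the Drain3 row takes only a few lines with the real drain3 package: keep one exemplar line per template cluster. A labeled trigger line then survives only if it happens to be the first of its cluster, which is why templating compresses well but recalls poorly. (A sketch using Drain3's default config, not the harness's exact settings.)

```python
from drain3 import TemplateMiner  # pip install drain3

def drain3_baseline(lines: list[str]) -> tuple[list[str], float]:
    """Keep the first line seen for each Drain3 cluster.
    Returns the kept lines and the resulting compression ratio."""
    miner = TemplateMiner()
    kept: list[str] = []
    seen: set[int] = set()
    for line in lines:
        result = miner.add_log_message(line)
        if result["cluster_id"] not in seen:
            seen.add(result["cluster_id"])
            kept.append(line)  # first occurrence stands in for the cluster
    ratio = sum(map(len, lines)) / max(1, sum(map(len, kept)))
    return kept, ratio
```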
Reproduce
One repo, four open-source baselines + the codag API. LogHub-2.0 fetched on demand, ~30 hand-labeled incidents bundled. Results land in results/latest.json.
```
$ git clone https://github.com/codag/codag-log-bench
$ cd codag-log-bench && bash scripts/download_loghub.sh
$ CODAG_API_KEY=cdk_… python -m codag_log_bench.run --baselines all
```
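The results file is plain JSON. Its exact schema isn't documented here, so this reader assumes a simple {baseline: {metric: value}} layout for illustration:

```python
import json
from pathlib import Path

# Assumed layout: {"drain3": {"compression": 9.9, ...}, ...};
# check results/latest.json for the real schema.
results = json.loads(Path("results/latest.json").read_text())
for baseline, metrics in sorted(results.items()):
    row = ", ".join(f"{k}={v}" for k, v in metrics.items())
    print(f"{baseline}: {row}")
```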
Want to run the benchmarks against your own log corpus?
Get in touch