Benchmarks
The same datasets the academic log-parsing literature uses (Hadoop, HDFS, and Spark, via LogHub-2.0), compared head-to-head with LibreLog (a Llama-3-8B baseline). All numbers are reproducible from the open eval harness.
What we measure

Compression: Every line you don't have to send to an LLM is money saved and context preserved. Higher is better. gzip ≈ 6×, severity filtering ≈ 3×, codag > 40× (lossy).

Evidence recall: On labeled incidents, did the trigger and root-cause lines actually survive? Compression without recall is just deletion. Measured per role: trigger, root_cause, evidence.

Warm latency: Wall-clock time per incident, end to end. Includes network round-trip for hosted baselines and raw compute for local ones. Reported as p50 and p95.
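Concretely, the two headline scores reduce to a few lines of Python. This is a minimal sketch assuming a hypothetical label format (`LabeledIncident` and its fields are illustrative; only the role names come from the list above), not the harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LabeledIncident:
    # Hypothetical label format; the harness's real schema may differ.
    lines: list[str]        # raw log window
    roles: dict[int, str]   # line index -> "trigger" | "root_cause" | "evidence"

def compression_ratio(raw: list[str], kept: list[str]) -> float:
    """Bytes in over bytes out; 6x means you ship 1/6 of the raw window."""
    return sum(len(l) for l in raw) / max(1, sum(len(l) for l in kept))

def recall_per_role(incident: LabeledIncident, kept_idx: set[int]) -> dict[str, float]:
    """Fraction of labeled lines of each role that survived compression."""
    recall: dict[str, float] = {}
    for role in ("trigger", "root_cause", "evidence"):
        labeled = [i for i, r in incident.roles.items() if r == role]
        if labeled:
            recall[role] = sum(i in kept_idx for i in labeled) / len(labeled)
    return recall
```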
Codag vs. LibreLog (Llama-3-8B)
5 wins · 1 tie · 2 losses · a 5× smaller model, trained in 5h on a Mac. GA and PA score individual messages (grouping and parsed template, respectively); FGA and FTA are the corresponding group-level F1 scores (standard LogHub-2.0 metrics). The table lists the five wins and the tie.
| Dataset | Metric | LibreLog | Codag | Δ |
|---|---|---|---|---|
| Hadoop | FTA | 0.702 | 0.753 | +5.1pp |
| Hadoop | FGA | 0.901 | 0.938 | +3.7pp |
| HDFS | PA | 0.918 | 0.988 | +7.0pp |
| HDFS | FTA | 0.777 | 0.800 | +2.3pp |
| HDFS | GA | 1.000 | 1.000 | tie |
| Spark | FGA | 0.936 | 0.978 | +4.2pp |
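A minimal sketch of the two grouping metrics, following the standard LogHub-2.0 definitions (illustrative only, not code from either harness):

```python
from collections import defaultdict

def _groups(labels: list[str]) -> dict[str, set[int]]:
    """Map each template label to the set of message indices it covers."""
    g: dict[str, set[int]] = defaultdict(set)
    for i, lab in enumerate(labels):
        g[lab].add(i)
    return g

def grouping_accuracy(truth: list[str], pred: list[str]) -> float:
    """GA: a message is correct iff its predicted cluster contains exactly
    the same set of messages as its ground-truth cluster."""
    t, p = _groups(truth), _groups(pred)
    return sum(t[a] == p[b] for a, b in zip(truth, pred)) / len(truth)

def f1_grouping_accuracy(truth: list[str], pred: list[str]) -> float:
    """FGA: the same exact-match test, scored as F1 over groups
    instead of over messages."""
    t = {frozenset(s) for s in _groups(truth).values()}
    p = {frozenset(s) for s in _groups(pred).values()}
    tp = len(t & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(t)
    return 2 * prec * rec / (prec + rec)
```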
Evidence recall vs. prior art
On labeled incident windows. Higher = more diagnostic lines preserved.
| Approach | Evidence recall | Trigger recall | Compression |
|---|---|---|---|
| Severity filter | 0.30 | 0.12 | 3.4× |
| Drain3 templating | 0.29 | 0.18 | 9.9× |
| TF-IDF anomaly | 0.15 | 0.08 | 14.2× |
| Codag | 0.619 | 0.667 | 7.4× (45.6× compact) |
Reproducible from the open repo: Drain3, gzip / lz4, severity filter, Claude Opus 4.6, and codag (with an API key). The LibreLog comparison requires their published model weights and is not bundled — see codag/codag-log-bench for the harness.
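For reference, a templating baseline in the style of the Drain3 row takes only a few lines with the real drain3 package: keep one exemplar line per template cluster. A labeled trigger line then survives only if it happens to be the first of its cluster, which is why templating compresses well but recalls poorly. (A sketch using Drain3's default config, not the harness's exact settings.)

```python
from drain3 import TemplateMiner  # pip install drain3

def drain3_baseline(lines: list[str]) -> tuple[list[str], float]:
    """Keep the first line seen for each Drain3 cluster.
    Returns the kept lines and the resulting compression ratio."""
    miner = TemplateMiner()
    kept: list[str] = []
    seen: set[int] = set()
    for line in lines:
        result = miner.add_log_message(line)
        if result["cluster_id"] not in seen:
            seen.add(result["cluster_id"])
            kept.append(line)  # first occurrence stands in for the cluster
    ratio = sum(map(len, lines)) / max(1, sum(map(len, kept)))
    return kept, ratio
```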
Reproduce
One repo, four open-source baselines + the codag API. LogHub-2.0 fetched on demand, ~30 hand-labeled incidents bundled. Results land in results/latest.json.
```
$ git clone https://github.com/codag/codag-log-bench
$ cd codag-log-bench && bash scripts/download_loghub.sh
$ CODAG_API_KEY=cdk_… python -m codag_log_bench.run --baselines all
```
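The results file is plain JSON. Its exact schema isn't documented here, so this reader assumes a simple {baseline: {metric: value}} layout for illustration:

```python
import json
from pathlib import Path

# Assumed layout: {"drain3": {"compression": 9.9, ...}, ...};
# check results/latest.json for the real schema.
results = json.loads(Path("results/latest.json").read_text())
for baseline, metrics in sorted(results.items()):
    row = ", ".join(f"{k}={v}" for k, v in metrics.items())
    print(f"{baseline}: {row}")
```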
Want to run the benchmarks against your own log corpus?
Get in touch