# Extraction Accuracy

These figures come from our internal evaluation harness, not from a third-party audit of live customer contracts. The harness scores extraction against curated fixtures (labeled clauses, multi-clause documents, adversarial examples, and a small set of held-out real excerpts). CI enforces baselines so regressions are caught before release. Low-confidence candidates still require human review before anything is tracked in the product.

Last validated: reproduce these figures by running `python -m app.eval` from the backend.

## Regression suite (labeled + gold + adversarial)

This is the combined set we use to block shipping obvious extraction regressions. It does not prove perfect accuracy on every contract in the wild; it proves the engine matches our labeled expectations on these cases.
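The baseline enforcement described above can be sketched as a simple gate: compare the run's metrics against stored floors and fail CI if any metric drops. The names here (`check_regressions`, `BASELINES`) are illustrative, not the harness's actual API.

```python
# Sketch of a CI baseline gate for extraction metrics.
# BASELINES would be the committed floors the suite must meet.
BASELINES = {"precision": 1.0, "recall": 1.0, "f1": 1.0}

def check_regressions(metrics: dict, baselines: dict = BASELINES) -> list:
    """Return the names of metrics that fell below their baseline floor."""
    return [name for name, floor in baselines.items()
            if metrics.get(name, 0.0) < floor]

# A passing run produces no failures; CI would assert this and block otherwise.
failures = check_regressions({"precision": 1.0, "recall": 1.0, "f1": 1.0})
assert not failures, f"Extraction regressions: {failures}"
```

A gate like this is deliberately strict: any metric below its floor blocks the release rather than averaging out against other metrics.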

| Metric | Value |
| --- | --- |
| Overall | Precision 1.00, recall 1.00, F1 1.00 |
| Per type (primary) | renewal, termination_notice, payment_term: 1.00 / 1.00 / 1.00 |
| Examples passed | 88/88 |
| False positives / false negatives | 0 / 0 (on this suite) |
| Document-level | Documents with FP: 0, with FN: 0; avg precision and recall per document: 1.00 |
| Confidence calibration | medium_high and high bands: precision 1.00 (on candidates scored in those bands in this suite) |
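The calibration figure above is a per-band precision: for each confidence band, the fraction of candidates in that band that were correct. A minimal sketch, assuming each scored candidate carries a band label and a correctness flag (both field names hypothetical):

```python
# Sketch: precision per confidence band. "band" and "correct" are
# illustrative field names, not the harness's actual schema.
from collections import defaultdict

def precision_by_band(candidates: list) -> dict:
    """Map each confidence band to the fraction of correct candidates in it."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c in candidates:
        totals[c["band"]] += 1
        hits[c["band"]] += int(c["correct"])
    return {band: hits[band] / totals[band] for band in totals}

cands = [
    {"band": "high", "correct": True},
    {"band": "high", "correct": True},
    {"band": "medium_high", "correct": True},
]
# precision_by_band(cands) -> {"high": 1.0, "medium_high": 1.0}
```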

## Held-out real-document fixtures

Two full-document excerpts live in held_out_real_documents.json. We evaluate them separately from the regression suite so they act as a small generalization check rather than something the model can overfit.

| Metric | Value |
| --- | --- |
| Documents | 2 (held_out_real_documents.json) |
| Latest run (held-out only) | 2 wrong-type false positives (extractions that did not match expected obligation types); 0 false negatives on expected obligations; recall 1.00 on labeled expectations |
| Gold document fixtures | 2 multi-clause examples in gold_documents.json (included in the 88 regression examples above) |

## Scope

Metrics above apply to primary obligation types: renewal clauses, termination notice periods, and payment terms. Supported ingest formats include PDF, DOCX, TXT, HTML, XLSX, and images. OCR is used for scanned pages; confidence is often lower on poor scans, which pushes more work to review.
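The review routing mentioned above can be sketched as a simple confidence cutoff: confident candidates are tracked, everything else goes to a human queue. The threshold value and field names are illustrative assumptions, not the product's actual configuration.

```python
# Sketch of confidence-based routing. REVIEW_THRESHOLD is a hypothetical
# cutoff; the real product's threshold and candidate schema may differ.
REVIEW_THRESHOLD = 0.75

def route(candidate: dict) -> str:
    """Return 'track' for confident extractions, 'review' otherwise."""
    return "track" if candidate["confidence"] >= REVIEW_THRESHOLD else "review"

assert route({"confidence": 0.92}) == "track"
assert route({"confidence": 0.40}) == "review"  # e.g. a poor OCR scan
```

Because poor scans lower OCR confidence, this kind of cutoff is what "pushes more work to review" on scanned documents.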

## Methodology

We label expected extractions per fixture, run the same extractor the product uses, and compute precision, recall, and F1 by obligation type. Adversarial cases cover ambiguous wording, nested clauses, and lookalike text. When the full suite is combined with held-out files in one run, overall precision drops slightly because of those held-out false positives; that is why we report the regression suite and the held-out slice separately. For implementation detail, see backend/app/eval/README.md in the repository.
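The scoring step can be sketched as a set comparison between labeled expectations and extractor output per obligation type. The `prf1` helper and the data shapes are illustrative; see backend/app/eval/README.md for the real implementation.

```python
# Sketch of per-fixture scoring: precision, recall, and F1 computed from
# expected vs. predicted extraction sets. Shapes are illustrative only.
def prf1(expected: set, predicted: set) -> tuple:
    """Score predicted extractions against labeled expectations."""
    tp = len(expected & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

expected = {("renewal", "auto-renews for 12 months")}
predicted = {("renewal", "auto-renews for 12 months")}
# prf1(expected, predicted) -> (1.0, 1.0, 1.0)
```

A wrong-type extraction (as in the held-out runs above) counts as a false positive here: it appears in `predicted` but matches nothing in `expected`.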
