Regression-Tested Extraction You Can Audit
Primary obligation types—renewal, termination notice, and payment terms—score 1.00 precision and recall on our 88-example internal regression suite (labeled clauses, gold documents, and adversarial cases). A separate two-document held-out stress test still surfaces occasional type mismatches; human review and confidence routing cover real-world variance.
Evaluation methodology
Rigorous testing before every release
ClauseMinds uses a dedicated evaluation harness with labeled fixtures, gold documents, and adversarial examples. CI enforces a baseline so regressions are caught before release. Public numbers refer to that suite; they are not a claim about every customer PDF.
- Labeled clause examples for all primary types
- Gold-standard documents with known obligations
- Adversarial edge cases (ambiguous language, nested clauses, multi-page terms)
- Held-out real documents evaluated separately
- Confidence calibration verified across bands
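The core check such a harness runs can be sketched as a precision/recall computation over labeled fixtures. This is a minimal illustration, not ClauseMinds internals: the fixture shape and document names are assumptions.

```python
from typing import Set, Tuple

def precision_recall(predicted: Set[Tuple[str, str]],
                     expected: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Score (doc_id, obligation_type) pairs against labeled gold fixtures."""
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(expected) if expected else 1.0
    return precision, recall

# Hypothetical gold fixture: one document with two labeled obligations.
gold = {("acme_msa", "renewal"), ("acme_msa", "payment_term")}
found = {("acme_msa", "renewal"), ("acme_msa", "payment_term")}
```

A missed obligation lowers recall; a spurious match lowers precision, which is what the suite-level 1.00 figures assert for the labeled fixtures.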
| Metric | Value |
|---|---|
| Overall P/R/F1 | 1.00 (regression suite) |
| renewal | 1.00 |
| termination_notice | 1.00 |
| payment_term | 1.00 |
| Examples passed | 88/88 |
| False positives / false negatives | 0 / 0 (same suite) |
| Held-out (2 docs) | 2 type-mismatch FPs; see /accuracy |
Why it matters
Six pillars of extraction quality
From precision and recall to CI enforcement and human review—every layer is designed so you can trust what ClauseMinds extracts.
Precision on the suite
On the 88-example regression suite, precision is 1.00 for primary types—no spurious matches against our labeled expectations. In production, low-confidence and edge cases route to review instead of silently passing.
Recall on the suite
On the same suite, recall is 1.00 against labeled gold: expected obligations are not missed in those fixtures. Messy real contracts can still require review; the suite is a safety rail, not a universal guarantee.
Confidence calibration
Within the regression suite, medium_high and high bands hit precision 1.00 on scored candidates. We still treat confidence as guidance, not proof, for high-stakes obligations.
Multi-Format Coverage
PDF, DOCX, TXT, HTML, XLSX, and scanned images with OCR. Native text and OCR fallback both evaluated.
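One plausible shape for the OCR-fallback decision is sketched below; the extension lists come from the formats named above, but the function and its trigger conditions are illustrative assumptions.

```python
SUPPORTED_NATIVE = {".pdf", ".docx", ".txt", ".html", ".xlsx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff"}

def needs_ocr(filename: str, native_text: str) -> bool:
    """OCR scanned images, and fall back when native extraction is empty."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    return ext in IMAGE_EXTENSIONS or not native_text.strip()
```

Evaluating both paths matters because a scanned PDF can pass format detection yet yield no native text layer.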
Human Review Safety Net
Low-confidence extractions always route to human review before tracking. No obligation is finalized without explicit approval.
CI-Enforced Baseline
Regression tests run on every change to prevent accuracy degradation. If the suite fails, the build fails.
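The "suite fails, build fails" rule amounts to a baseline comparison gating the release. A minimal sketch, with hypothetical metric names and floors:

```python
# Enforced floors; a real pipeline would load these from versioned config.
BASELINE = {"precision": 1.0, "recall": 1.0, "passed": 88}

def meets_baseline(run: dict) -> bool:
    """A release build passes only if every metric meets its floor."""
    return all(run.get(metric, 0) >= floor for metric, floor in BASELINE.items())
```

In CI, a `False` result maps to a nonzero exit code, which is what actually fails the build.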
Safety net
Human review as the final gate
Even when the harness is green, ClauseMinds requires human review for low-confidence and high-consequence items. No obligation is tracked without explicit approval. Automation proves we did not regress on labeled cases; people handle the long tail.
- Low-confidence extractions always route to review
- High-consequence items require explicit approval
- No obligation tracked without human sign-off
- Audit trail for every accept/edit/reject
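The routing policy described above can be sketched as a small decision function; the band names, consequence set, and queue labels are illustrative, not the published API.

```python
# Hypothetical set of obligation types treated as high-consequence.
HIGH_CONSEQUENCE = {"termination_notice"}

def route(obligation_type: str, band: str) -> str:
    """Low confidence or high consequence -> reviewer queue; everything
    else still waits on explicit approval before it is tracked."""
    if band == "low" or obligation_type in HIGH_CONSEQUENCE:
        return "human_review"
    return "pending_approval"
```

Note that neither branch tracks an obligation automatically: both end at a human sign-off, matching the "no obligation tracked without approval" rule.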
Example items in the review queue:

| Confidence | Clause | Document |
|---|---|---|
| medium | Renewal Notice | Acme MSA |
| low | Termination Period | TechCo License |
| medium | Payment Term | DataInc SaaS |
Extractions you can trust and verify
1.00 precision and recall on our 88-example regression suite, honest reporting on held-out stress tests, and human review when confidence is thin. That is how we make accuracy claims you can check.