Regression-Tested Extraction You Can Audit
Primary obligation types—renewal, termination notice, and payment terms—score 1.00 precision and recall on our 88-example internal regression suite (labeled clauses, gold documents, and adversarial cases). A separate two-document held-out stress test still surfaces occasional type mismatches; human review and confidence routing cover real-world variance.
Evaluation methodology
Rigorous testing before every release
ClauseMinds uses a dedicated evaluation harness with labeled fixtures, gold documents, and adversarial examples. CI enforces a baseline so regressions are caught before release. Public numbers refer to that suite; they are not a claim about every customer PDF.
- Labeled clause examples for all primary types
- Gold-standard documents with known obligations
- Adversarial edge cases (ambiguous language, nested clauses, multi-page terms)
- Held-out real documents evaluated separately
- Confidence calibration verified across bands
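The core check such a harness runs can be sketched as a precision/recall computation over labeled fixtures. This is a minimal illustration, not ClauseMinds internals: the fixture shape and document names are assumptions.

```python
from typing import Set, Tuple

def precision_recall(predicted: Set[Tuple[str, str]],
                     expected: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Score (doc_id, obligation_type) pairs against labeled gold fixtures."""
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(expected) if expected else 1.0
    return precision, recall

# Hypothetical gold fixture: one document with two labeled obligations.
gold = {("acme_msa", "renewal"), ("acme_msa", "payment_term")}
found = {("acme_msa", "renewal"), ("acme_msa", "payment_term")}
```

A missed obligation lowers recall; a spurious match lowers precision, which is what the suite-level 1.00 figures assert for the labeled fixtures.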
| Metric | Value |
|---|---|
| Overall P/R/F1 | 1.00 (regression suite) |
| renewal | 1.00 |
| termination_notice | 1.00 |
| payment_term | 1.00 |
| Examples passed | 88/88 |
| False positives / false negatives | 0 / 0 (same suite) |
| Held-out (2 docs) | 2 type-mismatch FPs; see /accuracy |
Why it matters
Six pillars of extraction quality
From precision and recall to CI enforcement and human review—every layer is designed so you can trust what ClauseMinds extracts.
Precision on the suite
On the 88-example regression suite, precision is 1.00 for primary types—no spurious matches against our labeled expectations. In production, low-confidence and edge cases route to review instead of silently passing.
Recall on the suite
On the same suite, recall is 1.00 against labeled gold: expected obligations are not missed in those fixtures. Messy real contracts can still require review; the suite is a safety rail, not a universal guarantee.
Confidence calibration
Within the regression suite, medium_high and high bands hit precision 1.00 on scored candidates. We still treat confidence as guidance, not proof, for high-stakes obligations.
Multi-Format Coverage
PDF, DOCX, TXT, HTML, XLSX, and scanned images with OCR. Native text and OCR fallback both evaluated.
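One plausible shape for the OCR-fallback decision is sketched below; the extension lists come from the formats named above, but the function and its trigger conditions are illustrative assumptions.

```python
SUPPORTED_NATIVE = {".pdf", ".docx", ".txt", ".html", ".xlsx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff"}

def needs_ocr(filename: str, native_text: str) -> bool:
    """OCR scanned images, and fall back when native extraction is empty."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    return ext in IMAGE_EXTENSIONS or not native_text.strip()
```

Evaluating both paths matters because a scanned PDF can pass format detection yet yield no native text layer.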
Human Review Safety Net
Low-confidence extractions always route to human review before tracking. No obligation is finalized without explicit approval.
CI-Enforced Baseline
Regression tests run on every change to prevent accuracy degradation. If the suite fails, the build fails.
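The "suite fails, build fails" rule amounts to a baseline comparison gating the release. A minimal sketch, with hypothetical metric names and floors:

```python
# Enforced floors; a real pipeline would load these from versioned config.
BASELINE = {"precision": 1.0, "recall": 1.0, "passed": 88}

def meets_baseline(run: dict) -> bool:
    """A release build passes only if every metric meets its floor."""
    return all(run.get(metric, 0) >= floor for metric, floor in BASELINE.items())
```

In CI, a `False` result maps to a nonzero exit code, which is what actually fails the build.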
Safety net
Human review as the final gate
Even when the harness is green, ClauseMinds requires human review for low-confidence and high-consequence items. No obligation is tracked without explicit approval. Automation proves we did not regress on labeled cases; people handle the long tail.
- Low-confidence extractions always route to review
- High-consequence items require explicit approval
- No obligation tracked without human sign-off
- Audit trail for every accept/edit/reject
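The routing policy described above can be sketched as a small decision function; the band names, consequence set, and queue labels are illustrative, not the published API.

```python
# Hypothetical set of obligation types treated as high-consequence.
HIGH_CONSEQUENCE = {"termination_notice"}

def route(obligation_type: str, band: str) -> str:
    """Low confidence or high consequence -> reviewer queue; everything
    else still waits on explicit approval before it is tracked."""
    if band == "low" or obligation_type in HIGH_CONSEQUENCE:
        return "human_review"
    return "pending_approval"
```

Note that neither branch tracks an obligation automatically: both end at a human sign-off, matching the "no obligation tracked without approval" rule.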
Example items in the review queue:

| Confidence | Clause | Document |
|---|---|---|
| medium | Renewal Notice | Acme MSA |
| low | Termination Period | TechCo License |
| medium | Payment Term | DataInc SaaS |
Extractions you can trust and verify
1.00 precision and recall on our 88-example regression suite, honest reporting on held-out stress tests, and human review when confidence is thin. That is how we make accuracy claims you can check.