Document data extraction
Reads structured and semi-structured documents that arrive in unstructured formats — invoices, purchase orders, bills of lading, claim forms, contracts, scanned paperwork — and extracts the fields the business cares about into structured records. Routes each extracted record through validation (does it match what we expected, do the totals add up, is the supplier known), then delivers to the system of record. High-confidence extractions land directly; low-confidence ones route to a human reviewer with the source document and the proposed values shown side by side. The pattern absorbs the data-entry work that currently consumes hours of skilled time every day in operations-heavy SMBs.
Requirements describe capabilities the pattern needs in your environment, not products you must buy from a particular vendor. Any system that provides the capability satisfies the requirement, which is what makes the catalog portable across the long tail of SMB tooling.
document_intake_channel
Where documents arrive for processing. Most clients have several channels feeding the same workflow.
- email inbox where suppliers send invoices
- shared drive folder where scanned documents land
- upload portal for customers or partners
- vendor portal that pushes documents via API
- OCR feed from a multi-function printer
extraction_schema_definitions
What fields to extract per document type. Different document types have different schemas; the pattern needs explicit definitions, not guessing.
- configuration maintained per document type by the engagement team
- small admin UI for the operations team to manage schemas
- templates derived from sample documents during build phase
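As an illustration of what an explicit per-document-type definition might look like, here is a minimal sketch. The class names, the field type labels, and the invoice fields are all hypothetical; a real engagement would derive them from sample documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str              # hypothetical type labels: "string", "date", "decimal"
    required: bool = True


@dataclass(frozen=True)
class ExtractionSchema:
    document_type: str
    fields: tuple[FieldSpec, ...]

    def missing_required(self, extracted: dict) -> list[str]:
        """Names of required fields that are absent or empty in an extraction."""
        return [
            f.name
            for f in self.fields
            if f.required and extracted.get(f.name) in (None, "")
        ]


# Hypothetical schema for one document type.
INVOICE_SCHEMA = ExtractionSchema(
    document_type="invoice",
    fields=(
        FieldSpec("supplier_name", "string"),
        FieldSpec("invoice_number", "string"),
        FieldSpec("invoice_date", "date"),
        FieldSpec("total_amount", "decimal"),
        FieldSpec("po_number", "string", required=False),
    ),
)

# Lookup used after classification to pick the matching schema.
SCHEMAS = {INVOICE_SCHEMA.document_type: INVOICE_SCHEMA}
```

Keeping the schema as data rather than code is one way to let an admin UI or a configuration file manage it without a redeploy.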
reference_data_for_validation
Known-good data the extracted values get checked against: supplier list, product catalog, customer roster, GL codes.
- supplier master in the ERP or accounting system
- product catalog with SKUs
- customer database keyed by name and tax ID
- chart of accounts
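The checks against reference data could be sketched roughly as follows. The record layout (`supplier_tax_id`, `lines`, `total`) and the failure labels are assumptions made for illustration, not a fixed format.

```python
from decimal import Decimal


def validate_record(record, known_supplier_ids, known_skus):
    """Return a list of validation failure labels; an empty list means valid.

    The record shape and the label strings are hypothetical.
    """
    failures = []

    # Does this supplier exist in the supplier master?
    if record["supplier_tax_id"] not in known_supplier_ids:
        failures.append("unknown_supplier")

    # Are the line items recognized in the product catalog?
    for line in record["lines"]:
        if line["sku"] not in known_skus:
            failures.append(f"unknown_sku:{line['sku']}")

    # Do the line amounts add up to the stated total?
    # Decimal avoids floating-point rounding on currency values.
    line_total = sum((Decimal(line["amount"]) for line in record["lines"]), Decimal("0"))
    if line_total != Decimal(record["total"]):
        failures.append("total_mismatch")

    return failures
```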
structured_output_destination
Where the extracted, validated records land. Usually the system of record the documents would have been keyed into by hand.
- bill or invoice created in the accounting system
- purchase order record in the ERP
- claim record in the operations platform
- structured entry in the workflow management tool
human_review_queue
Where low-confidence extractions go for human verification. The reviewer sees the source document and the proposed values side by side, with the uncertain fields highlighted.
- review UI built for the engagement showing document and fields side by side
- queue inside the accounting or ERP system
- dedicated app for the operations reviewers
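One possible shape for the item a review UI would render is sketched below. The function name, the dictionary keys, and the 0.9 threshold are assumptions; the point is only that each proposed value carries its confidence and an "uncertain" flag the UI can use for highlighting.

```python
def build_review_item(document_id, proposed, confidences, threshold=0.9):
    """Package proposed values for side-by-side review.

    `threshold` is a hypothetical cut-off; fields scoring below it are
    flagged so the UI can highlight them. Fields with no score are
    treated as confidence 0.0 and therefore flagged.
    """
    fields = []
    for name, value in proposed.items():
        confidence = confidences.get(name, 0.0)
        fields.append(
            {
                "name": name,
                "proposed_value": value,
                "confidence": confidence,
                "uncertain": confidence < threshold,
            }
        )
    return {"document_id": document_id, "fields": fields}
```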
exception_workflow
Where documents that fundamentally can't be processed go: unknown supplier, malformed document, suspected duplicate, suspected fraud.
- exception queue in the accounting system
- ticket created for finance operations
- dedicated investigation folder with notifications
source_document_archive
Long-term storage of original documents, linked from the structured records for audit and reference.
- document management system with retention policies
- archived folder in the file store with structured naming
- attachment field on the destination record
- 01 A new document arrives on one of the intake channels [document_intake_channel]
- 02 Classify document type and select the matching extraction schema [extraction_schema_definitions]
- 03 Extract field values from the document, scoring each field's extraction confidence
- 04 Validate extracted values against reference data (does this supplier exist, are line items recognized, do totals match) [reference_data_for_validation]
- 05 Classify the result: high-confidence + valid → auto-process; low-confidence or validation failures → review; fundamental problems → exception. DECISION: Three branches based on confidence and validation outcome.
- 06 For auto-process: write structured record to destination and archive source [structured_output_destination, source_document_archive]
- 07 For review: queue with source and proposed values for human verification [human_review_queue]
- 08 For exception: route to exception workflow with classification and evidence [exception_workflow]
- 09 On reviewer confirmation, write the record and feed corrections back for schema and confidence tuning [structured_output_destination, source_document_archive]
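The three-way decision in step 05 can be sketched as a small routing function. This is an illustration under stated assumptions: the 0.9 threshold, the use of the lowest per-field confidence as the document's confidence, and the failure label strings are all hypothetical. The set of fundamental problems mirrors the ones listed for the exception workflow.

```python
from enum import Enum


class Route(Enum):
    AUTO_PROCESS = "auto_process"
    REVIEW = "review"
    EXCEPTION = "exception"


# Hypothetical labels for problems that cannot be fixed by a reviewer.
FUNDAMENTAL_PROBLEMS = {
    "unknown_supplier",
    "malformed_document",
    "suspected_duplicate",
    "suspected_fraud",
}


def route(field_confidences, validation_failures, threshold=0.9):
    """Pick a branch from per-field confidences and validation failures."""
    # Fundamental problems go to the exception workflow regardless of confidence.
    if any(f in FUNDAMENTAL_PROBLEMS for f in validation_failures):
        return Route.EXCEPTION
    # Any other validation failure needs a human to look at it.
    if validation_failures:
        return Route.REVIEW
    # Assumption: a document is only as confident as its weakest field,
    # and a document with no scored fields is never auto-processed.
    if not field_confidences or min(field_confidences.values()) < threshold:
        return Route.REVIEW
    return Route.AUTO_PROCESS
```

Checking for exceptions first means a high-confidence extraction of a suspected duplicate still goes to investigation and is never written to the destination.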
Structured outputs this pattern produces. Other patterns and client systems can subscribe to them, which is how the catalog composes over time.
extraction_quality_signal
Per-document-type accuracy: auto-process rate, reviewer override rate, exception rate. The main metric for tuning.
- pattern quality dashboards
- schema refinement workflows
- monthly operations review
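A minimal sketch of how the three rates might be computed per document type follows. The outcome record shape is hypothetical, and the override rate is assumed here to mean "share of reviewed documents where the reviewer changed at least one value"; other definitions are equally plausible.

```python
from collections import defaultdict


def quality_by_document_type(outcomes):
    """Aggregate rates from outcome dicts.

    Each outcome is assumed to carry 'document_type', 'route'
    ('auto_process', 'review', or 'exception'), and optionally
    'overridden' (bool) for reviewed documents.
    """
    counts = defaultdict(
        lambda: {"total": 0, "auto": 0, "reviewed": 0, "overridden": 0, "exception": 0}
    )
    for outcome in outcomes:
        c = counts[outcome["document_type"]]
        c["total"] += 1
        if outcome["route"] == "auto_process":
            c["auto"] += 1
        elif outcome["route"] == "review":
            c["reviewed"] += 1
            if outcome.get("overridden"):
                c["overridden"] += 1
        elif outcome["route"] == "exception":
            c["exception"] += 1

    return {
        doc_type: {
            "auto_process_rate": c["auto"] / c["total"],
            # Avoid dividing by zero when nothing was reviewed.
            "reviewer_override_rate": c["overridden"] / c["reviewed"] if c["reviewed"] else 0.0,
            "exception_rate": c["exception"] / c["total"],
        }
        for doc_type, c in counts.items()
    }
```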
supplier_or_customer_emergence_signal
New entities appearing in documents that don't match reference data, surfaced for master-data maintenance.
- procurement
- B2 CRM hygiene if live
- master data governance
anomaly_signal
Documents flagged as suspicious: duplicates, unusual amounts, mismatched line items.
- fraud and audit workflows
- finance leadership review
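As one example of an anomaly check, suspected duplicates could be flagged by comparing a key built from a few extracted fields. The choice of key (supplier tax ID, normalized invoice number, total) and the record field names are assumptions for illustration; real duplicate detection usually needs fuzzier matching.

```python
def find_suspected_duplicates(records):
    """Return (first_document_id, duplicate_document_id) pairs.

    Two records are treated as suspected duplicates when they share the
    same supplier, invoice number (ignoring case and surrounding
    whitespace), and total. This key is a hypothetical choice.
    """
    first_seen = {}
    duplicates = []
    for record in records:
        key = (
            record["supplier_tax_id"],
            record["invoice_number"].strip().upper(),
            record["total"],
        )
        if key in first_seen:
            duplicates.append((first_seen[key], record["document_id"]))
        else:
            first_seen[key] = record["document_id"]
    return duplicates
```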