Blog · Data Analytics · 06/11/2026

Data digitization: from paper, Excel, and legacy systems to a single source of truth

Most data digitization projects fail not because OCR misreads characters, but because no one reconciles, normalizes, and merges the data afterward. This article walks through a digitization pipeline architecture (OCR/ICR, validation, entity resolution/MDM, and an incremental migration strategy) and explains why "one source of truth" matters more than "99% character accuracy".

Data Analytics · OCR · Data Quality · Master Data Management · 8 min read
[Figure B-06: Digitization pipeline. Inputs: paper forms, scans/PDFs, Excel/CSV. Stages: OCR/ICR, validate, dedupe + MDM. Output: golden record. Callouts: human review ~6%, 3 formats / 1 schema, dupes −38%.]

"Data digitization" is often understood simply as "turning paper into files": scanning contracts, entering data from ledgers into Excel, or moving a legacy system to the cloud. But most projects stop right there — you end up with PDF files and Excel spreadsheets, but no single data source that other systems can trust and use directly. The real technical challenge of data digitization isn't "reading the text" — it's the steps that come right after: normalization, reconciliation, and merging into a single source of truth. This article breaks down the common problems and the pipeline architecture we use to turn scattered, inconsistent data into something operational and analytical systems can use directly.

The real problem: not "reading the text", but "reconciling the data"

Modern OCR/ICR tools — including Vision-Language Models like Qwen3-VL, which we covered in a previous article — achieve over 95% accuracy on printed text and most clear handwriting. But reading every character correctly doesn't solve the bigger problem: the same customer, product, or asset often appears with different spellings across multiple documents and systems — "Nguyen Van A", "NGUYEN VAN A", "Nguyen V.A"; customer code KH-00123 in the accounting system but CUST123 in the CRM. An engine that reads 100% of characters correctly is still useless if the destination system doesn't know these refer to "the same entity".

OCR, ICR, and the limits of "automatic reading"

OCR (Optical Character Recognition) handles printed text with simple layouts well. ICR (Intelligent Character Recognition) for handwriting has lower accuracy and depends heavily on photo/scan quality. For documents with complex tables, multi-column layouts, or layouts that vary between issuers — invoices, old accounting ledgers, handwritten forms — no engine reaches 100%, and forcing an extremely high accuracy threshold usually causes compute costs to balloon for very little marginal benefit.

A more practical approach is confidence-based routing: fields with high confidence pass into the system automatically, while fields with low confidence go into a queue for human review. In the KYC pipeline we deployed for a tier-1 fintech, this approach automated 92% of applications. The remaining 8% isn't a "failure"; it's the portion designed for humans to handle, because the cost of automating it exceeds the value it would add.
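As a rough illustration of confidence-based routing, here is a minimal Python sketch. The names `ExtractedField` and `route_document`, the 0.90 threshold, and the sample values are all hypothetical; a real pipeline would tune the threshold per field against labeled samples.

```python
from dataclasses import dataclass

# Hypothetical threshold; in practice it would be tuned per field.
AUTO_ACCEPT_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # engine-reported score in [0, 1]


def route_document(fields, threshold=AUTO_ACCEPT_THRESHOLD):
    """Split extracted fields into auto-accepted values and a human review queue."""
    accepted, review_queue = {}, []
    for field in fields:
        if field.confidence >= threshold:
            accepted[field.name] = field.value
        else:
            review_queue.append(field)
    return accepted, review_queue


# Made-up sample output from an OCR/ICR engine.
fields = [
    ExtractedField("full_name", "Nguyen Van A", 0.98),
    ExtractedField("date_of_birth", "05/06/1990", 0.95),
    ExtractedField("address", "12 Le Loi, Q1", 0.61),
]
accepted, review_queue = route_document(fields)
```

Routing per field rather than per document means a reviewer only sees the low-confidence address here, not the whole application.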

Data normalization — the "boring" step that determines success or failure

After extraction, raw data is almost always inconsistent: dates written in multiple formats (dd/mm/yyyy, yyyy-mm-dd, or "the 5th of June"), phone numbers with or without area codes, addresses abbreviated differently across documents. If this normalization step is skipped, every problem resurfaces downstream — wrong reports, mismatched dashboards, or worse, operational systems silently using the wrong data without anyone noticing.

  • Define a clear target schema for each field — data type, format, valid values — before writing any conversion script
  • Validate at the source: check ID checksums (national ID numbers, tax codes), reasonable value ranges (birth dates not in the future, amounts that shouldn't be negative)
  • Log every record that fails validation, with a specific reason — for batch review, instead of fixing rows by hand when errors surface later
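The three rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete validator: the accepted date formats, the field names (`birth_date`, `amount`), and the in-memory `rejects` list standing in for a reject log are all hypothetical. ID checksum checks are left out because they depend on the specific ID scheme.

```python
from datetime import date, datetime

# Assumed input formats; dd/mm/yyyy is tried first for ambiguous values.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def validate_record(record, today=None):
    """Return (normalized_record, reasons); an empty reasons list means it passed."""
    today = today or date.today()
    reasons = []
    normalized = dict(record)

    birth = normalize_date(record.get("birth_date", ""))
    if birth is None:
        reasons.append("birth_date: unrecognized format")
    elif date.fromisoformat(birth) > today:
        reasons.append("birth_date: in the future")
    else:
        normalized["birth_date"] = birth

    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        reasons.append("amount: missing or negative")

    return normalized, reasons


clean, rejects = [], []  # rejects stands in for a log reviewed in batches
for rec in [
    {"id": 1, "birth_date": "05/06/1990", "amount": 120.0},
    {"id": 2, "birth_date": "the 5th of June", "amount": -3},
]:
    normalized, reasons = validate_record(rec, today=date(2026, 1, 1))
    if reasons:
        rejects.append({"id": rec["id"], "reasons": reasons})
    else:
        clean.append(normalized)
```

Collecting every failure reason per record, instead of stopping at the first, makes the batch review faster because a reviewer sees all problems with a row at once.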

This is the same data contract concept we covered in the article on ELT pipelines with dbt and Airflow — applied starting from the very first step of digitization, not just at the warehouse layer.

Entity resolution and Master Data Management (MDM)

After normalization, the next step is to determine which of the thousands of records from multiple sources actually represent the same entity (customer, supplier, asset, product). This is entity resolution, which combines several techniques:

  • Blocking: grouping records that are likely duplicates based on a few easily comparable fields (phone number, tax code, email), so that only records within the same group are compared. Comparing every pair across the entire dataset scales as n², which isn't feasible with millions of records
  • Fuzzy matching — comparing remaining fields (name, address) using string distance metrics (Levenshtein, Jaro-Winkler) to handle typos and spelling variations
  • Merge rules — once multiple records are identified as the same entity, clear rules are needed to decide which value is "correct" for each field — usually the most recent record, or the one from the most trusted source
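The first two techniques can be sketched together in Python. This is a minimal illustration: the blocking key (`phone`), the 0.7 threshold, and the sample phone numbers are assumptions, and Levenshtein distance is implemented by hand here to keep the example dependency-free. A production system would more likely use a dedicated library and block on several keys.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1] after case folding and whitespace collapse."""
    a, b = " ".join(a.casefold().split()), " ".join(b.casefold().split())
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def candidate_matches(records, block_key="phone", threshold=0.7):
    """Compare names only within blocks that share the same block_key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    matches = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                matches.append((left["id"], right["id"], round(score, 2)))
    return matches


# Made-up sample rows; the phone numbers are illustrative only.
records = [
    {"id": "KH-00123", "name": "Nguyen Van A", "phone": "0901234567"},
    {"id": "CUST123", "name": "NGUYEN VAN A", "phone": "0901234567"},
    {"id": "CUST456", "name": "Nguyen V.A", "phone": "0901234567"},
    {"id": "CUST789", "name": "Tran Thi B", "phone": "0907654321"},
]
matches = candidate_matches(records)
```

With these sample rows, "Nguyen V.A" scores 0.75 against the full name, so it would be missed at a stricter threshold such as 0.8. Choosing the threshold is a trade-off between missed duplicates and false merges, which is one reason borderline pairs are often sent to human review.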

The output of entity resolution is a "golden record" — a canonical record for each entity, with a unique ID and back-links to all the source records merged into it. Keeping this back-link (provenance) is just as important as the golden record itself — when an error is found, you need to know where the data came from to fix it at the root, not just patch the merged record.
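A golden record with field-level provenance might look like the following Python sketch. The source ranking in `SOURCE_PRIORITY`, the field names, and the rule "most trusted source first, then most recent" are assumptions chosen for illustration; the right merge rule depends on the data.

```python
import uuid

# Assumed trust ranking; a lower number means a more trusted source.
SOURCE_PRIORITY = {"accounting": 0, "crm": 1, "excel": 2}


def build_golden_record(source_records, fields=("name", "phone", "email")):
    """Merge matched records: prefer the most trusted source, then the most recent."""
    # Two stable sorts: recency first, then trust, so trust wins and recency breaks ties.
    by_recency = sorted(source_records, key=lambda r: r["updated_at"], reverse=True)
    ranked = sorted(by_recency, key=lambda r: SOURCE_PRIORITY[r["source"]])

    golden = {"golden_id": str(uuid.uuid4()), "provenance": {}}
    for field in fields:
        for rec in ranked:
            value = rec.get(field)
            if value:  # skip empty strings and None
                golden[field] = value
                # Record which source row supplied this field value.
                golden["provenance"][field] = (rec["source"], rec["id"])
                break
    # Back-links to every source record merged into this entity.
    golden["merged_from"] = [(r["source"], r["id"]) for r in source_records]
    return golden


# Made-up sample rows for one matched entity.
golden = build_golden_record([
    {"source": "crm", "id": "CUST123", "name": "NGUYEN VAN A",
     "phone": "0901234567", "email": "a@example.com", "updated_at": "2025-03-01"},
    {"source": "accounting", "id": "KH-00123", "name": "Nguyen Van A",
     "phone": "", "email": None, "updated_at": "2024-11-15"},
])
```

Here the name comes from the accounting record while the phone and email fall back to the CRM record, and `provenance` shows which source to correct if any one of those values turns out to be wrong.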

Migration strategy: incremental, not big-bang

A common mistake is trying to migrate all data and cut over to a new system in one shot, a "big-bang migration". When something goes wrong (and something always does), the entire operation is affected, and rollback becomes nearly impossible once the new system has been running for a while. A safer approach:

  • Dual-run — the old system and the digitization pipeline run side by side for a period, comparing results between the two before fully cutting over to the new system
  • Batch migration — by department, by document type, or by time period — each batch validated independently before expanding to the next
  • Per-batch rollback capability — if a batch reveals a data quality issue, only that batch needs to be reworked, not the entire pipeline
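The three practices above can be combined in a small Python sketch. `dual_run_diff` and `BatchMigrator` are hypothetical names, the in-memory dict stands in for the new system's storage, and the comparison assumes both systems' outputs have already been projected onto the same schema.

```python
def dual_run_diff(legacy_rows, pipeline_rows, key="id"):
    """Compare the outputs of the old system and the new pipeline for one batch."""
    legacy = {row[key]: row for row in legacy_rows}
    new = {row[key]: row for row in pipeline_rows}
    return {
        "missing_in_new": sorted(legacy.keys() - new.keys()),
        "unexpected_in_new": sorted(new.keys() - legacy.keys()),
        "mismatched": sorted(k for k in legacy.keys() & new.keys() if legacy[k] != new[k]),
    }


class BatchMigrator:
    def __init__(self):
        self.target = {}     # stands in for the new system's table
        self.committed = {}  # batch name -> ids written by that batch

    def migrate(self, name, legacy_rows, pipeline_rows):
        """Commit a batch only if its dual-run diff is clean."""
        diff = dual_run_diff(legacy_rows, pipeline_rows)
        if any(diff.values()):
            return {"status": "held_back", "diff": diff}
        self.target.update({row["id"]: row for row in pipeline_rows})
        self.committed[name] = [row["id"] for row in pipeline_rows]
        return {"status": "migrated", "rows": len(pipeline_rows)}

    def rollback(self, name):
        """Remove only the rows written by one batch."""
        for row_id in self.committed.pop(name, []):
            self.target.pop(row_id, None)


# Made-up sample batches, split by department.
migrator = BatchMigrator()
finance = [{"id": 1, "total": 100}, {"id": 2, "total": 250}]
hr_legacy = [{"id": 3, "total": 75}]
hr_new = [{"id": 3, "total": 57}]  # a discrepancy introduced upstream
r1 = migrator.migrate("finance", finance, finance)
r2 = migrator.migrate("hr", hr_legacy, hr_new)
```

Because each batch records which rows it wrote, a problem found later in one batch can be undone with `rollback` without touching the batches already validated.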

In an internal portal we built for a client to replace 4 legacy systems, data was merged group by group, with each group running in parallel with the old systems for a few weeks. After completion, new-user onboarding time dropped by 72%, because all information now lived in one place with one consistent schema.

Reference architecture: a digitization pipeline

Putting the steps above together into an end-to-end pipeline:

  • Sources — scanned paper, PDFs, scattered Excel/CSV files, or exports from legacy systems
  • OCR/ICR or parser — extracts raw data along with a confidence score per field
  • Validate — checks against the target schema, format, and business rules; failing records go to a review queue
  • Entity resolution / MDM — merges duplicate records into golden records, preserving provenance back to source
  • Golden record store — the primary data source, feeding both operational systems (CRM, ERP) and the Data Analytics layer downstream

The most important property of this architecture is that every step can be audited independently — when a number on a dashboard looks wrong, you can trace back through each step to find the actual cause, instead of treating the whole pipeline as a "black box".
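One way to make each step independently auditable is to log the input and output of every stage per record. The Python sketch below is a minimal illustration; `run_stage`, `trace`, the stage names, and the in-memory `audit_log` list (standing in for an append-only audit table) are all hypothetical.

```python
from datetime import datetime, timezone

audit_log = []  # stands in for an append-only audit table


def run_stage(stage_name, record, transform):
    """Apply one pipeline stage and log its input and output for later tracing."""
    output = transform(record)
    audit_log.append({
        "record_id": record["id"],
        "stage": stage_name,
        "input": dict(record),
        "output": dict(output),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return output


def trace(record_id):
    """Return the stages that changed a record, in execution order."""
    return [
        (entry["stage"], entry["input"], entry["output"])
        for entry in audit_log
        if entry["record_id"] == record_id and entry["input"] != entry["output"]
    ]


# Made-up record passing through three illustrative stages.
record = {"id": "CUST123", "name": "  NGUYEN VAN A ", "amount": "1,200"}
record = run_stage("normalize_name", record,
                   lambda r: {**r, "name": r["name"].strip().title()})
record = run_stage("validate", record, lambda r: r)
record = run_stage("parse_amount", record,
                   lambda r: {**r, "amount": int(r["amount"].replace(",", ""))})
```

When a dashboard number looks wrong, `trace("CUST123")` narrows the search to the stages that actually modified the record, here `normalize_name` and `parse_amount`.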

Conclusion

Successful data digitization isn't measured by "how many documents have been scanned", but by whether another system — a dashboard, a CRM, or a forecasting model — can use that data directly without anyone manually cleaning it up again. OCR/ICR is only the first step; normalization, entity resolution, and a golden record with clear provenance are what determine whether a digitization project creates lasting value or is just a one-time exercise in moving files from paper to disk. This is the kind of architecture we design within the Data Analytics layer for every Pilot Build with input data coming from multiple inconsistent sources.
