
How AI Reads and Extracts Data from Invoices

Combining OCR with large language models makes invoice data extraction reliable enough to trust. Here's what actually happens when you upload a document.

By Theo Zimmermann · 2026-04-05 · 6 min read

When you upload an invoice to an AI processing system, it looks simple. The file goes in, structured data comes out. But the process behind that transformation involves multiple layers of technology, each solving a different piece of the problem.

Understanding how it works helps you know when to trust it, when to verify it, and what kinds of documents it handles well versus poorly.

Layer 1: Document Parsing

Before any AI touches the content, the system needs to get the text out of the document.

For native digital PDFs (those generated by invoicing software and sent as attachments), text extraction is straightforward. The text is embedded in the PDF structure and can be read out directly, without any image processing.

For scanned documents, photographs, or image-based PDFs, the system needs OCR. Modern OCR engines (Google Document AI, AWS Textract, and proprietary alternatives) convert images of text into machine-readable characters. Accuracy on clean, printed text is 99%+. Accuracy drops on handwriting, unusual fonts, low-resolution scans, and documents with complex layouts like multi-column tables.
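The routing decision between direct extraction and OCR can be sketched as follows. This is a minimal illustration, not any particular product's implementation: the `Page` type, the `MIN_EMBEDDED_CHARS` threshold, and the idea of deciding per page are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: pages with fewer embedded characters than this
# are treated as image-only and queued for OCR.
MIN_EMBEDDED_CHARS = 20


@dataclass
class Page:
    number: int
    embedded_text: str  # text found in the PDF structure; "" for a pure scan


def needs_ocr(page: Page, min_chars: int = MIN_EMBEDDED_CHARS) -> bool:
    """Return True when the page carries too little embedded text to use directly."""
    visible = "".join(page.embedded_text.split())  # drop all whitespace
    return len(visible) < min_chars


def route_pages(pages: list[Page]) -> dict[str, list[int]]:
    """Split page numbers into a direct-extraction queue and an OCR queue."""
    routes: dict[str, list[int]] = {"direct": [], "ocr": []}
    for page in pages:
        routes["ocr" if needs_ocr(page) else "direct"].append(page.number)
    return routes
```

Calling `route_pages` on a two-page file where the first page has embedded text and the second contains only whitespace would put page 1 in the direct queue and page 2 in the OCR queue.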

Layer 2: Layout Understanding

Raw text extraction gives you characters and words, but loses the spatial structure of the document. In raw text form, a total amount at the bottom right of a table looks the same as a house number in an address at the top left of the page.

Modern document AI models understand layout. They recognize headers, tables, line items, totals, and addresses based on their position and visual relationships. This spatial understanding is what allows the system to correctly distinguish the invoice total from the line item amounts, or the invoice date from the payment due date.
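To see why position matters, consider a hand-written heuristic that picks the value printed on the same row as a label. Real document AI models learn these relationships rather than hard-coding them, so this is only a sketch; the `Word` type, the coordinate convention, and the `row_tolerance` default are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    x: float  # left edge, in page units
    y: float  # vertical centre, in page units


def value_right_of(
    words: list[Word], label: str, row_tolerance: float = 5.0
) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same row."""
    anchors = [w for w in words if w.text.lower() == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    same_row = [
        w for w in words
        if abs(w.y - anchor.y) <= row_tolerance and w.x > anchor.x
    ]
    if not same_row:
        return None
    return min(same_row, key=lambda w: w.x).text


words = [
    Word("Hauptstr.", 40, 60), Word("119", 90, 60),           # address, top left
    Word("Gesamtbetrag", 300, 700), Word("119,00", 480, 700),  # total, bottom right
]
total = value_right_of(words, "Gesamtbetrag")
```

Here the two numbers are only distinguishable because of where they sit: the one on the row of the "Gesamtbetrag" label is the total, and the one next to the street name is a house number.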

Layer 3: Semantic Extraction

This is where large language models enter. Once the text and layout are understood, an LLM is asked to extract specific fields: vendor name, invoice number, date, line items, net amount, VAT amount, gross total, currency, and payment reference.

The LLM handles the enormous variation in invoice formats. One vendor puts the invoice total at the top. Another puts it at the bottom. One uses “invoice amount (Rechnungsbetrag),” another uses “total amount (Gesamtbetrag),” another uses “Total due.” The LLM understands that these all mean the same thing and extracts the right value regardless of format.
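An extraction request of this kind might be framed roughly as below. The prompt wording, the canonical field names, and the JSON reply format are assumptions for the sketch; no specific model or vendor API is implied, and the call to the model itself is left out.

```python
import json

# Canonical field names (hypothetical); they mirror the fields listed in the text.
FIELDS = [
    "vendor_name", "invoice_number", "invoice_date", "line_items",
    "net_amount", "vat_amount", "gross_total", "currency", "payment_reference",
]


def build_prompt(document_text: str) -> str:
    """Assemble an extraction prompt (wording is illustrative only)."""
    return (
        "Extract the following fields from the invoice below and answer "
        "with a single JSON object. Use null for anything not present.\n"
        f"Fields: {', '.join(FIELDS)}\n"
        "Labels such as 'Rechnungsbetrag', 'Gesamtbetrag' and 'Total due' "
        "all refer to gross_total.\n\n"
        f"Invoice:\n{document_text}"
    )


def parse_response(raw: str) -> dict:
    """Parse the model's reply, keep only known fields, fill gaps with None."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {field: data.get(field) for field in FIELDS}
```

Restricting the parsed result to a fixed set of field names means that whatever label the vendor printed, downstream code always sees the same keys.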

For German invoices specifically, the LLM handles VAT (Umsatzsteuer) at the 19% and 7% rates, small-business (Kleinunternehmer) invoices issued without VAT, reverse-charge EU invoices, and the field structure of DATEV-compatible documents.
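Extracted amounts can be cross-checked arithmetically, since net, VAT, and gross have to agree. The following is a minimal sketch of such a check; the one-cent tolerance and the function name are assumptions, and invoices with several VAT rates would need the check applied per rate.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
TOLERANCE = Decimal("0.01")  # assumed rounding slack of one cent


def vat_is_consistent(
    net: Decimal, vat: Decimal, gross: Decimal, rate: Decimal
) -> bool:
    """Check that vat is net * rate (rounded to cents) and net + vat equals gross."""
    expected_vat = (net * rate).quantize(CENT, rounding=ROUND_HALF_UP)
    vat_ok = abs(vat - expected_vat) <= TOLERANCE
    sum_ok = abs(net + vat - gross) <= TOLERANCE
    return vat_ok and sum_ok


# Standard rate (19%), reduced rate (7%), and a small-business invoice without VAT.
standard = vat_is_consistent(Decimal("100.00"), Decimal("19.00"), Decimal("119.00"), Decimal("0.19"))
reduced = vat_is_consistent(Decimal("100.00"), Decimal("7.00"), Decimal("107.00"), Decimal("0.07"))
no_vat = vat_is_consistent(Decimal("100.00"), Decimal("0.00"), Decimal("100.00"), Decimal("0"))
```

Using `Decimal` instead of floats avoids binary rounding surprises when comparing currency amounts.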

Accuracy and Its Limits

The combined pipeline achieves over 90% accuracy on standard German business invoices, measured per document rather than per field. Out of 100 invoices, more than 90 will have all key fields extracted correctly, with no manual correction needed.

The remaining invoices fall into predictable categories. Handwritten or partially handwritten documents are harder. Heavily formatted or designed documents where text overlaps decorative elements cause problems. Very low quality scans, below about 150 DPI, produce OCR errors that propagate through the pipeline.
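Scan resolution can be estimated before OCR from the image width in pixels and the physical page width. The sketch below assumes an A4 page (210 mm wide) scanned edge to edge, and uses the roughly 150 DPI figure from the text as its cutoff; the function names are made up for the example.

```python
MM_PER_INCH = 25.4
A4_WIDTH_MM = 210.0
MIN_DPI = 150.0  # approximate cutoff mentioned in the text


def effective_dpi(pixel_width: int, page_width_mm: float = A4_WIDTH_MM) -> float:
    """Estimate scan resolution from image width and physical page width."""
    return pixel_width / (page_width_mm / MM_PER_INCH)


def scan_is_too_coarse(pixel_width: int, page_width_mm: float = A4_WIDTH_MM) -> bool:
    """Return True when the estimated resolution falls below MIN_DPI."""
    return effective_dpi(pixel_width, page_width_mm) < MIN_DPI
```

An A4 scan that is 2480 pixels wide works out to about 300 DPI, while one that is 1000 pixels wide is about 121 DPI and would be flagged.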

The correct mental model is that AI extraction handles the bulk accurately and flags the exceptions. You’re not removing the human from the process; you’re changing where human attention is required. Instead of manually entering every invoice, you review and correct the few that the system is uncertain about.
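The review step can be pictured as a simple threshold on per-field confidence scores. This sketch assumes the pipeline reports a score between 0 and 1 for each field; the `REVIEW_THRESHOLD` value and the field names are hypothetical.

```python
REVIEW_THRESHOLD = 0.85  # hypothetical cutoff; tune against real review outcomes


def fields_needing_review(
    confidences: dict[str, float], threshold: float = REVIEW_THRESHOLD
) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


scores = {"vendor_name": 0.99, "invoice_number": 0.97, "gross_total": 0.62}
to_review = fields_needing_review(scores)
```

In this example only `gross_total` would be shown to a human; the other fields pass through untouched.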

What Gets Extracted

For a typical German B2B invoice, the extraction pipeline captures:

Vendor name and address
Customer name (your business)
Invoice number
Invoice date
Payment due date
Line items with descriptions, quantities, and unit prices
Net amount
VAT rate and amount (multiple rates if applicable)
Gross total
Currency
Bank details (IBAN, BIC)
Tax number (Steuernummer) and VAT ID (USt-IdNr)
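Some of these fields carry their own checksums, which makes OCR mistakes easy to catch. As one example, an extracted IBAN can be validated with the standard mod-97 check. The sketch below verifies only the checksum and overall length range, not country-specific lengths or whether the account exists.

```python
def iban_is_valid(iban: str) -> bool:
    """Validate an IBAN using the mod-97 checksum."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and test the remainder.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


ok = iban_is_valid("DE89 3704 0044 0532 0130 00")       # widely used example IBAN
misread = iban_is_valid("DE89 3704 0044 0532 0130 01")  # one digit changed
```

A single misread digit changes the remainder, so the second call fails the check and the field can be flagged for review.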

This structured data is what flows into your records, gets matched against bank transactions, and eventually gets exported to DATEV format for your tax advisor (Steuerberater).
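How the matching against bank transactions works is not described here, so the following is only one plausible approach: pair an invoice with a transaction when the amount equals the gross total and the invoice number appears in the payment purpose. The `Invoice` and `Transaction` types and the matching rule are assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Invoice:
    number: str
    gross_total: Decimal


@dataclass
class Transaction:
    amount: Decimal  # outgoing payment, stored as a positive value
    purpose: str     # free-text payment reference from the bank statement


def find_match(
    invoice: Invoice, transactions: list[Transaction]
) -> Optional[Transaction]:
    """Return the first transaction with the same amount that mentions the invoice number."""
    for tx in transactions:
        same_amount = tx.amount == invoice.gross_total
        mentions_number = invoice.number.lower() in tx.purpose.lower()
        if same_amount and mentions_number:
            return tx
    return None
```

A production matcher would likely also tolerate partial payments, date windows, and references with typos; this sketch covers only the exact case.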

How KontoMatch Applies This

KontoMatch runs this full extraction pipeline on every document you upload. Native PDFs are parsed directly; scanned documents and photographs go through OCR first. The system extracts vendor name, invoice number, date, line items, net total, VAT, and payment details automatically, and flags low-confidence fields for your review. The extracted data flows into your expense records, is matched against your bank statement, and can be exported as a DATEV EXTF file for your tax advisor (Steuerberater).

