How AI Reads and Extracts Data from Invoices
Combining OCR with large language models makes invoice data extraction reliable enough to trust. Here's what actually happens when you upload a document.
When you upload an invoice to an AI processing system, it looks simple. The file goes in, structured data comes out. But the process behind that transformation involves multiple layers of technology, each solving a different piece of the problem.
Understanding how it works helps you know when to trust it, when to verify it, and what kinds of documents it handles well versus poorly.
Layer 1: Document Parsing
Before any AI touches the content, the system needs to get the text out of the document.
For native digital PDFs (those generated by invoicing software and sent as attachments), text extraction is straightforward. The text is embedded in the PDF structure and can be read directly, with no image processing.
For scanned documents, photographs, or image-based PDFs, the system needs OCR. Modern OCR engines (Google Document AI, Amazon Textract, and proprietary alternatives) convert images of text into machine-readable characters. On clean, printed text, accuracy is typically 99% or higher. It drops on handwriting, unusual fonts, low-resolution scans, and documents with complex layouts such as multi-column tables.
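The choice between the two paths can be sketched as a simple routing step. This is a minimal illustration, not a description of any particular product: the `Page` type, the `needs_ocr` helper, and the 20-character threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: pages with fewer embedded characters than this
# are treated as image-only. Tune it on your own documents.
MIN_EMBEDDED_CHARS = 20


@dataclass
class Page:
    embedded_text: str  # text read from the PDF text layer ("" if there is none)


def needs_ocr(page: Page, min_chars: int = MIN_EMBEDDED_CHARS) -> bool:
    """Return True when the page has too little embedded text to rely on."""
    visible = "".join(page.embedded_text.split())  # drop all whitespace
    return len(visible) < min_chars


def route(pages: list[Page]) -> list[str]:
    """Label each page 'direct' (use the text layer) or 'ocr'."""
    return ["ocr" if needs_ocr(p) else "direct" for p in pages]


pages = [
    Page("Rechnung Nr. 2024-001 Gesamtbetrag 119,00 EUR"),  # native PDF page
    Page("  \n "),                                          # scanned page, no text layer
]
routes = route(pages)
```

Deciding per page rather than per file matters in practice, because a single PDF can mix native pages with scanned attachments.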
Layer 2: Layout Understanding
Raw text extraction gives you characters and words, but loses the spatial structure of the document. In raw text form, a total amount at the bottom right of a table looks no different from a number in an address at the top left of the page.
Modern document AI models understand layout. They recognize headers, tables, line items, totals, and addresses based on their position and visual relationships. This spatial understanding is what allows the system to correctly distinguish the invoice total from the line item amounts, or the invoice date from the payment due date.
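To make the idea of spatial relationships concrete, here is a toy sketch that pairs a label with the value printed to its right on the same row. Real layout models learn these relationships rather than using a fixed rule; the `Word` type, the coordinates, and the `row_tolerance` parameter are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical centre of the word on the page


def value_right_of(words: list[Word], label: str,
                   row_tolerance: float = 5.0) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same row."""
    anchors = [w for w in words if w.text.lower() == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    same_row = [
        w for w in words
        if abs(w.y - anchor.y) <= row_tolerance and w.x > anchor.x
    ]
    if not same_row:
        return None
    return min(same_row, key=lambda w: w.x).text


# Made-up OCR output: two dates near the top, the total near the bottom right.
words = [
    Word("Rechnungsdatum", 50, 100), Word("01.03.2024", 200, 101),
    Word("Zahlungsziel", 50, 120), Word("15.03.2024", 200, 121),
    Word("Gesamtbetrag", 300, 700), Word("119,00", 450, 699),
]
invoice_date = value_right_of(words, "Rechnungsdatum")
due_date = value_right_of(words, "Zahlungsziel")
total = value_right_of(words, "Gesamtbetrag")
```

Without the coordinates, the two dates would be indistinguishable strings; with them, each one is tied to the label it sits next to.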
Layer 3: Semantic Extraction
This is where large language models enter. Once the text and layout are understood, an LLM is asked to extract specific fields: vendor name, invoice number, date, line items, net amount, VAT amount, gross total, currency, and payment reference.
The LLM handles the enormous variation in invoice formats. One vendor puts the invoice total at the top. Another puts it at the bottom. One labels it “Rechnungsbetrag” (invoice amount), another “Gesamtbetrag” (total amount), another “Total due.” The LLM understands that these all mean the same thing and extracts the right value regardless of format.
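A minimal sketch of this step, assuming the model is asked to reply with JSON: the field names, the prompt wording, and the `complete` callable (a stand-in for whatever LLM client you use) are hypothetical, and `fake_llm` is a stub so the example runs without a real model.

```python
import json
from typing import Callable

# Field names are illustrative; they mirror the fields listed in the text.
FIELDS = [
    "vendor_name", "invoice_number", "invoice_date", "line_items",
    "net_amount", "vat_amount", "gross_total", "currency", "payment_reference",
]


def build_prompt(invoice_text: str) -> str:
    """Ask for the fields as one JSON object, whatever labels the invoice uses."""
    return (
        "Extract the following fields from the invoice text and reply with "
        "a single JSON object. Use null for any field that is not present.\n"
        f"Fields: {', '.join(FIELDS)}\n\n"
        f"Invoice text:\n{invoice_text}"
    )


def extract_fields(invoice_text: str, complete: Callable[[str], str]) -> dict:
    """Send the prompt through `complete` and keep only the expected keys."""
    raw = complete(build_prompt(invoice_text))
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    return {name: data.get(name) for name in FIELDS}


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return '{"vendor_name": "Muster GmbH", "gross_total": "119.00", "extra": 1}'


result = extract_fields("Rechnungsbetrag: 119,00 EUR", fake_llm)
```

Filtering the reply down to a fixed set of keys keeps unexpected output from leaking into downstream records; a production system would also need to handle replies that are not valid JSON.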
For German invoices specifically, the LLM handles VAT (Umsatzsteuer) at the 19% and 7% rates, small-business (Kleinunternehmer) invoices without VAT, reverse-charge EU invoices, and the specific field structure of DATEV-compatible documents.
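Because the amounts on an invoice are arithmetically linked, extracted values can be cross-checked after the fact. The sketch below is one possible sanity check, not something the text prescribes: the function name and the one-cent tolerance are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def vat_is_consistent(net, vat, gross, rate, tolerance=CENT) -> bool:
    """Check that net * rate matches vat and net + vat matches gross.

    Amounts and the rate are passed as strings (or Decimals) to avoid
    float rounding. The default tolerance of one cent is an assumption.
    """
    net, vat, gross, rate = (Decimal(str(v)) for v in (net, vat, gross, rate))
    expected_vat = (net * rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return (
        abs(expected_vat - vat) <= tolerance
        and abs(net + vat - gross) <= tolerance
    )


standard = vat_is_consistent("100.00", "19.00", "119.00", "0.19")
reduced = vat_is_consistent("50.00", "3.50", "53.50", "0.07")
# Kleinunternehmer and reverse-charge invoices show no VAT: rate 0, VAT 0.
no_vat = vat_is_consistent("100.00", "0.00", "100.00", "0")
# A 7% VAT amount paired with a 19% gross total should fail the check.
mismatch = vat_is_consistent("100.00", "7.00", "119.00", "0.07")
```

A failed check does not say which field is wrong, only that the set of extracted amounts cannot all be right, which is a useful signal for routing the invoice to review.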
Accuracy and Its Limits
The combined pipeline achieves over 90% accuracy on standard German business invoices. Concretely, out of 100 invoices, more than 90 will have all key fields extracted correctly, with no manual correction needed.
The remaining invoices fall into predictable categories. Handwritten or partially handwritten documents are harder. Heavily designed documents, where text overlaps decorative elements, cause problems. Very low-quality scans, below about 150 DPI, produce OCR errors that propagate through the pipeline.
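Scan resolution can be estimated before OCR runs. The sketch below assumes the image covers a full A4 page in portrait orientation, which will not hold for cropped photos; the function names are made up for the example, and the 150 DPI threshold is the rough figure from the paragraph above.

```python
A4_WIDTH_INCHES = 210 / 25.4  # A4 is 210 mm wide; 25.4 mm per inch
MIN_DPI = 150                 # approximate threshold mentioned in the text


def estimated_dpi(width_px: int,
                  page_width_inches: float = A4_WIDTH_INCHES) -> float:
    """Estimate resolution from pixel width, assuming a full-page A4 scan."""
    return width_px / page_width_inches


def scan_is_too_coarse(width_px: int, min_dpi: float = MIN_DPI) -> bool:
    """Return True when the estimated resolution falls below `min_dpi`."""
    return estimated_dpi(width_px) < min_dpi


coarse = scan_is_too_coarse(1000)  # roughly 121 DPI on A4
fine = scan_is_too_coarse(2480)    # roughly 300 DPI on A4
```

A check like this lets a system ask for a better scan up front instead of passing OCR errors down the pipeline.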
The correct mental model is that AI extraction handles the bulk accurately and flags the exceptions. You’re not removing the human from the process; you’re changing where human attention is required. Instead of manually entering every invoice, you review and correct the few that the system is uncertain about.
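Flagging the exceptions can be as simple as comparing a per-field confidence score against a threshold. This sketch assumes the pipeline exposes such scores between 0 and 1; the 0.85 cut-off and the field names are hypothetical.

```python
REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; tune against real review outcomes


def fields_needing_review(confidences: dict[str, float],
                          threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


# Made-up scores for one invoice.
scores = {
    "invoice_number": 0.99,
    "gross_total": 0.97,
    "invoice_date": 0.62,
    "iban": 0.80,
}
to_review = fields_needing_review(scores)
```

Here only two of the four fields would be shown to a human; the rest pass through untouched.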
What Gets Extracted
For a typical German B2B invoice, the extraction pipeline captures:
- Vendor name and address
- Customer name (your business)
- Invoice number
- Invoice date
- Payment due date
- Line items with descriptions, quantities, and unit prices
- Net amount
- VAT rate and amount (multiple rates if applicable)
- Gross total
- Currency
- Bank details (IBAN, BIC)
- Tax number (Steuernummer) and VAT ID (USt-IdNr)
This structured data is what flows into your records, gets matched against bank transactions, and eventually gets exported to DATEV format for your tax advisor (Steuerberater).
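One way to picture the result is as a typed record. The class and attribute names below are illustrative, not a defined schema, and the sample values are invented.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal


@dataclass
class VatEntry:
    rate: Decimal    # e.g. Decimal("0.19")
    amount: Decimal


@dataclass
class Invoice:
    vendor_name: str
    vendor_address: str
    customer_name: str
    invoice_number: str
    invoice_date: str                 # ISO 8601 date string
    due_date: Optional[str]
    net_amount: Decimal
    gross_total: Decimal
    currency: str
    line_items: list[LineItem] = field(default_factory=list)
    vat: list[VatEntry] = field(default_factory=list)  # one entry per VAT rate
    iban: Optional[str] = None
    bic: Optional[str] = None
    tax_number: Optional[str] = None  # Steuernummer
    vat_id: Optional[str] = None      # USt-IdNr


invoice = Invoice(
    vendor_name="Muster GmbH",
    vendor_address="Musterstr. 1, 10115 Berlin",
    customer_name="Beispiel UG",
    invoice_number="2024-001",
    invoice_date="2024-03-01",
    due_date="2024-03-15",
    net_amount=Decimal("100.00"),
    gross_total=Decimal("119.00"),
    currency="EUR",
    line_items=[LineItem("Beratung", Decimal("1"), Decimal("100.00"))],
    vat=[VatEntry(Decimal("0.19"), Decimal("19.00"))],
)
record = asdict(invoice)  # plain nested dicts and lists, ready to serialize
```

Keeping VAT as a list rather than a single value accommodates invoices that mix rates, and optional fields make it explicit which details a given invoice may simply not contain.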
How KontoMatch Applies This
KontoMatch runs this full extraction pipeline on every document you upload. Native PDFs are parsed directly; scanned documents and photographs go through OCR first. The system extracts vendor name, invoice number, date, line items, net total, VAT, and payment details automatically, and flags low-confidence fields for your review. The extracted data flows into your expense records, is matched against your bank statement, and can be exported as a DATEV EXTF file for your tax advisor (Steuerberater).