Document AI systems have changed how organizations handle paperwork, contracts, invoices, and forms—but the terminology surrounding these systems can be a real barrier. Whether you are evaluating a vendor platform such as Google Document AI, onboarding a new tool like LlamaParse, or building a Document AI pipeline from scratch, understanding the language of the field is a prerequisite for making informed decisions. This Document AI glossary defines the most important terms in plain language, organized by category, so you can quickly find and apply the concepts most relevant to your work.
As intelligent document processing continues to evolve, teams need a shared vocabulary to compare vendors, interpret product claims, and design workflows that actually hold up in production. The goal of this guide is to make that vocabulary practical rather than abstract.
Quick Reference: All Document AI Terms at a Glance
The table below consolidates every term covered in this article into a single scannable view. Use it to locate a specific term quickly or to get a high-level sense of the article's scope before reading in depth.
| Term | Category | One-Line Summary | Section |
|---|---|---|---|
| OCR | Core Document AI Term | Converts text in images or scanned documents into machine-readable format | Four Foundational Document AI Concepts Defined |
| IDP | Core Document AI Term | Combines OCR, NLP, and ML to automate end-to-end document processing | Four Foundational Document AI Concepts Defined |
| NLP | Core Document AI Term | Enables machines to understand and interpret meaning from document text | Four Foundational Document AI Concepts Defined |
| Document Parsing | Core Document AI Term | Analyzes document structure to extract specific data fields in a usable format | Four Foundational Document AI Concepts Defined |
| NER | AI/ML Concept | Identifies and categorizes entities like names, dates, and amounts in text | AI and Machine Learning Concepts in Document Processing |
| Document Classification | AI/ML Concept | Automatically categorizes documents by type based on content or structure | AI and Machine Learning Concepts in Document Processing |
| Confidence Score | AI/ML Concept | Numerical value indicating how certain an AI model is about an extracted result | AI and Machine Learning Concepts in Document Processing |
| Data Extraction | AI/ML Concept | Automated retrieval of specific fields or values from a document | AI and Machine Learning Concepts in Document Processing |
| Model Training | AI/ML Concept | Teaching an AI system to recognize document patterns using labeled examples | AI and Machine Learning Concepts in Document Processing |
| Ingestion | Workflow Stage | Initial step of receiving and importing documents into a Document AI system | How Documents Move Through a Document AI Pipeline |
| Preprocessing | Workflow Stage | Prepares raw documents for AI processing through image cleanup and correction | How Documents Move Through a Document AI Pipeline |
| Annotation | Workflow Stage | Manual labeling of document data to build AI training datasets | How Documents Move Through a Document AI Pipeline |
| Validation | Workflow Stage | Verifies extracted data against business rules before passing it downstream | How Documents Move Through a Document AI Pipeline |
| Post-Processing | Workflow Stage | Formats, enriches, or routes extracted data to target systems | How Documents Move Through a Document AI Pipeline |
Four Foundational Document AI Concepts Defined
This section covers the foundational terms you are most likely to encounter when first exploring Document AI tools, research, or vendor documentation, especially if you are new to broader document understanding concepts. Each term is defined in plain language and paired with a real-world example to make the idea concrete.
The table below presents all four core terms in a consolidated format for fast reference. Full context and additional detail follow the table.
| Term | Full Name | Plain-Language Definition | Real-World Example |
|---|---|---|---|
| OCR | Optical Character Recognition | Technology that converts typed, handwritten, or printed text from images or scanned documents into machine-readable text | Scanning a paper invoice so that the vendor name, line items, and total automatically appear as editable, searchable text in your accounting system |
| IDP | Intelligent Document Processing | An advanced approach that combines OCR, NLP, and machine learning to extract, classify, and validate data from documents with minimal human intervention | Automatically processing hundreds of incoming supplier invoices each day—extracting key fields, routing them by document type, and flagging exceptions—without manual data entry |
| NLP | Natural Language Processing | The AI capability that enables machines to understand, interpret, and derive meaning from text within documents | A system reading a contract and identifying that a clause refers to a payment deadline, not just detecting the word "payment" as a string of characters |
| Document Parsing | Document Parsing | The process of analyzing a document's structure and content to extract specific data fields in a usable, structured format | Breaking a multi-page PDF purchase order into its component fields—PO number, line items, quantities, and delivery address—so each value can be stored separately in a database |
OCR (Optical Character Recognition)
OCR is the foundational technology that makes documents machine-readable. Without OCR, a scanned image of a form is just a picture—OCR is what converts it into text that software can process, search, and analyze.
Modern OCR engines handle a wide range of input quality, including low-resolution scans, handwritten notes, and multi-language documents. However, OCR alone does not understand what the text means—it only converts it. That distinction matters when evaluating Document AI tools.
IDP (Intelligent Document Processing)
IDP builds on OCR by adding layers of intelligence. Where OCR reads text, IDP understands it—classifying the document type, extracting relevant fields, and validating the output against expected formats or business rules.
IDP is the term most commonly used to describe enterprise-grade document automation platforms. It is also closely tied to the idea of end-to-end Document AI, where the system handles the full path from document intake through extraction, validation, and delivery rather than stopping at text recognition.
NLP (Natural Language Processing)
In the document context, NLP is what allows a system to distinguish a contract start date from a contract end date, even when both appear in similar sentence structures. It enables semantic understanding, not just pattern matching.
NLP is the component of Document AI that handles ambiguity, context, and language variation. It is what allows a system to correctly extract a "total amount due" field even when different invoices phrase that label differently.
Document Parsing
Document parsing refers specifically to the structural analysis of a document—identifying where sections, tables, headers, and data fields are located, and extracting their values in a format that downstream systems can consume.
Parsing is distinct from OCR in that it operates on the document's logical structure, not just its raw text. As newer approaches such as prompt-based document parsing become more common, parsing increasingly involves flexible reasoning about layout and context rather than relying only on rigid templates.
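To make the gap between raw text and parsed output concrete, here is a minimal sketch that turns OCR-style text into structured fields. The sample text, field names, and regular expressions are all hypothetical, and the approach is deliberately template-like; a layout-aware parser would reason about structure rather than depend on fixed label patterns.

```python
import re

# Hypothetical OCR output for a purchase order (illustrative only).
RAW_TEXT = """\
PURCHASE ORDER
PO Number: PO-10482
Delivery Address: 12 Harbor Road, Springfield
Item            Qty
Widget A        10
Widget B        4
"""

def parse_purchase_order(text: str) -> dict:
    """Turn flat text into structured fields (a template-style sketch)."""
    po_number = re.search(r"PO Number:\s*(\S+)", text)
    address = re.search(r"Delivery Address:\s*(.+)", text)
    # Line items: a name, at least two spaces or tabs, then an integer quantity.
    rows = re.findall(r"^(\w[\w ]*?)[ \t]{2,}(\d+)$", text, flags=re.MULTILINE)
    return {
        "po_number": po_number.group(1) if po_number else None,
        "delivery_address": address.group(1).strip() if address else None,
        "line_items": [{"item": name.strip(), "quantity": int(qty)} for name, qty in rows],
    }

parsed = parse_purchase_order(RAW_TEXT)
```

Each value in `parsed` can now be stored separately, which is the practical difference between parsed output and a block of recognized text.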
AI and Machine Learning Concepts in Document Processing
The terms in this section appear frequently in Document AI product interfaces, vendor documentation, and technical specifications. Understanding them in the document-specific context—rather than in the broader AI sense—is essential for evaluating tools and interpreting their outputs accurately.
The table below defines each concept, identifies where you are likely to encounter it in a Document AI tool, and explains why it matters for your workflow.
| Term | Document-Specific Definition | Where You'll See This in a Document AI Tool | Why It Matters for Your Workflow |
|---|---|---|---|
| NER (Named Entity Recognition) | Identifies and categorizes key entities within document text, such as names, dates, amounts, and addresses | Highlighted or tagged fields in an extraction results view, often color-coded by entity type | Enables precise extraction of structured data from unstructured text without requiring exact field labels |
| Document Classification | Automated categorization of documents by type (e.g., invoice, contract, receipt) based on content or structure | A "Document Type" label assigned to each uploaded file before extraction begins | Determines which extraction model or ruleset is applied, directly affecting accuracy |
| Confidence Score | A numerical value indicating how certain the AI model is about an extracted data point | A percentage displayed next to each extracted field; low-scoring fields are flagged for human review | Allows teams to focus manual review effort on uncertain results rather than reviewing every extraction |
| Data Extraction | Automated identification and retrieval of specific fields or values from a document | The core output of any Document AI tool—a structured list of field names and their extracted values | Replaces manual data entry; accuracy here determines the downstream reliability of your data |
| Model Training | The process of teaching an AI system to recognize patterns in documents using labeled example data | A training interface where users upload sample documents and label the fields they want the model to learn | Determines how well the system handles your specific document types; more labeled examples generally improve accuracy |
Named Entity Recognition (NER)
NER is the AI technique that identifies and categorizes key entities within document text—such as names, dates, monetary amounts, and addresses—and assigns each a semantic label. For example, a NER model reading a lease agreement would identify "March 1, 2025" as a date entity and "$3,500" as a monetary amount entity.
In Document AI, NER is particularly valuable for processing unstructured documents where data does not appear in consistent locations or labeled fields. In more complex files, extraction may also depend on multi-step document reasoning, where the system connects clues across sentences, sections, or pages before assigning the right meaning to a field.
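The shape of NER output is easier to see in code. The sketch below uses two regular expressions as a stand-in for a trained model, which is an assumption made purely for illustration; production NER relies on learned models rather than hand-written patterns. The `Entity` type, the label names, and the sample sentence are hypothetical.

```python
import re
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    start: int  # character offset where the entity begins

# Toy patterns standing in for a trained NER model (illustration only).
PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}

def tag_entities(text: str) -> list[Entity]:
    """Return every pattern match as a labeled entity, in reading order."""
    found = [
        Entity(match.group(), label, match.start())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda entity: entity.start)

sentence = "The lease begins on March 1, 2025 with monthly rent of $3,500."
entities = tag_entities(sentence)
```

The result pairs each span with a semantic label and a position, which is what an extraction view typically highlights.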
Document Classification
Document classification is the automated process of categorizing an incoming document by type before any extraction occurs. A system might classify an uploaded file as an invoice, a purchase order, a W-2 form, or a legal contract—and then apply the appropriate extraction logic for that document type.
Classification accuracy is foundational to the rest of the pipeline. If a document is misclassified, the wrong extraction model may be applied, resulting in incorrect or missing data. For a deeper look at why this step matters so much in production workflows, it helps to review how AI document classification affects routing, model selection, and downstream accuracy.
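The dependency between classification and extraction can be sketched as a simple routing step. This example assumes a keyword-count classifier and placeholder extractors; the keyword lists, extractor names, and the `needs_review` flag are hypothetical, and a real platform would normally use a trained classifier instead.

```python
# Hypothetical keyword lists per document type (illustration only).
KEYWORDS = {
    "invoice": ["invoice number", "amount due", "bill to"],
    "purchase_order": ["po number", "ship to", "order date"],
    "contract": ["agreement", "party", "term of"],
}

def classify(text: str) -> str:
    """Pick the type whose keywords appear most often; 'unknown' if none match."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# Each document type maps to its own extraction routine (placeholders here).
EXTRACTORS = {
    "invoice": lambda text: {"model": "invoice_extractor"},
    "purchase_order": lambda text: {"model": "po_extractor"},
    "contract": lambda text: {"model": "contract_extractor"},
}

def route(text: str) -> dict:
    """Classify a document, then hand it to the matching extractor."""
    doc_type = classify(text)
    extractor = EXTRACTORS.get(doc_type)
    if extractor is None:
        # Unclassified documents go to a person instead of the wrong model.
        return {"doc_type": doc_type, "model": None, "needs_review": True}
    return {"doc_type": doc_type, **extractor(text), "needs_review": False}
```

Because the extractor is chosen from the predicted type, a wrong label sends the document to the wrong extraction logic, which is why the fallback to review matters.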
Confidence Score
A confidence score is a numerical value, typically expressed as a percentage or as a decimal between 0 and 1, that an AI model assigns to each extracted data point to indicate how certain it is about that result. A confidence score of 97% on an extracted invoice total means the model is highly certain; a score of 54% signals that the result should be reviewed by a human.
Most Document AI platforms allow users to set a confidence threshold below which results are automatically routed for manual review. This mechanism is central to balancing automation speed with data accuracy.
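Threshold-based routing is simple to express. The following sketch assumes extracted fields arrive as dictionaries with a `confidence` value between 0 and 1; that field layout and the 0.80 cutoff are hypothetical and would be tuned per field and per risk level in practice.

```python
REVIEW_THRESHOLD = 0.80  # hypothetical cutoff, not a recommended value

def split_by_confidence(fields: list[dict], threshold: float = REVIEW_THRESHOLD):
    """Separate auto-accepted fields from those needing human review."""
    accepted = [f for f in fields if f["confidence"] >= threshold]
    review = [f for f in fields if f["confidence"] < threshold]
    return accepted, review

extracted = [
    {"name": "invoice_total", "value": "1,250.00", "confidence": 0.97},
    {"name": "vendor_name", "value": "Acme Corp", "confidence": 0.54},
]
accepted, review = split_by_confidence(extracted)
```

Raising the threshold sends more fields to reviewers and lowers the risk of bad data passing through; lowering it does the opposite.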
Data Extraction
Data extraction is the core function of any Document AI system: the automated identification and retrieval of specific fields or values from a document. For an invoice, this might include the vendor name, invoice number, line items, tax amount, and total due.
Extraction can be template-based (relying on fixed field positions) or model-based (using AI to locate fields regardless of layout variation). Model-based extraction is more flexible and is the standard approach in modern IDP platforms.
Model Training
Model training in the Document AI context is the process of teaching an AI system to recognize and extract specific patterns from documents by exposing it to labeled examples. A user uploads sample invoices, manually labels the fields of interest (e.g., "this text is the invoice number"), and the model learns to identify those fields in new, unseen documents.
The quality and quantity of training data directly affect extraction accuracy. Most platforms require a minimum number of labeled examples per document type before a model can be deployed reliably.
How Documents Move Through a Document AI Pipeline
Understanding how documents move through a Document AI system—from initial receipt to final data delivery—is essential for evaluating solutions, communicating with vendors, and diagnosing problems in a live implementation. Teams often pair a high-level glossary like this one with a more implementation-oriented reference such as the LlamaParse developer glossary when translating concepts into actual workflow design.
The table below maps each workflow stage sequentially, describing what happens at each step, the key activities involved, and the input and output at each stage.
| Stage | Definition | Key Activities | Input → Output |
|---|---|---|---|
| 1. Ingestion | The initial step of receiving and importing documents into a Document AI system from various sources | Connecting to email inboxes, scanners, cloud storage, or APIs; file format detection and routing | Raw document files (PDF, image, email attachment) → Documents queued for processing |
| 2. Preprocessing | Preparation of raw documents for AI processing to improve recognition accuracy | Image enhancement, deskewing, noise removal, resolution normalization, page segmentation | Raw or low-quality document image → Cleaned, optimized image ready for AI analysis |
| 3. Annotation | Manual labeling of document data to create training datasets for AI models | Selecting text regions, assigning field labels, reviewing and correcting labels, exporting labeled datasets | Unlabeled sample documents → Labeled training data |
| 4. Extraction | Automated identification and retrieval of data fields from preprocessed documents | Applying OCR, NER, and parsing models; generating field-value pairs; assigning confidence scores | Preprocessed document → Structured data fields with confidence scores |
| 5. Validation | Verification of extracted data against business rules or expected formats | Checking field formats (e.g., date patterns, numeric ranges), cross-referencing against master data, flagging anomalies | Extracted data fields → Verified data or flagged exceptions for human review |
| 6. Post-Processing | Final formatting, enrichment, and routing of validated data to target systems | Data transformation, deduplication, field mapping, API calls to ERPs or databases | Validated structured data → Delivered records in target system format |
Ingestion
Ingestion is the entry point of the Document AI pipeline. It encompasses every mechanism by which documents enter the system—whether uploaded manually, pulled from a shared drive, received via email, or captured through a scanner integration.
The ingestion layer also handles file format detection and initial routing. A well-designed ingestion setup ensures that documents from diverse sources are normalized into a consistent format before any processing begins.
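File format detection at ingestion is often done by inspecting a file's leading bytes, since extensions can be missing or wrong. The sketch below checks a few well-known file signatures; the queue structure and function names are hypothetical, and a real ingestion layer would cover many more formats.

```python
# Leading-byte signatures for a few common document formats.
MAGIC_BYTES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]

def detect_format(data: bytes) -> str:
    """Identify a file by its leading bytes rather than its extension."""
    for signature, fmt in MAGIC_BYTES:
        if data.startswith(signature):
            return fmt
    return "unknown"

def enqueue(name: str, data: bytes, queue: list) -> None:
    """Record the detected format and add the document to a processing queue."""
    queue.append({"name": name, "format": detect_format(data), "size": len(data)})

queue: list[dict] = []
enqueue("scan_001.bin", b"%PDF-1.7\n...", queue)
enqueue("photo.dat", b"\xff\xd8\xff\xe0 rest of file", queue)
```

Tagging each document with its detected format up front lets later stages pick the right handling path without re-inspecting the file.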
Preprocessing
Preprocessing addresses the reality that real-world documents are rarely perfect. Scanned pages may be skewed, faded, or low-resolution; photographs of documents may have shadows or distortion. Preprocessing applies corrective operations—such as deskewing, contrast enhancement, and noise removal—to improve the accuracy of downstream OCR and extraction.
Skipping or underinvesting in preprocessing is one of the most common causes of poor extraction accuracy. The quality of the preprocessed image directly determines the ceiling of what the AI can reliably extract.
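A small example shows why cleanup affects what downstream steps can recover. This sketch applies a min-max contrast stretch followed by a fixed-threshold binarization to a tiny grayscale grid in plain Python. The pixel values and the threshold of 128 are hypothetical, and real pipelines would typically use an imaging library and more robust methods.

```python
def stretch_contrast(image: list[list[int]]) -> list[list[int]]:
    """Rescale grayscale values so the darkest pixel is 0 and the brightest is 255."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    if high == low:  # uniform image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - low) * 255 / (high - low)) for p in row] for row in image]

def binarize(image: list[list[int]], threshold: int = 128) -> list[list[int]]:
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]

# A faded scan: "ink" pixels (about 150) are barely darker than paper (about 200).
faded = [
    [150, 152, 200],
    [151, 198, 202],
]

without_cleanup = binarize(faded)                # ink is lost, every pixel turns white
cleaned = binarize(stretch_contrast(faded))      # ink and paper separate cleanly
```

Thresholding the faded image directly erases the ink, while stretching first preserves it, which is the sense in which preprocessing sets the ceiling for extraction.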
Annotation
Annotation is the human-in-the-loop step that makes model training possible. Subject matter experts review sample documents and manually label the data fields the AI should learn to extract—for example, drawing a bounding box around an invoice number and tagging it as "invoice_number."
Because label quality has such a direct effect on model performance, it helps to think in terms of structured annotation for Document AI rather than ad hoc review. Inconsistent or incorrect labels produce models that learn the wrong patterns, leading to systematic extraction errors.
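One way to keep labels consistent is to validate each annotation against a fixed schema at the moment it is created. The record layout below (page, label, text, pixel bounding box) and the allowed label set are assumptions for illustration, not a standard annotation format.

```python
from dataclasses import dataclass, asdict
import json

ALLOWED_LABELS = {"invoice_number", "invoice_date", "total_due"}  # hypothetical schema

@dataclass(frozen=True)
class Annotation:
    page: int
    label: str
    text: str
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

    def __post_init__(self):
        # Reject labels outside the schema so typos never reach training data.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        x_min, y_min, x_max, y_max = self.box
        if x_min >= x_max or y_min >= y_max:
            raise ValueError("box must have positive width and height")

ann = Annotation(page=1, label="invoice_number", text="INV-2041", box=(410, 88, 560, 112))
record = json.dumps(asdict(ann))  # serialized form for a labeled dataset export
```

Rejecting a label such as "Invoice No" at creation time is cheaper than discovering after training that two spellings were treated as different fields.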
Validation
Validation is the quality control step of the pipeline. After extraction, the system checks each extracted value against predefined rules—verifying that a date field contains a valid date, that a total matches the sum of line items, or that a vendor name exists in an approved supplier list.
Results that fail validation rules are flagged for human review rather than passed downstream automatically. This step is what allows organizations to maintain data quality standards while still achieving high levels of automation.
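The three checks described above can be sketched as plain rules that return a list of violations. The invoice layout, the ISO date format, and the approved vendor set are hypothetical; amounts use `Decimal` so that the sum comparison is not affected by floating-point rounding.

```python
from datetime import datetime
from decimal import Decimal

APPROVED_VENDORS = {"Acme Corp", "Globex Ltd"}  # hypothetical master data

def validate_invoice(invoice: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the invoice passes."""
    errors = []
    try:
        datetime.strptime(invoice["invoice_date"], "%Y-%m-%d")
    except ValueError:
        errors.append("invoice_date is not a valid YYYY-MM-DD date")
    line_sum = sum(Decimal(amount) for amount in invoice["line_items"])
    if line_sum != Decimal(invoice["total"]):
        errors.append(f"total {invoice['total']} does not match line items {line_sum}")
    if invoice["vendor"] not in APPROVED_VENDORS:
        errors.append("vendor is not on the approved supplier list")
    return errors

good = {"invoice_date": "2025-03-01", "line_items": ["100.00", "25.50"],
        "total": "125.50", "vendor": "Acme Corp"}
bad = {"invoice_date": "2025-02-30", "line_items": ["100.00", "25.50"],
       "total": "130.00", "vendor": "Initech"}
```

An invoice with an empty error list can flow downstream automatically, while any non-empty list becomes the explanation shown to a reviewer.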
Post-Processing
Post-processing is the final stage, where validated data is converted into the format required by the target system and delivered to its destination. This may involve mapping extracted field names to database column names, converting date formats, triggering API calls to an ERP, or generating a structured output file.
Post-processing is often underestimated in implementation planning. Even when extraction and validation work correctly, data that arrives in the wrong format or through the wrong integration pathway can cause downstream failures.
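Field mapping and format conversion can be illustrated with a short transformation function. The source field labels, target column names, and the assumption that incoming dates use MM/DD/YYYY are all hypothetical; a real integration would take these from the target system's schema.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from extracted field labels to target column names.
FIELD_MAP = {
    "Invoice No": "invoice_number",
    "Invoice Date": "invoice_date",
    "Total Due": "total_amount",
}

def to_target_record(extracted: dict) -> dict:
    """Rename fields, drop unmapped ones, and normalize date and amount formats."""
    record = {FIELD_MAP[k]: v for k, v in extracted.items() if k in FIELD_MAP}
    if "invoice_date" in record:
        # Assumes the source uses MM/DD/YYYY; the target expects ISO 8601.
        parsed = datetime.strptime(record["invoice_date"], "%m/%d/%Y")
        record["invoice_date"] = parsed.date().isoformat()
    if "total_amount" in record:
        # Strip currency formatting so the value can be stored as a number.
        cleaned = record["total_amount"].replace("$", "").replace(",", "")
        record["total_amount"] = Decimal(cleaned)
    return record

row = to_target_record({
    "Invoice No": "INV-2041",
    "Invoice Date": "03/01/2025",
    "Total Due": "$3,500.00",
    "Notes": "n/a",  # no target column, so it is dropped
})
```

Keeping the mapping in one table makes it easier to spot when a target system changes a column name or expects a different date format.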
Final Thoughts
The terminology covered in this article—from foundational concepts like OCR, IDP, and document parsing, to AI/ML techniques like NER and confidence scoring, to pipeline stages like ingestion, validation, and post-processing—forms the working vocabulary of Document AI. Familiarity with these terms enables clearer communication with vendors, more accurate evaluation of platform capabilities, and faster diagnosis of issues in live implementations. Understanding how each concept connects to the others, particularly how workflow stages depend sequentially on one another, is what turns a list of definitions into a practical mental model for working with these systems.