What Is a Document AI Glossary?

Document AI systems have changed how organizations handle paperwork, contracts, invoices, and forms—but the terminology surrounding these systems can be a real barrier. Whether you are evaluating a vendor platform such as Google Document AI, onboarding a new tool like LlamaParse, or building a Document AI pipeline from scratch, understanding the language of the field is a prerequisite for making informed decisions. This Document AI glossary defines the most important terms in plain language, organized by category, so you can quickly find and apply the concepts most relevant to your work.

As intelligent document processing continues to evolve, teams need a shared vocabulary to compare vendors, interpret product claims, and design workflows that actually hold up in production. The goal of this guide is to make that vocabulary practical rather than abstract.

Quick Reference: All Document AI Terms at a Glance

The table below consolidates every term covered in this article into a single scannable view. Use it to locate a specific term quickly or to get a high-level sense of the article's scope before reading in depth.

Term	Category	One-Line Summary	Section
OCR	Core Document AI Term	Converts text in images or scanned documents into machine-readable format	Four Foundational Document AI Concepts Defined
IDP	Core Document AI Term	Combines OCR, NLP, and ML to automate end-to-end document processing	Four Foundational Document AI Concepts Defined
NLP	Core Document AI Term	Enables machines to understand and interpret meaning from document text	Four Foundational Document AI Concepts Defined
Document Parsing	Core Document AI Term	Analyzes document structure to extract specific data fields in a usable format	Four Foundational Document AI Concepts Defined
NER	AI/ML Concept	Identifies and categorizes entities like names, dates, and amounts in text	AI and Machine Learning Concepts in Document Processing
Document Classification	AI/ML Concept	Automatically categorizes documents by type based on content or structure	AI and Machine Learning Concepts in Document Processing
Confidence Score	AI/ML Concept	Numerical value indicating how certain an AI model is about an extracted result	AI and Machine Learning Concepts in Document Processing
Data Extraction	AI/ML Concept	Automated retrieval of specific fields or values from a document	AI and Machine Learning Concepts in Document Processing
Model Training	AI/ML Concept	Teaching an AI system to recognize document patterns using labeled examples	AI and Machine Learning Concepts in Document Processing
Ingestion	Workflow Stage	Initial step of receiving and importing documents into a Document AI system	How Documents Move Through a Document AI Pipeline
Preprocessing	Workflow Stage	Prepares raw documents for AI processing through image cleanup and correction	How Documents Move Through a Document AI Pipeline
Annotation	Workflow Stage	Manual labeling of document data to build AI training datasets	How Documents Move Through a Document AI Pipeline
Validation	Workflow Stage	Verifies extracted data against business rules before passing it downstream	How Documents Move Through a Document AI Pipeline
Post-Processing	Workflow Stage	Formats, enriches, or routes extracted data to target systems	How Documents Move Through a Document AI Pipeline

Four Foundational Document AI Concepts Defined

This section covers the foundational terms you are most likely to encounter when first exploring Document AI tools, research, or vendor documentation, especially if you are new to broader document understanding concepts. Each term is defined in plain language and paired with a real-world example to make the idea concrete.

The table below presents all four core terms in a consolidated format for fast reference. Full context and additional detail follow the table.

Term	Full Name	Plain-Language Definition	Real-World Example
OCR	Optical Character Recognition	Technology that converts typed, handwritten, or printed text from images or scanned documents into machine-readable text	Scanning a paper invoice so that the vendor name, line items, and total automatically appear as editable, searchable text in your accounting system
IDP	Intelligent Document Processing	An advanced approach that combines OCR, NLP, and machine learning to extract, classify, and validate data from documents with minimal human intervention	Automatically processing hundreds of incoming supplier invoices each day—extracting key fields, routing them by document type, and flagging exceptions—without manual data entry
NLP	Natural Language Processing	The AI capability that enables machines to understand, interpret, and derive meaning from text within documents	A system reading a contract and identifying that a clause refers to a payment deadline, not just detecting the word "payment" as a string of characters
Document Parsing	Document Parsing	The process of analyzing a document's structure and content to extract specific data fields in a usable, structured format	Breaking a multi-page PDF purchase order into its component fields—PO number, line items, quantities, and delivery address—so each value can be stored separately in a database

OCR (Optical Character Recognition)

OCR is the foundational technology that makes documents machine-readable. Without OCR, a scanned image of a form is just a picture—OCR is what converts it into text that software can process, search, and analyze.

Modern OCR engines handle a wide range of inputs, including low-resolution scans, handwritten notes, and multi-language documents, although accuracy generally drops as input quality degrades. OCR alone does not understand what the text means; it only converts it. That distinction matters when evaluating Document AI tools.

IDP (Intelligent Document Processing)

IDP builds on OCR by adding layers of intelligence. Where OCR reads text, IDP understands it—classifying the document type, extracting relevant fields, and validating the output against expected formats or business rules.

IDP is the term most commonly used to describe enterprise-grade document automation platforms. It is also closely tied to the idea of end-to-end Document AI, where the system handles the full path from document intake through extraction, validation, and delivery rather than stopping at text recognition.

NLP (Natural Language Processing)

In the document context, NLP is what allows a system to distinguish a contract start date from a contract end date, even when both appear in similar sentence structures. It enables semantic understanding, not just pattern matching.

NLP is the component of Document AI that handles ambiguity, context, and language variation. It is what allows a system to correctly extract a "total amount due" field even when different invoices phrase that label differently.
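The label variation described above can be made concrete with a deliberately simple sketch. It is a rule-based stand-in, not an NLP model: the synonym lists and the names `LABEL_SYNONYMS`, `normalize_label`, and `canonical_field` are assumptions made for illustration, and a real system would learn this variation from data rather than enumerate it.

```python
import re
from typing import Optional

# Hypothetical label variants that different invoices might use for the same field.
LABEL_SYNONYMS = {
    "total_amount_due": ["total amount due", "amount due", "balance due", "total due"],
    "invoice_number": ["invoice number", "invoice no", "invoice #"],
}

def normalize_label(raw: str) -> str:
    """Lowercase, replace punctuation other than '#' with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9# ]+", " ", raw.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def canonical_field(raw_label: str) -> Optional[str]:
    """Map a printed label to a canonical field name, or None if it is not recognized."""
    label = normalize_label(raw_label)
    for field, variants in LABEL_SYNONYMS.items():
        if label in variants:
            return field
    return None

print(canonical_field("Balance Due:"))  # total_amount_due
print(canonical_field("Invoice No."))   # invoice_number
print(canonical_field("Ship To"))       # None
```

The limitation is the point: any phrasing missing from the list falls through to `None`, which is the gap that language-aware models are meant to close.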

Document Parsing

Document parsing refers specifically to the structural analysis of a document—identifying where sections, tables, headers, and data fields are located, and extracting their values in a format that downstream systems can consume.

Parsing is distinct from OCR in that it operates on the document's logical structure, not just its raw text. As newer approaches such as prompt-based document parsing become more common, parsing increasingly involves flexible reasoning about layout and context rather than relying only on rigid templates.
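To show what "extracting values in a format downstream systems can consume" looks like, here is a minimal sketch that parses plain text from a purchase order into structured fields. The sample layout, the field names, and the function `parse_purchase_order` are all hypothetical, and the regular expressions represent the rigid, template-style approach rather than the prompt-based parsing mentioned above.

```python
import re

# Hypothetical text, as it might look after OCR of a simple purchase order.
SAMPLE_PO = """\
PO Number: PO-10442
Delivery Address: 12 Harbor Road, Springfield
Items:
2 x Widget A
10 x Bracket B
"""

def parse_purchase_order(text: str) -> dict:
    """Pull the PO number, delivery address, and line items into a dictionary."""
    po = re.search(r"^PO Number:\s*(\S+)", text, re.MULTILINE)
    address = re.search(r"^Delivery Address:\s*(.+)$", text, re.MULTILINE)
    # Line items are assumed to look like "<quantity> x <description>".
    items = [
        {"quantity": int(qty), "description": desc.strip()}
        for qty, desc in re.findall(r"^(\d+)\s*x\s+(.+)$", text, re.MULTILINE)
    ]
    return {
        "po_number": po.group(1) if po else None,
        "delivery_address": address.group(1).strip() if address else None,
        "line_items": items,
    }

print(parse_purchase_order(SAMPLE_PO))
```

Each value ends up under its own key, so it can be stored in a separate database column. A layout change would require new patterns, which is why layout-aware and prompt-based parsers are attractive for varied documents.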

AI and Machine Learning Concepts in Document Processing

The terms in this section appear frequently in Document AI product interfaces, vendor documentation, and technical specifications. Understanding them in the document-specific context—rather than in the broader AI sense—is essential for evaluating tools and interpreting their outputs accurately.

The table below defines each concept, identifies where you are likely to encounter it in a Document AI tool, and explains why it matters for your workflow.

Term	Document-Specific Definition	Where You'll See This in a Document AI Tool	Why It Matters for Your Workflow
NER (Named Entity Recognition)	Identifies and categorizes key entities within document text, such as names, dates, amounts, and addresses	Highlighted or tagged fields in an extraction results view, often color-coded by entity type	Enables precise extraction of structured data from unstructured text without requiring exact field labels
Document Classification	Automated categorization of documents by type (e.g., invoice, contract, receipt) based on content or structure	A "Document Type" label assigned to each uploaded file before extraction begins	Determines which extraction model or ruleset is applied, directly affecting accuracy
Confidence Score	A numerical value indicating how certain the AI model is about an extracted data point	A percentage displayed next to each extracted field; low-scoring fields are flagged for human review	Allows teams to focus manual review effort on uncertain results rather than reviewing every extraction
Data Extraction	Automated identification and retrieval of specific fields or values from a document	The core output of any Document AI tool—a structured list of field names and their extracted values	Replaces manual data entry; accuracy here determines the downstream reliability of your data
Model Training	The process of teaching an AI system to recognize patterns in documents using labeled example data	A training interface where users upload sample documents and label the fields they want the model to learn	Determines how well the system handles your specific document types; more labeled examples generally improve accuracy

Named Entity Recognition (NER)

NER is the AI technique that identifies and categorizes key entities within document text—such as names, dates, monetary amounts, and addresses—and assigns each a semantic label. For example, a NER model reading a lease agreement would identify "March 1, 2025" as a date entity and "$3,500" as a monetary amount entity.

In Document AI, NER is particularly valuable for processing unstructured documents where data does not appear in consistent locations or labeled fields. In more complex files, extraction may also depend on multi-step document reasoning, where the system connects clues across sentences, sections, or pages before assigning the right meaning to a field.
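The lease example above can be sketched in a few lines. This is not a trained NER model: it uses two hand-written regular expressions as a stand-in, and the pattern set `ENTITY_PATTERNS` and the function `tag_entities` are illustrative assumptions. It only shows the shape of NER output, which is spans of text paired with semantic labels.

```python
import re

# Hypothetical rules; a trained NER model would replace these patterns.
ENTITY_PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}

def tag_entities(text: str) -> list:
    """Return (entity_text, label) pairs in the order they appear in the text."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _, entity, label in found]

sentence = "The lease begins on March 1, 2025 with monthly rent of $3,500."
print(tag_entities(sentence))
# [('March 1, 2025', 'DATE'), ('$3,500', 'MONEY')]
```

A learned model earns its keep where these rules fail: dates written as "the first of March", amounts spelled out in words, or entities whose role depends on context elsewhere in the document.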

Document Classification

Document classification is the automated process of categorizing an incoming document by type before any extraction occurs. A system might classify an uploaded file as an invoice, a purchase order, a W-2 form, or a legal contract—and then apply the appropriate extraction logic for that document type.

Classification accuracy is foundational to the rest of the pipeline. If a document is misclassified, the wrong extraction model may be applied, resulting in incorrect or missing data. For a deeper look at why this step matters so much in production workflows, it helps to review how AI document classification affects routing, model selection, and downstream accuracy.
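The dependency between classification and extraction can be illustrated with a small sketch. The keyword lists, the extractor names in `ROUTES`, and the functions `classify` and `route` are hypothetical; production systems typically use a trained classifier rather than keyword counts, but the routing logic has the same shape.

```python
# Hypothetical keyword lists per document type.
KEYWORDS = {
    "invoice": ["invoice", "amount due", "bill to"],
    "purchase_order": ["purchase order", "po number", "ship to"],
    "contract": ["agreement", "party", "hereinafter"],
}

# Hypothetical names of the extraction rulesets applied after classification.
ROUTES = {
    "invoice": "invoice_extractor",
    "purchase_order": "po_extractor",
    "contract": "contract_extractor",
}

def classify(text: str) -> str:
    """Score each type by keyword occurrences; return 'unknown' if nothing matches."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def route(text: str) -> str:
    """Pick the extractor for a document; unrecognized types go to manual review."""
    return ROUTES.get(classify(text), "manual_review")

print(route("INVOICE #881\nBill To: Acme Corp\nAmount Due: $120.00"))  # invoice_extractor
print(route("Lunch menu for Friday"))                                  # manual_review
```

Because `route` depends entirely on the output of `classify`, a misclassified document is sent to an extractor that looks for the wrong fields, which is the failure mode described above.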

Confidence Score

A confidence score is a numerical value, typically expressed as a percentage or as a decimal between 0 and 1, that an AI model assigns to each extracted data point to indicate how certain it is about that result. A confidence score of 97% on an extracted invoice total means the model is highly certain; a score of 54% signals that the result should be reviewed by a human. A high score reflects the model's own certainty rather than a guarantee of correctness, so high-confidence fields can still be wrong.

Most Document AI platforms allow users to set a confidence threshold below which results are automatically routed for manual review. This mechanism is central to balancing automation speed with data accuracy.
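The threshold mechanism can be sketched as follows. The `ExtractedField` structure, the function `split_by_confidence`, and the 0.80 default threshold are assumptions chosen for illustration; actual platforms expose this setting in their own ways, and the right threshold depends on how costly an error is for a given field.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale

def split_by_confidence(fields, threshold=0.80):
    """Separate fields that can pass straight through from those needing human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review

fields = [
    ExtractedField("invoice_total", "1,240.00", 0.97),
    ExtractedField("vendor_name", "Acrne Corp", 0.54),  # likely OCR error, low confidence
]
accepted, needs_review = split_by_confidence(fields)
print([f.name for f in accepted])      # ['invoice_total']
print([f.name for f in needs_review])  # ['vendor_name']
```

Raising the threshold sends more fields to reviewers and lowers the risk of silent errors; lowering it does the opposite. That trade-off is the balance between automation speed and data accuracy.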

Data Extraction

Data extraction is the core function of any Document AI system: the automated identification and retrieval of specific fields or values from a document. For an invoice, this might include the vendor name, invoice number, line items, tax amount, and total due.

Extraction can be template-based (relying on fixed field positions) or model-based (using AI to locate fields regardless of layout variation). Model-based extraction is more flexible and is the standard approach in modern IDP platforms.
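A short sketch shows why template-based extraction is brittle. The template format (line index plus character columns), the sample text, and the function `extract_with_template` are hypothetical simplifications; real templates usually work on page coordinates, but the failure mode is the same.

```python
# Hypothetical template: field name -> (line index, start column, end column).
TEMPLATE = {
    "invoice_number": (0, 9, 15),
    "total": (2, 7, 13),
}

def extract_with_template(lines, template):
    """Read each field from its fixed position; the content itself is never inspected."""
    result = {}
    for field, (line_idx, start, end) in template.items():
        line = lines[line_idx] if line_idx < len(lines) else ""
        result[field] = line[start:end].strip()
    return result

expected_layout = ["Invoice: INV-42", "Vendor: Acme", "Total: 120.00"]
shifted_layout = ["ACME CORP"] + expected_layout  # same data, one extra header line

print(extract_with_template(expected_layout, TEMPLATE))
# {'invoice_number': 'INV-42', 'total': '120.00'}
print(extract_with_template(shifted_layout, TEMPLATE))
# {'invoice_number': '', 'total': 'Acme'}
```

A single added header line makes the template return an empty invoice number and a vendor name in place of the total, without raising any error. Model-based extraction aims to locate fields by what they are rather than where they sit.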

Model Training

Model training in the Document AI context is the process of teaching an AI system to recognize and extract specific patterns from documents by exposing it to labeled examples. A user uploads sample invoices, manually labels the fields of interest (e.g., "this text is the invoice number"), and the model learns to identify those fields in new, unseen documents.

The quality and quantity of training data directly affect extraction accuracy. Most platforms require a minimum number of labeled examples per document type before a model can be deployed reliably.

How Documents Move Through a Document AI Pipeline

Understanding how documents move through a Document AI system—from initial receipt to final data delivery—is essential for evaluating solutions, communicating with vendors, and diagnosing problems in a live implementation. Teams often pair a high-level glossary like this one with a more implementation-oriented reference such as the LlamaParse developer glossary when translating concepts into actual workflow design.

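The table below lists the workflow stages in order, describing what happens at each step, the key activities involved, and the input and output at each stage. Annotation is the one exception to the strict sequence: it is a training-time activity that produces labeled data for model training, not a step every production document passes through.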

Stage	Definition	Key Activities	Input → Output
1. Ingestion	The initial step of receiving and importing documents into a Document AI system from various sources	Connecting to email inboxes, scanners, cloud storage, or APIs; file format detection and routing	Raw document files (PDF, image, email attachment) → Documents queued for processing
2. Preprocessing	Preparation of raw documents for AI processing to improve recognition accuracy	Image enhancement, deskewing, noise removal, resolution normalization, page segmentation	Raw or low-quality document image → Cleaned, optimized image ready for AI analysis
3. Annotation	Manual labeling of document data to create training datasets for AI models	Selecting text regions, assigning field labels, reviewing and correcting labels, exporting labeled datasets	Unlabeled sample documents → Labeled training data
4. Extraction	Automated identification and retrieval of data fields from preprocessed documents	Applying OCR, NER, and parsing models; generating field-value pairs; assigning confidence scores	Preprocessed document → Structured data fields with confidence scores
5. Validation	Verification of extracted data against business rules or expected formats	Checking field formats (e.g., date patterns, numeric ranges), cross-referencing against master data, flagging anomalies	Extracted data fields → Verified data or flagged exceptions for human review
6. Post-Processing	Final formatting, enrichment, and routing of validated data to target systems	Data transformation, deduplication, field mapping, API calls to ERPs or databases	Validated structured data → Delivered records in target system format

Ingestion

Ingestion is the entry point of the Document AI pipeline. It encompasses every mechanism by which documents enter the system—whether uploaded manually, pulled from a shared drive, received via email, or captured through a scanner integration.

The ingestion layer also handles file format detection and initial routing. A well-designed ingestion setup ensures that documents from diverse sources are normalized into a consistent format before any processing begins.
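File format detection is often done by inspecting the first bytes of a file rather than trusting its extension. The sketch below checks a handful of well-known file signatures; the function name `detect_format` and the choice of formats are assumptions, and a production ingestion layer would cover many more types.

```python
# Well-known leading byte signatures ("magic numbers") for common document formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]

def detect_format(header: bytes) -> str:
    """Identify a file type from its first bytes; return 'unknown' if unrecognized."""
    for signature, file_format in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return file_format
    return "unknown"

print(detect_format(b"%PDF-1.7\n"))         # pdf
print(detect_format(b"\xff\xd8\xff\xe0"))   # jpeg
print(detect_format(b"Dear Sir or Madam"))  # unknown
```

In practice the header would come from reading the first few bytes of an uploaded file, for example with `open(path, "rb").read(8)`. Files that come back as `unknown` can be routed to a fallback queue instead of entering the processing path.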

Preprocessing

Preprocessing addresses the reality that real-world documents are rarely perfect. Scanned pages may be skewed, faded, or low-resolution; photographs of documents may have shadows or distortion. Preprocessing applies corrective operations—such as deskewing, contrast enhancement, and noise removal—to improve the accuracy of downstream OCR and extraction.

Skipping or underinvesting in preprocessing is one of the most common causes of poor extraction accuracy. The quality of the preprocessed image directly determines the ceiling of what the AI can reliably extract.
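The effect of one corrective operation, contrast enhancement, can be shown on a toy image. The sketch below treats an image as a nested list of grayscale values from 0 (black) to 255 (white); the functions `stretch_contrast` and `binarize` and the tiny 2x2 "scan" are illustrative assumptions. Real pipelines use an imaging library and also handle deskewing and noise removal, which are not shown here.

```python
def stretch_contrast(image):
    """Rescale pixel values so the darkest becomes 0 and the brightest becomes 255."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in image]
    return [[round((pixel - lo) * 255 / (hi - lo)) for pixel in row] for row in image]

def binarize(image, threshold=128):
    """Turn each pixel fully black (0) or fully white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]

# A faded scan: the "ink" (140, 160) is only slightly darker than the "paper" (180, 200).
faded = [[140, 160], [180, 200]]

print(binarize(faded))                    # [[255, 255], [255, 255]]  ink is lost
print(binarize(stretch_contrast(faded)))  # [[0, 0], [255, 255]]      ink is preserved
```

Without the stretch, thresholding erases the faded text entirely, and no downstream model can recover characters that are no longer in the image. That is the sense in which preprocessing sets the ceiling for extraction.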

Annotation

Annotation is the human-in-the-loop step that makes model training possible. Subject matter experts review sample documents and manually label the data fields the AI should learn to extract—for example, drawing a bounding box around an invoice number and tagging it as "invoice_number."

Because label quality has such a direct effect on model performance, it helps to think in terms of structured annotation for Document AI rather than ad hoc review. Inconsistent or incorrect labels produce models that learn the wrong patterns, leading to systematic extraction errors.
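One way to keep annotation structured is to check labels against an agreed schema before they reach training. The sketch below is a minimal illustration under stated assumptions: the `Annotation` record, the label set `ALLOWED_LABELS`, and the function `find_label_problems` are hypothetical and do not reflect any particular tool's export format.

```python
from dataclasses import dataclass

# Hypothetical label schema agreed on by the annotation team.
ALLOWED_LABELS = {"invoice_number", "invoice_date", "total"}

@dataclass
class Annotation:
    label: str
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels, origin at the top-left corner

def find_label_problems(annotations):
    """Return human-readable problems: unknown labels and empty or inverted boxes."""
    problems = []
    for index, annotation in enumerate(annotations):
        x0, y0, x1, y1 = annotation.box
        if annotation.label not in ALLOWED_LABELS:
            problems.append(f"annotation {index}: unknown label '{annotation.label}'")
        if x1 <= x0 or y1 <= y0:
            problems.append(f"annotation {index}: empty or inverted box")
    return problems

batch = [
    Annotation("invoice_number", "INV-42", (10, 10, 80, 25)),
    Annotation("Invoice_Number", "INV-43", (10, 10, 80, 25)),  # inconsistent casing
    Annotation("total", "120.00", (90, 40, 60, 55)),           # x1 is left of x0
]
for problem in find_label_problems(batch):
    print(problem)
```

A check like this catches mechanical inconsistencies such as label typos and malformed boxes. It cannot tell whether the right text was labeled, which still requires review by someone who knows the documents.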

Validation

Validation is the quality control step of the pipeline. After extraction, the system checks each extracted value against predefined rules—verifying that a date field contains a valid date, that a total matches the sum of line items, or that a vendor name exists in an approved supplier list.

Results that fail validation rules are flagged for human review rather than passed downstream automatically. This step is what allows organizations to maintain data quality standards while still achieving high levels of automation.
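The three checks mentioned above (a valid date, a total that matches the line items, and an approved vendor) can be sketched as a single validation function. The record layout, the `APPROVED_VENDORS` set, the `YYYY-MM-DD` date format, and the function `validate_invoice` are assumptions for illustration; real rules come from the business that owns the data.

```python
from datetime import datetime
from decimal import Decimal

APPROVED_VENDORS = {"Acme Corp", "Globex"}  # hypothetical supplier master list

def validate_invoice(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    try:
        datetime.strptime(record["invoice_date"], "%Y-%m-%d")
    except ValueError:
        errors.append("invoice_date is not a valid YYYY-MM-DD date")
    # Decimal avoids floating-point rounding when comparing money amounts.
    line_sum = sum(Decimal(amount) for amount in record["line_items"])
    if line_sum != Decimal(record["total"]):
        errors.append("total does not match the sum of line items")
    if record["vendor"] not in APPROVED_VENDORS:
        errors.append("vendor is not in the approved supplier list")
    return errors

record = {
    "invoice_date": "2025-02-30",  # February 30 does not exist
    "line_items": ["19.99", "5.01"],
    "total": "25.00",
    "vendor": "Acme Corp",
}
print(validate_invoice(record))
# ['invoice_date is not a valid YYYY-MM-DD date']
```

Returning every violation, rather than stopping at the first, gives a reviewer the full picture of what to fix in a flagged record.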

Post-Processing

Post-processing is the final stage, where validated data is converted into the format required by the target system and delivered to its destination. This may involve mapping extracted field names to database column names, converting date formats, triggering API calls to an ERP, or generating a structured output file.

Post-processing is often underestimated in implementation planning. Even when extraction and validation work correctly, data that arrives in the wrong format or through the wrong integration pathway can cause downstream failures.
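Field mapping and date conversion, two of the operations named above, can be sketched in a few lines. The source field names, the target column names in `FIELD_MAP`, the assumed `MM/DD/YYYY` input format, and the function `to_target_record` are hypothetical; the actual mapping depends on the target system's schema.

```python
from datetime import datetime

# Hypothetical mapping from extracted field names to target database columns.
FIELD_MAP = {
    "Invoice Number": "invoice_no",
    "Invoice Date": "invoice_dt",
    "Total": "amount_total",
}

def to_target_record(extracted: dict) -> dict:
    """Rename fields for the target system and convert the date to ISO 8601."""
    record = {}
    for source_name, value in extracted.items():
        target_name = FIELD_MAP.get(source_name)
        if target_name is None:
            continue  # fields with no destination column are dropped
        record[target_name] = value
    if "invoice_dt" in record:
        # Assumes the extracted date is in MM/DD/YYYY format.
        parsed = datetime.strptime(record["invoice_dt"], "%m/%d/%Y")
        record["invoice_dt"] = parsed.date().isoformat()
    return record

extracted = {
    "Invoice Number": "INV-42",
    "Invoice Date": "03/01/2025",
    "Total": "120.00",
    "Notes": "Thank you for your business",
}
print(to_target_record(extracted))
# {'invoice_no': 'INV-42', 'invoice_dt': '2025-03-01', 'amount_total': '120.00'}
```

The date assumption is exactly the kind of detail that causes downstream failures: if a document uses day-first dates, `03/01/2025` would be silently read as March 1 instead of January 3, so the expected format should be confirmed per source.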

Final Thoughts

The terminology covered in this article—from foundational concepts like OCR, IDP, and document parsing, to AI/ML techniques like NER and confidence scoring, to pipeline stages like ingestion, validation, and post-processing—forms the working vocabulary of Document AI. Familiarity with these terms enables clearer communication with vendors, more accurate evaluation of platform capabilities, and faster diagnosis of issues in live implementations. Understanding how each concept connects to the others, particularly how workflow stages depend sequentially on one another, is what turns a list of definitions into a practical mental model for working with these systems.
