
Document AI Glossary

Document AI systems have changed how organizations handle paperwork, contracts, invoices, and forms—but the terminology surrounding these systems can be a real barrier. Whether you are evaluating a vendor platform such as Google Document AI, onboarding a new tool like LlamaParse, or building a Document AI pipeline from scratch, understanding the language of the field is a prerequisite for making informed decisions. This Document AI glossary defines the most important terms in plain language, organized by category, so you can quickly find and apply the concepts most relevant to your work.

As intelligent document processing continues to evolve, teams need a shared vocabulary to compare vendors, interpret product claims, and design workflows that actually hold up in production. The goal of this guide is to make that vocabulary practical rather than abstract.

Quick Reference: All Document AI Terms at a Glance

The table below consolidates every term covered in this article into a single scannable view. Use it to locate a specific term quickly or to get a high-level sense of the article's scope before reading in depth.

| Term | Category | One-Line Summary | Section |
| --- | --- | --- | --- |
| OCR | Core Document AI Term | Converts text in images or scanned documents into machine-readable format | Core Document AI Terms |
| IDP | Core Document AI Term | Combines OCR, NLP, and ML to automate end-to-end document processing | Core Document AI Terms |
| NLP | Core Document AI Term | Enables machines to understand and interpret meaning from document text | Core Document AI Terms |
| Document Parsing | Core Document AI Term | Analyzes document structure to extract specific data fields in a usable format | Core Document AI Terms |
| NER | AI/ML Concept | Identifies and categorizes entities like names, dates, and amounts in text | AI/ML Concepts |
| Document Classification | AI/ML Concept | Automatically categorizes documents by type based on content or structure | AI/ML Concepts |
| Confidence Score | AI/ML Concept | Numerical value indicating how certain an AI model is about an extracted result | AI/ML Concepts |
| Data Extraction | AI/ML Concept | Automated retrieval of specific fields or values from a document | AI/ML Concepts |
| Model Training | AI/ML Concept | Teaching an AI system to recognize document patterns using labeled examples | AI/ML Concepts |
| Ingestion | Workflow Stage | Initial step of receiving and importing documents into a Document AI system | Workflow Terminology |
| Preprocessing | Workflow Stage | Prepares raw documents for AI processing through image cleanup and correction | Workflow Terminology |
| Annotation | Workflow Stage | Manual labeling of document data to build AI training datasets | Workflow Terminology |
| Validation | Workflow Stage | Verifies extracted data against business rules before passing it downstream | Workflow Terminology |
| Post-Processing | Workflow Stage | Formats, enriches, or routes extracted data to target systems | Workflow Terminology |

Four Foundational Document AI Concepts Defined

This section covers the foundational terms you are most likely to encounter when first exploring Document AI tools, research, or vendor documentation, especially if you are new to broader document understanding concepts. Each term is defined in plain language and paired with a real-world example to make the idea concrete.

The table below presents all four core terms in a consolidated format for fast reference. Full context and additional detail follow the table.

| Term | Full Name | Plain-Language Definition | Real-World Example |
| --- | --- | --- | --- |
| OCR | Optical Character Recognition | Technology that converts typed, handwritten, or printed text from images or scanned documents into machine-readable text | Scanning a paper invoice so that the vendor name, line items, and total automatically appear as editable, searchable text in your accounting system |
| IDP | Intelligent Document Processing | An advanced approach that combines OCR, NLP, and machine learning to extract, classify, and validate data from documents with minimal human intervention | Automatically processing hundreds of incoming supplier invoices each day (extracting key fields, routing them by document type, and flagging exceptions) without manual data entry |
| NLP | Natural Language Processing | The AI capability that enables machines to understand, interpret, and derive meaning from text within documents | A system reading a contract and identifying that a clause refers to a payment deadline, not just detecting the word "payment" as a string of characters |
| Document Parsing | Document Parsing | The process of analyzing a document's structure and content to extract specific data fields in a usable, structured format | Breaking a multi-page PDF purchase order into its component fields (PO number, line items, quantities, and delivery address) so each value can be stored separately in a database |

OCR (Optical Character Recognition)

OCR is the foundational technology that makes documents machine-readable. Without OCR, a scanned image of a form is just a picture—OCR is what converts it into text that software can process, search, and analyze.

Modern OCR engines handle a wide range of input quality, including low-resolution scans, handwritten notes, and multi-language documents. However, OCR alone does not understand what the text means—it only converts it. That distinction matters when evaluating Document AI tools.

IDP (Intelligent Document Processing)

IDP builds on OCR by adding layers of intelligence. Where OCR reads text, IDP understands it—classifying the document type, extracting relevant fields, and validating the output against expected formats or business rules.

IDP is the term most commonly used to describe enterprise-grade document automation platforms. It is also closely tied to the idea of end-to-end Document AI, where the system handles the full path from document intake through extraction, validation, and delivery rather than stopping at text recognition.

NLP (Natural Language Processing)

In the document context, NLP is what allows a system to distinguish between a date that represents a contract start date versus a contract end date, even when both appear in similar sentence structures. It enables semantic understanding, not just pattern matching.

NLP is the component of Document AI that handles ambiguity, context, and language variation. It is what allows a system to correctly extract a "total amount due" field even when different invoices phrase that label differently.

Document Parsing

Document parsing refers specifically to the structural analysis of a document—identifying where sections, tables, headers, and data fields are located, and extracting their values in a format that downstream systems can consume.

Parsing is distinct from OCR in that it operates on the document's logical structure, not just its raw text. As newer approaches such as prompt-based document parsing become more common, parsing increasingly involves flexible reasoning about layout and context rather than relying only on rigid templates.
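To make "extracting fields in a structured format" concrete, here is a minimal Python sketch that parses a plain-text purchase order into header fields and line items. The sample text, the field labels, and the `parse_purchase_order` helper are all hypothetical. A rigid, line-based approach like this one is the template-style baseline that more flexible parsers aim to improve on, not a description of how any particular product works.

```python
import re

# Hypothetical plain-text purchase order, as it might look after OCR.
SAMPLE_PO = """\
PO Number: PO-1042
Delivery Address: 12 Harbor Road, Springfield
Items:
Widget A x 10
Widget B x 3
"""

# Matches lines such as "Widget A x 10" (item name, then a quantity).
LINE_ITEM = re.compile(r"^(?P<name>.+?)\s+x\s+(?P<qty>\d+)$")


def parse_purchase_order(text: str) -> dict:
    """Split a simple 'Label: value' document into header fields and line items."""
    result = {"fields": {}, "line_items": []}
    in_items = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "Items:":
            # Everything after this marker is treated as a line item.
            in_items = True
            continue
        if in_items:
            match = LINE_ITEM.match(line)
            if match:
                result["line_items"].append(
                    {"name": match["name"], "quantity": int(match["qty"])}
                )
            continue
        label, sep, value = line.partition(":")
        if sep:
            result["fields"][label.strip()] = value.strip()
    return result


parsed = parse_purchase_order(SAMPLE_PO)
```

Each value ends up under its own key, which is what lets a downstream system store the PO number, address, and line items separately.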

AI and Machine Learning Concepts in Document Processing

The terms in this section appear frequently in Document AI product interfaces, vendor documentation, and technical specifications. Understanding them in the document-specific context—rather than in the broader AI sense—is essential for evaluating tools and interpreting their outputs accurately.

The table below defines each concept, identifies where you are likely to encounter it in a Document AI tool, and explains why it matters for your workflow.

| Term | Document-Specific Definition | Where You'll See This in a Document AI Tool | Why It Matters for Your Workflow |
| --- | --- | --- | --- |
| NER (Named Entity Recognition) | Identifies and categorizes key entities within document text, such as names, dates, amounts, and addresses | Highlighted or tagged fields in an extraction results view, often color-coded by entity type | Enables precise extraction of structured data from unstructured text without requiring exact field labels |
| Document Classification | Automated categorization of documents by type (e.g., invoice, contract, receipt) based on content or structure | A "Document Type" label assigned to each uploaded file before extraction begins | Determines which extraction model or ruleset is applied, directly affecting accuracy |
| Confidence Score | A numerical value indicating how certain the AI model is about an extracted data point | A percentage displayed next to each extracted field; low-scoring fields are flagged for human review | Allows teams to focus manual review effort on uncertain results rather than reviewing every extraction |
| Data Extraction | Automated identification and retrieval of specific fields or values from a document | The core output of any Document AI tool: a structured list of field names and their extracted values | Replaces manual data entry; accuracy here determines the downstream reliability of your data |
| Model Training | The process of teaching an AI system to recognize patterns in documents using labeled example data | A training interface where users upload sample documents and label the fields they want the model to learn | Determines how well the system handles your specific document types; more labeled examples generally improve accuracy |

Named Entity Recognition (NER)

NER is the AI technique that identifies and categorizes key entities within document text—such as names, dates, monetary amounts, and addresses—and assigns each a semantic label. For example, a NER model reading a lease agreement would identify "March 1, 2025" as a date entity and "$3,500" as a monetary amount entity.

In Document AI, NER is particularly valuable for processing unstructured documents where data does not appear in consistent locations or labeled fields. In more complex files, extraction may also depend on multi-step document reasoning, where the system connects clues across sentences, sections, or pages before assigning the right meaning to a field.
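The lease example above can be sketched in a few lines of Python. This is a rule-based stand-in that uses regular expressions, not a trained NER model; the `ENTITY_PATTERNS` table and `tag_entities` helper are hypothetical names chosen for illustration. It shows only the shape of NER output: spans of text paired with semantic labels.

```python
import re

# Rule-based stand-in for a trained NER model: each pattern maps to an entity label.
ENTITY_PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}


def tag_entities(text: str) -> list[tuple[str, str, int]]:
    """Return (label, matched_text, start_offset) tuples sorted by position."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


lease = "The lease begins on March 1, 2025 with monthly rent of $3,500."
entities = tag_entities(lease)
```

A learned model would replace the fixed patterns, which is what lets it cope with dates and amounts written in formats nobody anticipated.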

Document Classification

Document classification is the automated process of categorizing an incoming document by type before any extraction occurs. A system might classify an uploaded file as an invoice, a purchase order, a W-2 form, or a legal contract—and then apply the appropriate extraction logic for that document type.

Classification accuracy is foundational to the rest of the pipeline. If a document is misclassified, the wrong extraction model may be applied, resulting in incorrect or missing data. For a deeper look at why this step matters so much in production workflows, it helps to review how AI document classification affects routing, model selection, and downstream accuracy.
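As a rough illustration of classify-then-route, the sketch below scores a document against keyword lists and picks an extraction ruleset based on the result. The keyword lists, the ruleset names, and both helper functions are assumptions made for this example. Production classifiers are typically trained models rather than keyword counters.

```python
# Hypothetical keyword lists per document type; a real classifier would be trained.
KEYWORDS = {
    "invoice": ["invoice number", "amount due", "bill to"],
    "purchase_order": ["po number", "ship to", "quantity ordered"],
    "contract": ["agreement", "party", "governing law"],
}

# Hypothetical names for the extraction logic applied to each document type.
EXTRACTION_RULESETS = {
    "invoice": "invoice_fields_v1",
    "purchase_order": "po_fields_v1",
    "contract": "contract_clauses_v1",
}


def classify_document(text: str, min_hits: int = 1) -> str:
    """Return the document type whose keywords appear most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] >= min_hits else "unknown"


def select_ruleset(doc_type: str) -> str:
    """Pick the extraction ruleset for a type; unrecognized types go to a person."""
    return EXTRACTION_RULESETS.get(doc_type, "manual_review")


doc_type = classify_document("Invoice Number: 881\nBill To: Acme\nAmount Due: $120.00")
ruleset = select_ruleset(doc_type)
```

Returning "unknown" instead of forcing a best guess is a deliberate choice here: a misclassified document silently gets the wrong extraction logic, whereas an unknown one can be sent to a reviewer.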

Confidence Score

A confidence score is a numerical value, typically expressed as a percentage or as a number between 0 and 1, that an AI model assigns to each extracted data point to indicate how certain it is about that result. A score of 97% on an extracted invoice total suggests the model is highly certain; a score of 54% signals that a human should review the result.

Most Document AI platforms allow users to set a confidence threshold below which results are automatically routed for manual review. This mechanism is central to balancing automation speed with data accuracy.
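The threshold mechanism can be sketched as a simple partition of extracted fields. The `ExtractedField` class, the 0.0 to 1.0 score scale, and the default threshold of 0.80 are assumptions chosen for illustration, not values taken from any specific platform.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be on a 0.0-1.0 scale


def split_by_confidence(fields, threshold=0.80):
    """Separate fields that can pass straight through from those needing review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    ExtractedField("invoice_total", "1,250.00", 0.97),
    ExtractedField("due_date", "2025-03-01", 0.54),
]
accepted, needs_review = split_by_confidence(fields)
```

Raising the threshold sends more fields to reviewers and lowers the risk of bad data passing through; lowering it does the opposite. The right value depends on how costly an undetected error is for a given field.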

Data Extraction

Data extraction is the core function of any Document AI system: the automated identification and retrieval of specific fields or values from a document. For an invoice, this might include the vendor name, invoice number, line items, tax amount, and total due.

Extraction can be template-based (relying on fixed field positions) or model-based (using AI to locate fields regardless of layout variation). Model-based extraction is more flexible and is the standard approach in modern IDP platforms.

Model Training

Model training in the Document AI context is the process of teaching an AI system to recognize and extract specific patterns from documents by exposing it to labeled examples. A user uploads sample invoices, manually labels the fields of interest (e.g., "this text is the invoice number"), and the model learns to identify those fields in new, unseen documents.

The quality and quantity of training data directly affect extraction accuracy. Most platforms require a minimum number of labeled examples per document type before a model can be deployed reliably.

How Documents Move Through a Document AI Pipeline

Understanding how documents move through a Document AI system—from initial receipt to final data delivery—is essential for evaluating solutions, communicating with vendors, and diagnosing problems in a live implementation. Teams often pair a high-level glossary like this one with a more implementation-oriented reference such as the LlamaParse developer glossary when translating concepts into actual workflow design.

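The table below maps each workflow stage in sequence, describing what happens at each step, the key activities involved, and the input and output at each stage. Annotation appears in the sequence, but it is a training-time activity that supplies labeled data to models rather than a step every production document passes through.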

| Stage | Definition | Key Activities | Input → Output |
| --- | --- | --- | --- |
| 1. Ingestion | The initial step of receiving and importing documents into a Document AI system from various sources | Connecting to email inboxes, scanners, cloud storage, or APIs; file format detection and routing | Raw document files (PDF, image, email attachment) → Documents queued for processing |
| 2. Preprocessing | Preparation of raw documents for AI processing to improve recognition accuracy | Image enhancement, deskewing, noise removal, resolution normalization, page segmentation | Raw or low-quality document image → Cleaned, optimized image ready for AI analysis |
| 3. Annotation | Manual labeling of document data to create training datasets for AI models | Selecting text regions, assigning field labels, reviewing and correcting labels, exporting labeled datasets | Unlabeled sample documents → Labeled training data |
| 4. Extraction | Automated identification and retrieval of data fields from preprocessed documents | Applying OCR, NER, and parsing models; generating field-value pairs; assigning confidence scores | Preprocessed document → Structured data fields with confidence scores |
| 5. Validation | Verification of extracted data against business rules or expected formats | Checking field formats (e.g., date patterns, numeric ranges), cross-referencing against master data, flagging anomalies | Extracted data fields → Verified data or flagged exceptions for human review |
| 6. Post-Processing | Final formatting, enrichment, and routing of validated data to target systems | Data transformation, deduplication, field mapping, API calls to ERPs or databases | Validated structured data → Delivered records in target system format |

Ingestion

Ingestion is the entry point of the Document AI pipeline. It encompasses every mechanism by which documents enter the system—whether uploaded manually, pulled from a shared drive, received via email, or captured through a scanner integration.

The ingestion layer also handles file format detection and initial routing. A well-designed ingestion setup ensures that documents from diverse sources are normalized into a consistent format before any processing begins.
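One common way to implement file format detection is to inspect the first bytes of a file (its "magic number") rather than trusting the file extension. The sketch below does this for a few formats. The signatures are the standard ones for PDF, PNG, JPEG, and TIFF, while the queue names and the `route_document` helper are hypothetical.

```python
# Leading byte signatures for common document formats.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def detect_format(data: bytes) -> str:
    """Identify a file format from its leading bytes."""
    for signature, fmt in MAGIC_NUMBERS:
        if data.startswith(signature):
            return fmt
    return "unknown"


def route_document(data: bytes) -> str:
    """Choose a processing queue (hypothetical names) based on detected format."""
    fmt = detect_format(data)
    if fmt == "pdf":
        return "pdf_queue"
    if fmt in {"png", "jpeg", "tiff"}:
        return "image_queue"
    return "manual_triage"


queue = route_document(b"%PDF-1.7\n...")
```

Checking content instead of the extension guards against mislabeled uploads, such as a JPEG saved with a `.pdf` name.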

Preprocessing

Preprocessing addresses the reality that real-world documents are rarely perfect. Scanned pages may be skewed, faded, or low-resolution; photographs of documents may have shadows or distortion. Preprocessing applies corrective operations—such as deskewing, contrast enhancement, and noise removal—to improve the accuracy of downstream OCR and extraction.

Skipping or underinvesting in preprocessing is one of the most common causes of poor extraction accuracy. The quality of the preprocessed image directly determines the ceiling of what the AI can reliably extract.
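To give a feel for what contrast enhancement does, here is a toy Python sketch that operates on a grayscale image represented as nested lists of 0 to 255 values. It stretches a faded image to the full brightness range and then binarizes it with a fixed threshold. Both helpers are illustrative assumptions. Real pipelines would normally use an image-processing library and adaptive methods, and would also handle deskewing and noise removal, which this sketch does not attempt.

```python
def stretch_contrast(pixels):
    """Linearly rescale pixel values so the darkest becomes 0 and the brightest 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# A faded 2x2 "scan": every value sits in a narrow mid-gray band.
faded = [[100, 120], [140, 160]]
stretched = stretch_contrast(faded)
clean = binarize(stretched)
```

After stretching, dark and light regions are far enough apart that a simple threshold can separate ink from background, which is the kind of separation OCR engines benefit from.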

Annotation

Annotation is the human-in-the-loop step that makes model training possible. Subject matter experts review sample documents and manually label the data fields the AI should learn to extract—for example, drawing a bounding box around an invoice number and tagging it as "invoice_number."

Because label quality has such a direct effect on model performance, it helps to think in terms of structured annotation for Document AI rather than ad hoc review. Inconsistent or incorrect labels produce models that learn the wrong patterns, leading to systematic extraction errors.
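One way to keep annotation structured is to record each label in a fixed shape and check it against an agreed schema before it reaches training. The sketch below assumes a hypothetical label schema, a hypothetical `Annotation` record, and bounding boxes given as `(x0, y0, x1, y1)` page coordinates. Actual annotation tools define their own formats.

```python
from dataclasses import dataclass

# Hypothetical label schema agreed on before annotation starts.
ALLOWED_LABELS = {"invoice_number", "invoice_date", "total_due"}


@dataclass(frozen=True)
class Annotation:
    document_id: str
    label: str
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def find_label_problems(annotations):
    """Report labels outside the schema and bounding boxes with no area."""
    problems = []
    for ann in annotations:
        x0, y0, x1, y1 = ann.bbox
        if ann.label not in ALLOWED_LABELS:
            problems.append(f"{ann.document_id}: unknown label '{ann.label}'")
        if x1 <= x0 or y1 <= y0:
            problems.append(f"{ann.document_id}: empty or inverted bounding box")
    return problems


annotations = [
    Annotation("doc-001", "invoice_number", "INV-881", (72.0, 90.0, 150.0, 104.0)),
    Annotation("doc-002", "Invoice No", "INV-882", (72.0, 90.0, 150.0, 104.0)),
    Annotation("doc-003", "total_due", "$120.00", (300.0, 500.0, 300.0, 514.0)),
]
problems = find_label_problems(annotations)
```

Catching a stray label such as "Invoice No" at this point is far cheaper than discovering later that a model learned two competing names for the same field.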

Validation

Validation is the quality control step of the pipeline. After extraction, the system checks each extracted value against predefined rules—verifying that a date field contains a valid date, that a total matches the sum of line items, or that a vendor name exists in an approved supplier list.

Results that fail validation rules are flagged for human review rather than passed downstream automatically. This step is what allows organizations to maintain data quality standards while still achieving high levels of automation.
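The three checks described above (a valid date, a total that matches the line items, and an approved vendor) can be sketched as a single function that returns a list of rule failures. The record layout, the vendor list, and the `validate_invoice` helper are assumptions for illustration. An empty list means the record can pass downstream.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical approved supplier list (master data).
APPROVED_VENDORS = {"Acme Supplies", "Northwind Traders"}


def validate_invoice(record: dict) -> list[str]:
    """Return a list of rule failures; an empty list means the record passed."""
    errors = []

    # Rule 1: the date must be a real calendar date in YYYY-MM-DD form.
    try:
        datetime.strptime(record["invoice_date"], "%Y-%m-%d")
    except ValueError:
        errors.append("invoice_date is not a valid YYYY-MM-DD date")

    # Rule 2: the total must equal the sum of line items (Decimal avoids float drift).
    line_sum = sum((Decimal(amount) for amount in record["line_items"]), Decimal("0"))
    if line_sum != Decimal(record["total"]):
        errors.append(f"total {record['total']} does not match line items {line_sum}")

    # Rule 3: the vendor must appear in the approved supplier list.
    if record["vendor"] not in APPROVED_VENDORS:
        errors.append(f"vendor '{record['vendor']}' is not approved")

    return errors


good = {
    "invoice_date": "2025-03-01",
    "line_items": ["100.00", "25.50"],
    "total": "125.50",
    "vendor": "Acme Supplies",
}
bad = {
    "invoice_date": "2025-02-30",
    "line_items": ["100.00", "25.50"],
    "total": "130.00",
    "vendor": "Unknown Co",
}
good_errors = validate_invoice(good)
bad_errors = validate_invoice(bad)
```

Collecting every failure, rather than stopping at the first, gives the human reviewer a complete picture of what needs attention in one pass.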

Post-Processing

Post-processing is the final stage, where validated data is converted into the format required by the target system and delivered to its destination. This may involve mapping extracted field names to database column names, converting date formats, triggering API calls to an ERP, or generating a structured output file.

Post-processing is often underestimated in implementation planning. Even when extraction and validation work correctly, data that arrives in the wrong format or through the wrong integration pathway can cause downstream failures.
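A minimal sketch of field mapping and date conversion is shown below. The extracted field names, the target column names in `FIELD_MAP`, and the assumption that source dates arrive as MM/DD/YYYY are all hypothetical. The point is only that the shape of the data changes between extraction and delivery.

```python
import json
from datetime import datetime

# Hypothetical mapping from extracted field names to target database columns.
FIELD_MAP = {
    "Invoice Number": "invoice_no",
    "Invoice Date": "invoice_date",
    "Total Due": "total_amount",
}


def to_target_record(extracted: dict) -> dict:
    """Rename mapped fields, drop unmapped ones, and normalize the date to ISO 8601."""
    record = {FIELD_MAP[key]: value for key, value in extracted.items() if key in FIELD_MAP}
    if "invoice_date" in record:
        # Assumes the source document uses MM/DD/YYYY.
        parsed = datetime.strptime(record["invoice_date"], "%m/%d/%Y")
        record["invoice_date"] = parsed.date().isoformat()
    return record


extracted = {
    "Invoice Number": "INV-881",
    "Invoice Date": "03/01/2025",
    "Total Due": "1250.00",
    "Notes": "Deliver to rear entrance",
}
record = to_target_record(extracted)
payload = json.dumps(record, sort_keys=True)  # structured output for an API call or file
```

Keeping the mapping in one table makes it easy to review against the target system's schema, which is where format mismatches tend to surface.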

Final Thoughts

The terminology covered in this article—from foundational concepts like OCR, IDP, and document parsing, to AI/ML techniques like NER and confidence scoring, to pipeline stages like ingestion, validation, and post-processing—forms the working vocabulary of Document AI. Familiarity with these terms enables clearer communication with vendors, more accurate evaluation of platform capabilities, and faster diagnosis of issues in live implementations. Understanding how each concept connects to the others, particularly how workflow stages depend sequentially on one another, is what turns a list of definitions into a practical mental model for working with these systems.

LlamaParse delivers VLM-powered agentic OCR that goes beyond simple text extraction, boasting industry-leading accuracy on complex documents without custom training. By leveraging advanced reasoning from large language and vision models, its agentic OCR engine intelligently understands layouts, interprets embedded charts, images, and tables, and enables self-correction loops for higher straight-through processing rates over legacy solutions. LlamaParse employs a team of specialized document understanding agents working together for unrivaled accuracy in real-world document intelligence, outputting structured Markdown, JSON, or HTML. It's free to try today and gives you 10,000 free credits upon signup.
