
PDF Parsing Supported File Types

PDF parsing is a deceptively complex process. While all PDF files share the .pdf extension, their internal structure varies enormously — and those structural differences determine whether a parser can extract content reliably, needs additional processing steps, or fails entirely. Teams evaluating top document parsing APIs often find that parser performance depends less on the file extension and more on the document’s internal structure, encoding, and layout.

That is why choosing the right PDF parsing API matters before building a production workflow. Understanding which file types a PDF parser supports, and why certain formats behave differently, is essential for avoiding unexpected errors and improving processing consistency at scale.

PDF Format Subtypes and Version Compatibility

PDF parsers do not treat all .pdf files equally. The format covers a broad family of subtypes, specification versions, and content types — each with distinct structural characteristics that affect how content is extracted. Before processing begins, it helps to review the supported document types and understand how modern PDF parsing workflows handle archival, engineering, and form-based variants differently.

The following table provides a quick-reference summary of commonly supported PDF formats, their subtypes, version compatibility, and text extraction method.

| File Type / Format | Subtype or Variant | Supported Version Range | Text Extraction Method | Support Status | Notes / Caveats |
|---|---|---|---|---|---|
| Standard PDF (text-based) |  | PDF 1.0–2.0 | Native text extraction | Fully Supported | Most reliable parsing output |
| PDF/A (Archival) | PDF/A-1a, PDF/A-1b | PDF 1.4+ | Native text extraction | Fully Supported | Strict compliance aids consistency |
| PDF/A (Archival) | PDF/A-2a, PDF/A-2b, PDF/A-2u | PDF 1.7+ | Native text extraction | Fully Supported | Supports embedded files |
| PDF/A (Archival) | PDF/A-3a, PDF/A-3b, PDF/A-3u | PDF 1.7+ | Native text extraction | Fully Supported | Allows arbitrary embedded files |
| PDF/E (Engineering) | PDF/E-1 | PDF 1.6+ | Native text extraction | Fully Supported | Common in CAD/technical drawings |
| Fillable PDF / AcroForm | Standard AcroForm | PDF 1.2+ | Native field extraction | Fully Supported | Form field data extracted separately from body text |
| XFA-Based Forms | Static XFA, Dynamic XFA | PDF 1.5+ | Native or structured extraction | Supported with Limitations | Dynamic XFA may produce incomplete output |
| Scanned / Image-Based PDF |  | PDF 1.0–2.0 | OCR required | Requires Additional Step | Accuracy depends on scan quality and resolution |
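
To check which specification version a file declares before submitting it, you can read the `%PDF-x.y` header at the start of the file. The following is a minimal sketch in Python using only the standard library; the function names `pdf_header_version` and `meets_minimum` are illustrative, not part of any parser's API. Note that the header is only the declared version: a document's catalog can override it, so treat the result as a first-pass hint.

```python
from __future__ import annotations

import re

# The version header normally sits at the very start of the file,
# for example b"%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes) -> tuple[int, int] | None:
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # Some producers write a few stray bytes before the header, so search
    # a small window instead of requiring the header at offset 0.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(data: bytes, minimum: tuple[int, int]) -> bool:
    """True if the declared version is at least `minimum`, e.g. (1, 4)."""
    version = pdf_header_version(data)
    return version is not None and version >= minimum
```

In practice you would pass the first kilobyte of the file, for example `meets_minimum(open(path, "rb").read(1024), (1, 4))`, and compare the result against the version range in the table above.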

Native Text-Based vs. Image-Based PDFs

A critical distinction exists between two broad categories of PDF content.

Native/text-based PDFs contain an embedded text layer created directly by software such as a word processor or design application. Text can be extracted programmatically without any image interpretation.

Image-based/scanned PDFs contain pages rendered as raster images. There is no embedded text layer — the content exists only as pixel data, and OCR processing is required to extract readable text. This distinction is central to accurate PDF text extraction, because extraction quality depends first on whether the parser is working with encoded text or only visual page content.

Both types use the .pdf extension and appear visually identical to the end user, which is a frequent source of confusion. Identifying which type you are working with before parsing is an important first step in any document processing workflow.
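
One way to identify the type programmatically is to look at how much text each page yields when extracted without OCR. The sketch below assumes you have already obtained one string per page from whatever text extraction library you use; the function name `classify_pdf` and the `min_chars` threshold of 20 are assumptions chosen for illustration, not standard values.

```python
from __future__ import annotations


def classify_pdf(page_texts: list[str], min_chars: int = 20) -> str:
    """Classify a document as 'native', 'scanned', 'mixed', or 'unknown'.

    page_texts holds the text that a non-OCR extraction step returned for
    each page. min_chars is an assumed cutoff below which a page is treated
    as having no usable text layer.
    """
    if not page_texts:
        return "unknown"  # no pages to inspect

    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    if pages_with_text == len(page_texts):
        return "native"
    if pages_with_text == 0:
        return "scanned"
    # Some pages have text and others do not, e.g. a digital report with
    # scanned appendices.
    return "mixed"
```

A "mixed" result is worth handling explicitly, since documents that combine digital pages with scanned inserts are common and may need OCR on only some pages.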

Scanned PDFs and OCR Requirements

Scanned PDFs represent one of the most common challenges in document parsing. Because their content is stored as images rather than text, they require a fundamentally different processing approach, and understanding this distinction helps set accurate expectations for output quality and processing time. Image-based files often need layout understanding in addition to basic character recognition, which is why PDF parsing increasingly has to go beyond OCR alone.

The following table compares native text-based PDFs and scanned image-based PDFs across the dimensions most relevant to parsing workflows.

| Characteristic | Native / Text-Based PDF | Scanned / Image-Based PDF |
|---|---|---|
| How text is stored | Embedded as selectable, machine-readable text | Rendered as raster image pixels |
| OCR required | No | Yes |
| Where OCR processing occurs | N/A | Built-in OCR engine or external preprocessing step |
| Parsing speed | Faster: direct text extraction | Slower: image analysis adds processing time |
| Output accuracy | High: text is extracted exactly as encoded | Variable: depends on scan resolution, image quality, and font clarity |
| Formatting / layout preservation | Generally reliable | Limited: complex layouts (columns, tables) may not reconstruct accurately |
| Typical use cases | Digitally created reports, contracts, forms, presentations | Printed documents that have been physically scanned, legacy archives, faxed records |

When OCR Processing Is Necessary

OCR becomes necessary any time the PDF does not contain a selectable text layer. Common scenarios include:

  • Documents scanned from physical paper using a flatbed or multifunction scanner
  • Faxed documents saved as PDFs
  • Legacy documents digitized from microfilm or microfiche
  • PDFs exported from image editing software without text embedding

Some parsers include a built-in OCR engine that activates automatically when no text layer is detected. Others require OCR to be explicitly enabled as a separate processing step, or performed as a preprocessing stage before the file is submitted for parsing. If you are building this into an application, for example when parsing PDF files in TypeScript, a practical approach is to check for an embedded text layer first and then decide whether OCR should run.
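
The three OCR arrangements described above lead to slightly different pipelines. The branching is language-agnostic; here it is sketched in Python. The mode labels (`automatic`, `explicit`, `external`) and step names are hypothetical placeholders for whatever your parser and tooling actually expose.

```python
from __future__ import annotations

# Hypothetical labels for how a parser exposes OCR.
OCR_MODES = {"automatic", "explicit", "external"}


def ocr_plan(has_text_layer: bool, ocr_mode: str) -> list[str]:
    """Return the ordered processing steps for one file."""
    if ocr_mode not in OCR_MODES:
        raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")

    if has_text_layer:
        # Native PDFs can be parsed directly; no OCR step is needed.
        return ["parse"]
    if ocr_mode == "automatic":
        # Built-in OCR activates on its own when no text layer is found.
        return ["parse"]
    if ocr_mode == "explicit":
        # OCR exists in the parser but must be switched on first.
        return ["enable_ocr", "parse"]
    # 'external': run a separate OCR tool before submitting the file.
    return ["run_external_ocr", "parse"]
```

Keeping this decision in one small function makes it easier to log why a given file was routed through OCR, which helps when diagnosing slow or empty results later.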

OCR Accuracy and Formatting Constraints

Even with OCR enabled, scanned PDFs carry inherent accuracy constraints that text-based PDFs do not:

  • Scan quality directly affects output. Low-resolution scans (below 150 DPI), skewed pages, poor contrast, or handwritten content will reduce OCR accuracy.
  • Complex layouts are harder to reconstruct. Multi-column text, tables, headers, and footers may not parse into correctly ordered or structured output.
  • Special characters and non-Latin scripts may be misrecognized depending on the OCR engine's language support.
  • Formatting metadata (font size, bold, italic) is generally not recoverable from scanned content.

For best results with scanned PDFs, use source files scanned at 300 DPI or higher, with straight page alignment and high contrast between text and background. For more demanding documents, methods focused on extracting sections, headings, paragraphs, and tables are especially useful because they address structural reconstruction, not just character recognition.
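
The resolution thresholds above can be checked before OCR runs if you know the pixel width of the page image and the page width in PDF points (PDF user space defaults to 72 points per inch). The following Python sketch is illustrative only: the function names are made up for this example, and the "usable" label for the 150 to 300 DPI band is an assumption, since the guidance above only names the lower bound and the recommended value.

```python
def effective_dpi(pixel_width: int, page_width_points: float) -> float:
    """Resolution of a page image, from its pixel width and page width.

    Assumes the default PDF user space of 72 points per inch.
    """
    if page_width_points <= 0:
        raise ValueError("page width must be positive")
    page_width_inches = page_width_points / 72.0
    return pixel_width / page_width_inches


def scan_quality(dpi: float) -> str:
    """Bucket a resolution using the thresholds discussed above."""
    if dpi < 150:
        return "low"  # OCR accuracy is likely to suffer
    if dpi < 300:
        return "usable"  # assumed label for the in-between band
    return "recommended"
```

For example, a US Letter page is 612 points (8.5 inches) wide, so a scan that is 2550 pixels wide works out to 300 DPI, while one that is 1020 pixels wide is only 120 DPI.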

Unsupported or Problematic PDF Conditions

Not all PDF files can be parsed successfully, even when they appear structurally valid. Certain conditions — whether intentional security measures or file integrity issues — prevent parsers from accessing or interpreting content. The table below catalogs the most common unsupported conditions, explains why they cause failures, and provides resolution guidance.

| File Type / Condition | Why Parsing Fails | Expected Error or Behavior | Recommended Resolution / Workaround | Resolvable by User? |
|---|---|---|---|---|
| Password-Protected PDF (open password) | Encryption blocks all content access until the correct password is supplied | Error message; no content extracted | Supply the password if the parser supports it, or decrypt the file before uploading using the document owner's credentials | Yes |
| Owner-Restricted PDF (permissions lock) | Permissions flags restrict copying or extraction even without an open password | Partial or empty extraction; no error in some parsers | Request an unrestricted version from the document owner or re-export from the source application | Partial |
| DRM-Encrypted PDF | Digital rights management encryption prevents any programmatic content access | Error message or complete extraction failure | DRM cannot be removed by the user; request a DRM-free version from the content provider | No |
| Corrupted or Damaged PDF | File structure is incomplete or unreadable due to incomplete download, storage errors, or file transfer corruption | Parser error; crash or empty output | Re-download or re-export the file from its original source; validate file integrity before resubmitting | Yes |
| Non-Standard or Malformed PDF | File does not conform to the PDF specification; internal structure is invalid or uses unsupported proprietary extensions | Unpredictable output; partial extraction or error | Re-export from the source application using standard PDF export settings; avoid third-party PDF converters that produce non-compliant output | Yes |
| PDF with Embedded Multimedia Only | Pages contain only video, audio, or interactive elements with no text or image layer | Empty output; no extractable content | Extract any accompanying text assets separately; multimedia content cannot be parsed as text | No |
| Very Early or Non-Compliant PDF Versions | Pre-1.0 or proprietary pre-standard formats may use structures not recognized by modern parsers | Error or empty output | Convert to a standard PDF version (1.4 or later) using a compliant PDF application | Yes |
| Scanned PDF without OCR Enabled | Image-only pages contain no text layer; parser has no text to extract | Empty output with no error in some parsers | Enable OCR processing in the parser settings, or preprocess the file through an OCR tool before submission | Yes |

How to Resolve Common Parsing Failures

When a file falls into one of the problematic categories above, these general approaches apply:

  1. Re-export from the source application. If you have access to the original document (Word, InDesign, AutoCAD, etc.), export a fresh PDF using standard settings. This resolves most malformed and non-compliant file issues.
  2. Remove security restrictions before parsing. Password protection and permissions locks must be addressed at the file level before submission. Most PDF editors allow authorized users to remove these restrictions.
  3. Validate file integrity. Use a PDF validation tool — such as Adobe Acrobat's preflight feature or an online PDF validator — to confirm the file is structurally sound before attempting to parse it.
  4. Enable OCR for image-based files. If your parser supports OCR, ensure it is activated for scanned documents. If not, preprocess the file through a dedicated OCR application first.
  5. Contact the document owner for DRM-restricted files. DRM encryption is a hard limitation that cannot be resolved through parser configuration — the restriction must be lifted at the source.
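
Some of these problems can be caught with cheap byte-level checks before a file is ever submitted. The Python sketch below is a rough heuristic, not a substitute for a real PDF validator: it looks for the `%PDF-` header, the `%%EOF` end marker, and an `/Encrypt` entry, which PDF uses to signal encryption. The function name `preflight` and the problem labels are made up for this example, and the `/Encrypt` search can produce false positives if that byte sequence happens to appear in uncompressed content.

```python
from __future__ import annotations


def preflight(data: bytes) -> list[str]:
    """Return labels for likely problems found by byte-level checks."""
    problems = []

    # A PDF starts with a "%PDF-" header; allow a little leading junk.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing_header")

    # A complete file ends with an "%%EOF" marker. A truncated download
    # or interrupted transfer usually lacks it.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing_eof_marker")

    # An /Encrypt entry indicates an encrypted file. This check cannot
    # tell an open password apart from a permissions-only restriction.
    if b"/Encrypt" in data:
        problems.append("encrypted")

    return problems
```

An empty list does not prove the file is valid; it only means none of these three quick signals fired. Files that pass can still be malformed internally, so a full validation tool remains the more reliable check.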

These edge cases are also a useful benchmark when comparing the best document parsing software, since real-world reliability depends on how well a system handles damaged, restricted, and image-heavy files rather than only clean sample inputs.

Final Thoughts

Understanding the full range of PDF file types — from standard text-based documents and archival PDF/A formats to scanned image files and encrypted PDFs — is foundational to building reliable document processing workflows. The distinction between native and scanned PDFs is particularly important, as it determines whether OCR processing is required and directly affects output accuracy and formatting fidelity. In environments with stricter data-handling requirements, some teams also evaluate local document parsing as part of their broader preprocessing strategy.

LlamaParse delivers VLM-powered agentic OCR that goes beyond simple text extraction, with industry-leading accuracy on complex documents without custom training. By leveraging advanced reasoning from large language and vision models, its agentic OCR engine intelligently understands layouts, interprets embedded charts, images, and tables, and enables self-correction loops for higher straight-through processing rates than legacy solutions. LlamaParse uses specialized document understanding agents working together for strong real-world document intelligence, outputting structured Markdown, JSON, or HTML. It’s free to try today and includes 10,000 free credits upon signup.
