Parsing the Unreadable: How LlamaParse Handles Legal Discovery Documents

If you've ever worked with legal documents, you might know that during litigation, one of the most time-consuming phases is called discovery (thank you Suits for making it so that I know what this means)! This is the process where both sides in a lawsuit are required to hand over relevant documents to each other. In practice, this means lawyers sifting through tens of thousands, sometimes hundreds of thousands, of files looking for the pieces of evidence that matter, a burden the U.S. federal court system itself has heard described as a "nightmare" and a "morass".

To make this manageable, legal firms rely on dedicated eDiscovery platforms like Relativity, Everlaw, and DISCO. These tools are built for exactly this workflow: ingesting large document productions, indexing them, and letting legal teams search, tag, and filter down to the documents that actually matter for a case.

The problem is that for any of that to work well, the documents need to be parsed correctly first. And the documents they're handed are, often (and often by design), hard to work with.

Discovery Documents Are Difficult to Parse

When documents are produced during discovery, the other side doesn't exactly go out of their way to make them easy to read. Files are typically scanned, not exported as native PDFs. Scans come in at low resolution, in black and white, and at arbitrary rotations. The receiving party gets a flat image that's technically a PDF, but contains very little of the structured information you'd hope to extract from it.

The result is a mountain of documents that are nominally searchable but practically aren't. Traditional OCR tools struggle at low resolutions. When OCR does extract text, spacing errors are common: the letters might be there, but "settlement" comes out as "s ettl em ent" and your regex query finds nothing. Semantic search doesn't exist in most of the older systems legal firms rely on today. So lawyers end up writing regex queries to run against the document set. They get back a list of results (if the OCR cooperated at all) and work from there.

This is slow, fragile, and misses an enormous category of content entirely: anything visual.
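
To make the spacing problem concrete, here's a quick standalone illustration in plain Python (not tied to any particular eDiscovery platform): a literal pattern misses the broken-up word, while a whitespace-tolerant pattern still finds it, at the cost of being slower and noisier at scale.

```python
import re

ocr_text = "The parties reached a s ettl em ent agreement in March."

# A literal search misses the keyword because OCR broke its spacing.
print(re.search(r"settlement", ocr_text))  # None

# Allow optional whitespace between every character of the keyword.
tolerant = re.compile(r"\s*".join("settlement"))
print(tolerant.search(ocr_text).group(0))  # "s ettl em ent"
```

This kind of workaround is what legal teams resort to today, and it only papers over the text problem; it does nothing for photos, charts, or handwriting.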

The Documents Aren't Just Text

Consider what a discovery production actually contains. Yes, there are emails and memos. But there are also:

  • Photographs (potentially of people, places, or evidence of physical harm)
  • PowerPoint presentations with embedded charts and graphics
  • Tables buried in scanned reports
  • Handwritten annotations on printed documents

None of these are handled well by text-based search. If you're looking for evidence that someone misrepresented data in a slide deck, no regex query is going to surface that chart for you. If you need to go through all documents that contain photographs of a specific person, you'd need someone to manually tag every document that has a photo in it before you can even begin filtering.

This is where the parser matters at the foundation. If you're building a search or classification system on top of discovery documents, what you extract at ingestion time determines everything about what you can find later.

What LlamaParse Brings to This Problem

LlamaParse is a document parsing tool built specifically to handle the kinds of documents that break simpler tools. It uses multimodal models under the hood, which means it doesn't just extract text. It understands the visual layout of a page (see the "items" output in the API if you're interested), can describe images and charts, and handles the structural complexity of tables and mixed-content documents.

For legal discovery, this unlocks a few things that traditional OCR pipelines can't offer.

First, it handles low-quality scans significantly better. LlamaParse uses vision models to interpret page content rather than relying purely on pixel-level text recognition. A page that comes in blurry, skewed, or at low DPI can still yield usable, structured output.

Second, it preserves and surfaces visual content. When a page contains a photograph, LlamaParse can describe what's in it. When a page contains a chart, it can extract the data or describe what the chart represents. This is the difference between a document being invisible to your search system and being fully indexed.

Third, you can guide its behavior with custom parsing instructions. Discovery documents often follow predictable patterns: case numbers in headers, specific formatting for deposition exhibits, certain kinds of tables. You can tell LlamaParse exactly what to look for and how to structure the output.

Setting Up LlamaParse for a Discovery Document Pipeline

Let me walk through how you'd actually configure this. Start by installing the llama-cloud package and setting your API key, which you can get from cloud.llamaindex.ai:

```bash
pip install llama-cloud
```

```python
import os
from llama_cloud import AsyncLlamaCloud

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

client = AsyncLlamaCloud()
```


The LlamaParse API works in two steps: you first upload the file, then kick off a parse job. The expand parameter tells LlamaParse which output views to return. For a discovery pipeline, you'll typically want "markdown" for LLM-friendly structured text, "text" for plain page-by-page content, and "items" when you need the structured layout tree (useful for detecting tables and figures):

```python
# Upload the document
file_obj = await client.files.create(
    file="./discovery_batch/doc_001.pdf",
    purpose="parse",
)

# Parse it
result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic",
    version="latest",
    expand=["markdown", "text", "items"],
)

# Access the output
for page in result.markdown.pages:
    print(page.markdown)
```


For discovery documents specifically, you'll almost always want to step up to tier="agentic_plus". The higher tier is optimized for complex layouts and visual content, and the high-res OCR we use for all our tiers makes a meaningful difference on degraded scans:

```python
result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic_plus",
    version="latest",
    expand=["markdown", "text", "items"],
)
```
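
Once a parse comes back, the "items" view is what lets you act on visual content programmatically. The snippet below sketches the idea against a hand-built payload; the field names here (type, page, description) are assumptions for illustration, so check the actual expand=["items"] schema in the API response rather than treating this as the literal shape:

```python
# Hypothetical items payload; the real expand=["items"] output may differ.
items = [
    {"type": "text", "page": 1, "content": "RE: Q3 projections"},
    {"type": "table", "page": 2, "content": "| Quarter | Revenue |"},
    {"type": "figure", "page": 3, "description": "Photograph of two people at a job site"},
]

# Flag pages that carry visual content and need closer review.
visual_pages = sorted(
    {item["page"] for item in items if item["type"] in ("table", "figure")}
)
print(visual_pages)  # [2, 3]

# Route photographs to a dedicated review queue.
photo_items = [
    item for item in items
    if item["type"] == "figure"
    and "photograph" in item.get("description", "").lower()
]
```

This is the step that turns "document 47,823 contains a photo" from something a human has to notice into something a pipeline can filter on.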

Now, the feature I'd point you toward is custom_prompt. This lets you provide natural language guidance about what kinds of documents you're dealing with and what matters most in the output. For legal discovery, something like this goes a long way:

```python
parsing_instruction = """
These are legal discovery documents produced during litigation.
They may be scanned at low resolution and appear in black and white.

For each document, please:
- Extract all visible text, correcting for common OCR artifacts like broken spacing
- Identify and describe any photographs, noting whether they contain images of people
- Extract data from any tables or charts, including chart titles and axis labels
- Note the presence of handwritten annotations separately from printed text
- Preserve any visible case numbers, bates numbers, or exhibit markers
"""

result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic_plus",
    version="latest",
    expand=["markdown", "text", "items"],
    output_options={
        "markdown": {
            "tables": {"output_tables_as_markdown": True}
        }
    },
    agentic_options={
        "custom_prompt": parsing_instruction
    },
)

for page in result.markdown.pages:
    print(page.markdown[:500])
```


The custom_prompt field accepts plain English. You're essentially briefing the model on what it's looking at and what to pay attention to, the same way you'd brief a junior associate before handing them a box of files.
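
One practical note: a discovery production is never a single file, so you'll be fanning the upload-and-parse calls above across thousands of documents. Here's a minimal, generic asyncio sketch for capping concurrency (this helper is my own illustration, not part of the LlamaParse API; the parse_one name in the usage comment is a hypothetical wrapper around the calls shown earlier):

```python
import asyncio

async def gather_limited(coros, limit=8):
    # Cap in-flight requests so a large batch doesn't flood the API.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order, so results line up with the batch.
    return await asyncio.gather(*(run(c) for c in coros))

# Usage against the client calls shown above (sketch):
# results = await gather_limited(
#     (parse_one(path) for path in paths), limit=8
# )
```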

The Downstream Difference Good Parsing Makes

It's worth being direct about something: no parsing tool makes discovery easy. The volume of documents involved in major litigation is huge, and even with good tooling, review is painstaking work. What better parsing does is reduce the number of relevant documents that fall through the cracks.

If your search index is built on extracted text that's full of OCR errors, semantic search will only go so far. Your embeddings will be noisy and your recall will suffer. If your classification system has never seen the photograph in document 47,823, it can't tell you whether that photograph is relevant. The quality of everything downstream depends on what happened at the point of ingestion.

LlamaParse is most valuable here as the foundation layer. You're not asking it to do legal reasoning. You're asking it to make documents legible and structured enough that the systems built on top of it can do their jobs.

Getting Started

If you want to try this out with your own documents, you can sign up for LlamaParse and get a free tier of pages to experiment with. The LlamaParse documentation covers the full API, including all the parsing tiers and output options.
