Building a Financial Document Pipeline with LlamaParse

Loan underwriting requires pulling data from multiple financial documents, often pay stubs and brokerage statements, whose layouts vary widely across providers. It is a core financial workflow that typically involves heavy manual checking and repetitive processes.

Last week I (Logan, Head of OSS at LlamaIndex) ran a hands-on workshop in NYC where developers built a loan underwriting pipeline from scratch using LlamaParse tools. The resulting application was able to take messy financial PDFs (pay stubs, brokerage statements), extract structured data, and run cross-document analysis.

We wanted to build a pipeline that:

  1. Parses PDFs into clean markdown using LlamaParse's agentic tier
  2. Extracts structured fields (employer name, gross pay, holdings, account values) into typed Pydantic models
  3. Analyzes data across documents to produce an underwriting summary with discrepancy flags
  4. Reviews the analysis with a human-in-the-loop approval step

This post walks through what we built and how you can try it yourself.

The Workshop Tech Stack

The stack is intentionally simple for a workshop setting. Using a combination of async Python, SQLite, FastAPI, Pydantic, and the LlamaCloud SDK, we built a fully async pipeline with in-memory job queues.

While the tech stack is simple, the architecture is designed to be extensible. You (or your coding agent) can easily swap out components as needed. That could mean swapping in Celery or Temporal for the job queues, Postgres for the database, and S3 for local file storage.
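
To make the "swap out components" idea concrete, here is a minimal sketch of an in-memory job queue hidden behind a small interface. The names `JobQueue` and `InMemoryJobQueue` are hypothetical and are not taken from the workshop repo; the point is only that a Celery- or Temporal-backed class could expose the same two methods.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class JobQueue(Protocol):
    # Hypothetical interface: any backend exposing these methods could be swapped in.
    async def enqueue(self, job_id: str) -> None: ...
    async def run_worker(self, handler: Callable[[str], Awaitable[None]]) -> None: ...


class InMemoryJobQueue:
    """Workshop-style queue: jobs live in process memory and are lost on restart."""

    def __init__(self) -> None:
        self._queue = asyncio.Queue()

    async def enqueue(self, job_id: str) -> None:
        await self._queue.put(job_id)

    async def run_worker(self, handler: Callable[[str], Awaitable[None]]) -> None:
        # Process jobs one at a time until the surrounding task is cancelled.
        while True:
            job_id = await self._queue.get()
            try:
                await handler(job_id)
            finally:
                self._queue.task_done()

    async def join(self) -> None:
        # Wait until every enqueued job has been handled.
        await self._queue.join()
```

Because the rest of the application would only depend on `enqueue` and `run_worker`, replacing the in-memory class later should not require touching the services themselves.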

LlamaParse Three Ways

The workshop had attendees implement three service files, each using LlamaParse in a different way.

1. Parsing: PDF to Markdown

The first service uploads a PDF and gets back clean markdown. This is LlamaParse's core capability: its agentic parsing tier handles the messy table layouts and formatting inconsistencies across payroll providers and brokerages.

python

import asyncio
from llama_cloud import AsyncLlamaCloud

client = AsyncLlamaCloud(api_key=settings.llama_cloud_api_key)

file_obj = await client.files.create(file=file_path, purpose="parse")
job = await client.parsing.create(file_id=file_obj.id, tier="agentic", version="latest")

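# Poll until the job reaches a terminal status
result = await client.parsing.get(job.id, expand=["markdown_full"])
while result.job.status not in ("COMPLETED", "FAILED", "CANCELLED"):
    await asyncio.sleep(3)
    result = await client.parsing.get(job.id, expand=["markdown_full"])

# Only read the markdown if the job actually succeeded
if result.job.status != "COMPLETED":
    raise RuntimeError(f"Parsing job {job.id} ended with status {result.job.status}")

parsed_markdown = result.markdown_full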

Three API calls: upload the file, create a job, poll for the result. The markdown that comes back preserves table structure, which is critical for the next step.
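
The polling loop above runs until the job reaches a terminal status, so a stuck job would block forever. One way to bound it is a small generic helper with a timeout. This is a sketch, not part of the LlamaCloud SDK: `poll_until_done`, `fetch`, and `get_status` are hypothetical names, and the default interval and timeout are arbitrary.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

# Terminal statuses taken from the polling loop shown above.
TERMINAL_STATUSES = ("COMPLETED", "FAILED", "CANCELLED")


async def poll_until_done(
    fetch: Callable[[], Awaitable[Any]],
    get_status: Callable[[Any], str],
    interval: float = 3.0,
    timeout: float = 300.0,
) -> Any:
    """Call `fetch` until the status is terminal or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        result = await fetch()
        status = get_status(result)
        if status in TERMINAL_STATUSES:
            if status != "COMPLETED":
                raise RuntimeError(f"job ended with status {status}")
            return result
        # Give up rather than sleeping past the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        await asyncio.sleep(interval)
```

With the parsing calls from this section, it could be used as `await poll_until_done(lambda: client.parsing.get(job.id, expand=["markdown_full"]), lambda r: r.job.status)`.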

2. Extraction: Markdown to Structured Data

The second service takes a parsed document and extracts typed fields using a Pydantic schema. You define what you want, and LlamaParse pulls it out.

For example, the pay stub schema:

python

from pydantic import BaseModel

class PayStub(BaseModel):
    employer_name: str
    employee_name: str
    pay_period_start: str
    pay_period_end: str
    gross_pay: float
    net_pay: float
    ytd_gross_income: float
    deductions: list[Deduction]  # Deduction is a separate model defined elsewhere (not shown here)

The extraction call passes the schema as JSON Schema to LlamaParse:

python

job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": PayStub.model_json_schema(),
    }
)

Similar to parsing, you upload a file, use the file ID, and then poll for the result.

Once the job completes, you can validate the result against your schema using PayStub.model_validate(result.extract_result).
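
As a sketch of that validation step, the snippet below validates a plain dictionary and reports which fields failed. The `Deduction` fields (`name`, `amount`) are assumptions, since the post does not show that model, and `PayStubSubset` is a trimmed, hypothetical copy of the schema above. It assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Deduction(BaseModel):
    # Hypothetical fields: the real Deduction model is not shown in the post.
    name: str
    amount: float


class PayStubSubset(BaseModel):
    # Trimmed copy of the PayStub schema, kept short for illustration.
    employer_name: str
    gross_pay: float
    net_pay: float
    deductions: list[Deduction]


def validate_pay_stub(payload: dict) -> Optional[PayStubSubset]:
    """Return a typed model, or None if the payload does not match the schema."""
    try:
        return PayStubSubset.model_validate(payload)
    except ValidationError as exc:
        # exc.errors() lists each offending field with its location and message.
        for err in exc.errors():
            print(f"{err['loc']}: {err['msg']}")
        return None
```

Returning `None` is only one option; a pipeline might instead mark the document as needing manual review when validation fails.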

3. Cross-Document Analysis

The third service is the most interesting. It takes the extracted data from multiple documents (say, a pay stub and a brokerage statement), combines them into a text buffer, uploads that buffer to LlamaParse, and runs extraction again. This time, though, it uses an underwriting summary schema that looks across all of the documents and calls for reasoning rather than pure extraction.

python

import io

# Combine extracted data into a text document
text = _format_extractions_as_text(extracted_data)

# Upload as a buffer file
file_obj = await client.files.create(
    file=(f"review_{review_id}.txt", io.BytesIO(text.encode("utf-8"))),
    purpose="extract",
)

# Extract with the cross-document schema
job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": UnderwritingSummary.model_json_schema(),
        "system_prompt": "You are a loan underwriter. Analyze ...",
    }
)
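
The `_format_extractions_as_text` helper is not shown in this post. A minimal sketch of what such a helper might look like follows, assuming `extracted_data` is a dictionary that maps a document label to that document's extracted fields; both the input shape and the section layout are assumptions, not the repo's actual implementation.

```python
import json


def format_extractions_as_text(extracted_data: dict[str, dict]) -> str:
    """Render each document's extracted fields as a labeled JSON section."""
    sections = []
    for label, fields in extracted_data.items():
        # Pretty-printed JSON keeps field names and values easy to read back.
        body = json.dumps(fields, indent=2, sort_keys=True)
        sections.append(f"=== {label} ===\n{body}")
    return "\n\n".join(sections)
```

Labeling each section with its source document gives the second extraction pass a way to tell which values came from which file.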

The underwriting summary schema asks for verified income, total liquid assets, months of reserves, and a list of discrepancies with severity ratings. By setting the system prompt, we can explicitly prompt the service to perform the kind of analysis we want across the documents, rather than just pulling out fields. This is where business-specific knowledge can be injected into the pipeline to produce more actionable outputs.
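
A schema along those lines could be sketched as follows. The field names, the `Discrepancy` model, and the three severity levels are assumptions based on the description above, not the schema used in the workshop repo. It assumes Pydantic v2.

```python
from typing import Literal

from pydantic import BaseModel, Field


class Discrepancy(BaseModel):
    # Hypothetical shape for a single cross-document finding.
    description: str = Field(description="What does not match across the documents")
    severity: Literal["low", "medium", "high"]


class UnderwritingSummary(BaseModel):
    # Field names are illustrative; they mirror the items listed in the text.
    verified_income: float = Field(description="Income supported by the documents")
    total_liquid_assets: float
    months_of_reserves: float
    discrepancies: list[Discrepancy]


# The generated JSON Schema carries the field descriptions along with the types.
schema = UnderwritingSummary.model_json_schema()
```

The `description` values end up in the generated JSON Schema, so they are one more place to state what each field should contain.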

Try It Yourself

First, grab an API key from LlamaCloud if you don't have one already.

The repo is set up so you can implement each service incrementally. Each phase has a branch with the TODO stubs filled in:

| Branch  | What's implemented                          |
|---------|---------------------------------------------|
| main    | Starting point with 3 services to implement |
| phase_1 | Parser service (PDF to markdown)            |
| phase_2 | + Extraction service (structured data)      |
| phase_3 | + Review service (cross-document analysis)  |

To get started:

bash

git clone https://github.com/logan-markewich/finparse-pipeline && cd finparse-pipeline
git checkout main  # start with the TODO stubs; the phase_N branches have them filled in
uv sync --group dev
cp .env.example .env  # add your LLAMA_CLOUD_API_KEY
uv run fastapi dev app/main.py

The Swagger UI at http://localhost:8000/docs lets you drive the whole flow: upload a PDF, poll for parsing, trigger extraction, create a review.
