Markdown Comes to LiteParse

A few weeks ago, we launched LiteParse 2.0 as the fastest tool for converting PDFs to text. However, a few questions kept coming up again and again: Where are the benchmarks? Does it output markdown?

LiteParse v2.1 answers both by delivering the fastest open-source, model-free PDF-to-markdown pipeline. We measured it on three standard benchmarks and achieved the top overall score on each among model-free approaches: opendataloader-bench at 0.875, olmOCR-bench at 0.391, and ParseBench at 0.3279.

Visit the demo site (running in-browser with WASM) or install the latest version today!

bash

$ pip install liteparse
$ lit parse doc.pdf --format markdown

python

from liteparse import LiteParse

lp = LiteParse(output_format="markdown")
result = lp.parse("doc.pdf")
print(result.text)

How Does it Work?

Building a heuristic pipeline for markdown boils down to two parts: the signals you can detect, and the output element types that respond to those signals. Much like a machine-learning model, it comes down to inputs, weights, and activations!

PDFs carry a ton of data: font family, font size, text location, and more. All of these are treated as input signals for classifying text into specific markdown elements like paragraphs, tables, lists, and headers.

LiteParse uses a custom PDFium fork to capture as much signal as possible, then combines that with signals from our existing grid-projection algorithm to produce the best markdown output we can with a purely heuristic, rule-based approach.
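
To make the signals-to-elements idea concrete, here is a minimal sketch of a rule-based classifier. It is not LiteParse's implementation: the `Line` record, the thresholds, and the function names are all hypothetical, and a real engine would use many more signals (position, font family, spacing) than the three shown here.

```python
import re
from dataclasses import dataclass
from statistics import median

# Hypothetical per-line record; NOT LiteParse's internal data model.
@dataclass
class Line:
    text: str
    font_size: float
    bold: bool = False

# Leading "-", "*", "•", or "1." / "1)" style markers.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def classify(line: Line, body_size: float) -> str:
    """Map one line's signals to a markdown element type."""
    if BULLET.match(line.text):
        return "list_item"
    ratio = line.font_size / body_size
    # Thresholds below are assumptions picked for illustration only.
    if ratio >= 1.6:
        return "h1"
    if ratio >= 1.25:
        return "h2"
    if line.bold and len(line.text) < 60:
        return "h3"
    return "paragraph"

def to_markdown(lines: list[Line]) -> str:
    # Treat the median font size as the body text size.
    body_size = median(l.font_size for l in lines)
    prefixes = {"h1": "# ", "h2": "## ", "h3": "### "}
    out = []
    for line in lines:
        kind = classify(line, body_size)
        if kind == "list_item":
            # Simplification: every list marker becomes "- ".
            out.append("- " + BULLET.sub("", line.text, count=1))
        else:
            out.append(prefixes.get(kind, "") + line.text)
    return "\n".join(out)

sample = [
    Line("Title", 24.0),
    Line("Intro text here.", 10.0),
    Line("Section", 14.0),
    Line("• first", 10.0),
    Line("2. second", 10.0),
    Line("Note", 10.0, bold=True),
    Line("More body.", 10.0),
]
print(to_markdown(sample))
```

The "weights" in this picture are the hand-picked thresholds: each rule fires when its input signal crosses a cutoff, and tuning those cutoffs against real documents is where most of the work goes.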

As time goes on, we expect this mode to get even better. There’s an extremely long tail of PDFs that we can adapt to, and covering more of it is mostly a matter of time.

Measuring Markdown Performance

It turns out markdown is not only a highly requested output option; it's also very hard to benchmark PDF parsing tools without it.

All existing benchmarks (ParseBench, olmOCR-bench, opendataloader-bench) are strongly geared toward measuring markdown. By building this markdown pipeline, we were able to deliver an entirely new output mode while also being able to measure and improve our overall extraction quality.

In the spirit of “Lite”-ness, we built the markdown mode in LiteParse to be as light and fast as possible. This approach prioritizes speed, but it also means accepting an upper bound on accuracy (we aren’t going to do better than LlamaParse with this approach).

In order to compare fairly, we scoped our comparisons to open-source tools that do not leverage larger AI models for parsing. This means OCR and other model integrations are disabled when benchmarking.

Benchmark Results

ParseBench

We’ve written a lot about ParseBench already: it covers 2000+ documents measured across 5 key metrics that end users actually care about. These are intentionally hard documents, so for tools that run without larger AI models, these scores are quite impressive.

LiteParse leads Overall. The Charts and Visual Grounding columns are effectively noise for every model-free tool here. ParseBench scores charts (and parts of its layout/visual-grounding metrics) by comparing structured data extracted from the chart, which fundamentally requires an ML model to recover. A heuristic engine has nothing to emit there, so all model-free tools cluster near zero. We're reporting those columns for completeness only.

| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |
|---|---|---|---|---|---|
| Overall | 0.328 | 0.310 | 0.294 | 0.266 | 0.186 |
| Tables | 0.403 | 0.373 | 0.352 | 0.266 | 0.158 |
| Content Faithfulness | 0.686 | 0.609 | 0.661 | 0.561 | 0.645 |
| Semantic Formatting | 0.409 | 0.446 | 0.341 | 0.351 | 0.009 |
| Charts* | 0.034 | 0.015 | 0.001 | 0.053 | 0.020 |
| Visual Grounding* | 0.107 | 0.107 | 0.108 | 0.099 | 0.099 |

* Numbers here are mostly noise: none of these tools output the data needed to be scored properly on these metrics.

opendataloader-bench

opendataloader-bench is a small benchmark of 200 docs. It measures three main things: Reading Order Similarity (NID), Table Structure Similarity (TEDS), and Heading-Level Similarity (MHS). You can read more about these metrics in their GitHub repo.

Here, LiteParse leads across all categories. The official repo also reports scores from actual AI models and LiteParse is quite competitive there as well, but for this blog post we are only comparing to similar model-free OSS tools.

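| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |
|---|---|---|---|---|---|
| Overall | 0.871 | 0.732 | 0.831 | 0.792 | 0.589 |
| NID (Reading Order) | 0.908 | 0.885 | 0.902 | 0.876 | 0.844 |
| TEDS (Tables) | 0.693 | 0.401 | 0.483 | 0.630 | 0.273 |
| MHS (Headers) | 0.816 | 0.412 | 0.739 | 0.602 | 0.000 |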

olmOCR-bench

LiteParse leads in most categories in olmOCR-bench. Some of its rule checks don’t always reflect desired output, and they sometimes disagree with each other (which we’ve written about before), but it is a useful signal nonetheless.

LiteParse scores well on the baseline sanity checks and makes a strong showing on the headers/footers, multi-column, and table tests. Low scores on old scans and math are expected, as these typically require OCR. The remaining scores are close to those of the other tools.

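| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |
|---|---|---|---|---|---|
| Overall | 39.2% | 32.9% | 32.7% | 30.5% | 28.7% |
| baseline | 99.9% | 84.5% | 86.9% | 82.9% | 86.8% |
| headers_footers | 55.9% | 39.5% | 37.5% | 52.0% | 38.8% |
| multi_column | 67.1% | 66.7% | 62.8% | 38.2% | 39.3% |
| table_tests | 48.0% | 46.3% | 25.7% | 40.0% | 19.9% |
| long_tiny_text | 29.2% | 12.7% | 35.1% | 17.4% | 31.2% |
| old_scans | 13.3% | 13.3% | 13.3% | 13.3% | 13.3% |
| arxiv_math | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| old_scans_math | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |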
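
A quick way to read the Overall row: the numbers above look consistent with an unweighted mean of the eight category rows. That is an inference from the figures in this post, not a statement of how olmOCR-bench defines its score. The sketch below simply redoes the arithmetic under that assumption, which also shows how much the three OCR-dependent categories pull every tool's Overall down.

```python
# Per-category pass rates (%) copied from the table above, in row order:
# baseline, headers_footers, multi_column, table_tests,
# long_tiny_text, old_scans, arxiv_math, old_scans_math
scores = {
    "LiteParse":      [99.9, 55.9, 67.1, 48.0, 29.2, 13.3, 0.0, 0.0],
    "pymupdf4llm":    [84.5, 39.5, 66.7, 46.3, 12.7, 13.3, 0.0, 0.0],
    "opendataloader": [86.9, 37.5, 62.8, 25.7, 35.1, 13.3, 0.0, 0.0],
    "pdf-inspector":  [82.9, 52.0, 38.2, 40.0, 17.4, 13.3, 0.0, 0.0],
    "markitdown":     [86.8, 38.8, 39.3, 19.9, 31.2, 13.3, 0.0, 0.0],
}

# Overall row as reported in the table.
reported_overall = {
    "LiteParse": 39.2,
    "pymupdf4llm": 32.9,
    "opendataloader": 32.7,
    "pdf-inspector": 30.5,
    "markitdown": 28.7,
}

def macro_mean(values: list[float]) -> float:
    """Unweighted mean: every category counts equally."""
    return sum(values) / len(values)

for tool, values in scores.items():
    mean = macro_mean(values)
    # Reported values are rounded to one decimal, so allow half a unit of slack.
    matches = abs(mean - reported_overall[tool]) < 0.05
    print(f"{tool:15s} mean={mean:.3f} reported={reported_overall[tool]} match={matches}")
```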

Speed Tests

Speed was measured on a fixed set of PDFs with varying layouts, page counts, and content types. The times reported are the average time taken to process a single page across the entire test set. You can find the source data here and the benchmark code here.

| Provider | ms/page (agg) |
|---|---|
| liteparse | 3.16 ms |
| pdf-inspector | 3.83 ms |
| opendataloader | 66.3 ms |
| pymupdf4llm-md | 141.5 ms |
| markitdown | 182.5 ms |
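
For readers who want to reproduce a per-page figure on their own documents, here is a minimal timing sketch. It assumes "agg" means total wall-clock time divided by total pages across the set; the `ms_per_page` helper and its signature are hypothetical and this is not the benchmark code linked above.

```python
import time
from typing import Callable, Iterable, Tuple

def ms_per_page(
    parse: Callable[[str], object],
    docs: Iterable[Tuple[str, int]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return aggregate milliseconds per page.

    docs is an iterable of (path, page_count) pairs. Page counts are
    supplied by the caller so that counting pages is not part of the
    timed region.
    """
    total_seconds = 0.0
    total_pages = 0
    for path, pages in docs:
        start = clock()
        parse(path)  # only the parse call itself is timed
        total_seconds += clock() - start
        total_pages += pages
    if total_pages == 0:
        raise ValueError("no pages to measure")
    return 1000.0 * total_seconds / total_pages

# Deterministic demo with a fake clock and a no-op parser:
# 10 ms for a 2-page doc plus 20 ms for an 8-page doc.
ticks = iter([0.0, 0.010, 0.010, 0.030])
demo = ms_per_page(lambda path: None, [("a.pdf", 2), ("b.pdf", 8)], clock=lambda: next(ticks))
print(f"{demo:.2f} ms/page")
```

To time a real parser, pass a callable such as `lambda p: LiteParse(output_format="markdown").parse(p)`, using the Python API shown earlier. Note that an aggregate like this weights long documents more heavily than averaging each document's own per-page time would.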

Licensing & Portability

Across all the tools tested, there is a mix of licenses and supported runtimes.

LiteParse is permissively licensed (Apache-2.0) and runs as a single engine across four ecosystems, including natively in the browser via WASM. The Python-only tools can't go where a browser or a Node service needs them, and pymupdf4llm inherits PyMuPDF's AGPL-3.0 copyleft, which is a non-starter for many commercial codebases without a paid license.

| Tool | License | Languages / Runtimes |
|---|---|---|
| LiteParse | Apache-2.0 | Rust, Python, Node, WASM (browser) |
| pymupdf4llm | AGPL-3.0 (commercial available) | Python |
| markitdown | MIT | Python |
| opendataloader | Apache-2.0 | Java core (+ Python, Node.js wrappers) |
| pdf-inspector | MIT | Rust |

A Note on v2.1 Scope

These three benchmarks don't always agree on what "good" markdown looks like. We repeatedly found that tuning output to win one benchmark (e.g. olmOCR-bench) would regress another (e.g. ParseBench), and vice versa. When we inspected the PDFs by eye, we often found output that “scored well” but didn't look good. Rather than benchmaxxing any single harness, we kept v2.1 tuned for solid, balanced performance across all three. There's plenty of headroom to push individual sub-categories over time (and we will!).

Try it Today!

LiteParse runs everywhere and v2.1 is available now:

bash

# Node Library + CLI
npm i @llamaindex/liteparse

# Python Library + CLI
pip install liteparse

# Rust Library + CLI
cargo install liteparse

# WASM Library
npm i @llamaindex/liteparse-wasm

Or, use it with your favourite coding agent directly as a skill:

bash

# Claude Code, Codex, OpenCode, etc.
npx skills add run-llama/llamaparse-agent-skills --skill liteparse

# Pi Coding Agent Extension
pi install npm:@llamaindex/liteparse-pi-extension@latest

Follow these links for docs and details on source code:
