Today, we are announcing the first of our dedicated API's for handling spreadsheets, available today in Beta for free!
Introducing LlamaSheets
Spreadsheets are everywhere. From financial models, product catalogs, and operational reports, spreadsheets exist across a wide range of formats and levels of organization.
Unlike typical unstructured documents, spreadsheets contain highly structured numerical data, complex formatting, and visual hierarchies that traditional text parsing cannot capture. LLMs and agents need to understand not just the raw cell values, but the semantic relationships, formatting patterns, and hierarchical structure encoded in these documents.
The challenge is that "messy" spreadsheets often use visual formatting (bold headers, colored cells, merged regions) to convey meaning rather than explicit data structures. Before any AI automation can happen, this normalization step,extracting structured data while preserving semantic context, is critical. That is why we’re so excited today to announce our newest product to LlamaCloud, LlamaSheets!
LlamaSheets is a new LlamaCloud API that automatically structures complex spreadsheets into AI-ready data using semantic understanding. The input is any .xlsx file, and the output is parquet files that can be used in any agent or downstream application.
Technical Approach
Our processing algorithm implements a sophisticated multi-stage pipeline:
- Feature Extraction & Clustering - 40+ features per cell are extracted (position, formatting, etc.) and are then featurized for clustering
- Intelligent Region Classification - Clusters are then classified into specific types of regions are classified using a combination of traditional ML techniques and agent-based processing
- Adaptive Table Segmentation - A scoring system evaluates boundary quality between regions and iteratively refines boundaries
- Hierarchical Structure Preservation - ****Intelligent extraction within each table is applied that preserves multi-level headers and complex table structures and preserves types where possible (dates, numbers, booleans, text)
Output Contents and Format
LlamaSheets produces multiple types of outputs, mostly as parquet files:
- Table data: Clean, typed DataFrames with preserved data types (dates, numbers, strings, booleans). Column names are intelligently extracted from header rows.
- Extra data: Data in your spreadsheet that doesn't explicitly belong in a structured data table (notes, titles, etc.)
- Cell metadata: 40+ features per cell including formatting (
font_bold,background_color_rgb), position (row_number,coordinate), data types (is_date_like,is_percentage), and layout (is_merged_cell,horizontal_alignment) - Sheet context: Optional LLM-generated titles and descriptions for each worksheet and extracted table region
Use Cases
LlamaSheets enables AI automation across diverse spreadsheet workflows. Here's just a few examples:
- Financial Analysis: Extract quarterly revenue tables from complex financial reports with merged headers and calculate KPIs automatically
- Multi-Region Data Consolidation: Parse and combine sales data from dozens of regional spreadsheets with inconsistent formatting
- Budget Parsing with Metadata: Use background colors and bold formatting to identify department groupings and category hierarchies in budget files
- Automated Weekly Reports: Build end-to-end pipelines that extract, validate, analyze, and generate reports from recurring spreadsheet uploads
- Custom Agent Integrations: Load extracted Parquet files into AI Agent frameworks (like LlamaIndex) for interactive data exploration, script generation, and more
Example: Extract and Analyze in 5 Lines
python
from llama_cloud_services.beta.sheets import LlamaSheets
client = LlamaSheets(api_key="llx-...")
results = await client.aextract_regions("budget.xlsx")
# Download as pandas DataFrame
df = await client.adownload_region_as_dataframe(
results.job_id,
results.regions[0].region_id,
result_type=regions[0].region_type
)
# Access rich cell metadata
metadata = await client.adownload_region_as_dataframe(
results.job_id,
results.regions[0].region_id,
result_type="cell_metadata"
) Available now in beta
LlamaSheets is available today in beta through multiple interfaces:
- 🧩 Playground UI: Experiment with sample spreadsheets directly in the browser at cloud.llamaindex.ai
- 💻 Python SDK: The llama-cloud-services package provides async/sync methods for uploading files, creating extraction jobs, polling for completion, and downloading Parquet results as pandas DataFrames or raw bytes
- 🌐 REST API: Four-step workflow via
/api/v1/beta/sheets/endpoints: (1) Upload file → (2) Create job with parsing config → (3) Poll for completion → (4) Download Parquet files via presigned URLs - 📘 Build Agents Integrate with any agent framework (LlamaIndex, Claude Code, Cursor, etc.) by loading extracted Parquet files and cell metadata into your agent's context
What's Next
During the beta period, we're focused on performance optimization, enhanced accuracy, additional output formats, and future API’s that build on the region and table extraction to provide more end-to-end experiences.
We encourage users to try the API, provide feedback on extraction quality, and help us prioritize features for the full release!
Let us know what you think: