> table-extractor

Extract tables from PDFs with high accuracy using camelot. Handles complex table structures including merged cells, multi-line rows, and spanning headers. Use when a user asks to extract a table from a PDF, pull tabular data from a document, convert PDF tables to CSV or Excel, or parse structured tables from reports.

fetch

$curl "https://skillshub.wtf/TerminalSkills/skills/table-extractor?format=md"

SKILL.md•table-extractor

Table Extractor

Overview

Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.

Instructions

When a user asks you to extract tables from a PDF, follow this process:

Step 1: Install and verify dependencies

# Install camelot and its dependencies
pip install "camelot-py[base]" ghostscript opencv-python-headless pandas

# Verify ghostscript is available (required by camelot)
gs --version 2>/dev/null || echo "Install ghostscript: sudo apt install ghostscript"

If ghostscript is not available, fall back to pdfplumber:

pip install pdfplumber pandas

Step 2: Inspect the PDF to locate tables

import camelot

# Quick scan: how many tables are in the document?
tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
print(f"Found {len(tables)} tables using lattice detection")

# If no tables found, try stream detection (for borderless tables)
if len(tables) == 0:
    tables = camelot.read_pdf("document.pdf", pages="all", flavor="stream")
    print(f"Found {len(tables)} tables using stream detection")

# Summary of each table
for i, table in enumerate(tables):
    print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})")
    print(f"Accuracy: {table.accuracy:.1f}%")
    print(table.df.head(3))

Step 3: Choose the right extraction flavor

Lattice flavor (for tables with visible borders/gridlines):

tables = camelot.read_pdf(
    "document.pdf",
    pages="1,2,3",        # Specific pages
    flavor="lattice",
    line_scale=40,         # Adjust line detection sensitivity
    process_background=True # Detect lines on colored backgrounds
)

Stream flavor (for borderless tables, whitespace-separated):

tables = camelot.read_pdf(
    "document.pdf",
    pages="1",
    flavor="stream",
    edge_tol=50,          # Tolerance for edge detection
    row_tol=10,           # Tolerance for grouping text into rows
    columns=["72,200,350,500"]  # Manual column boundaries if auto-detect fails
)

Step 4: Clean and process extracted tables

import pandas as pd

for i, table in enumerate(tables):
    df = table.df

    # Promote first row to header if it contains column names
    if df.iloc[0].str.match(r'^[A-Za-z]').all():
        df.columns = df.iloc[0]
        df = df[1:].reset_index(drop=True)

    # Clean whitespace and newlines within cells
    df = df.apply(lambda col: col.str.strip().str.replace(r'\n', ' ', regex=True))

    # Remove completely empty rows
    df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')

    # Convert numeric columns
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col].str.replace(',', '').str.replace('$', ''))
        except (ValueError, AttributeError):
            pass  # Keep as string

    print(f"\nCleaned Table {i}:")
    print(df.head())

Step 5: Handle complex table structures

Merged cells and spanning headers:

# Forward-fill merged cells (common in row headers)
df.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()

# Handle multi-level column headers
if df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() < 20:
    # Combine first two rows as multi-level header
    new_cols = df.iloc[0] + " - " + df.iloc[1]
    df.columns = new_cols.str.strip(" - ")
    df = df[2:].reset_index(drop=True)

Tables spanning multiple pages:

# Extract from all pages and concatenate
all_tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")

# Group tables that are continuations (same column count)
groups = {}
for t in all_tables:
    key = t.shape[1]
    groups.setdefault(key, []).append(t.df)

for col_count, dfs in groups.items():
    combined = pd.concat(dfs, ignore_index=True)
    # Remove duplicate header rows that appear at page breaks
    combined = combined[~combined.duplicated(keep='first')]

Step 6: Export the results

# CSV (one file per table)
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i+1}.csv", index=False)

# Excel (all tables as separate sheets)
with pd.ExcelWriter("extracted_tables.xlsx") as writer:
    for i, table in enumerate(tables):
        table.df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)

# JSON
for i, table in enumerate(tables):
    table.df.to_json(f"table_{i+1}.json", orient="records", indent=2)

print(f"Exported {len(tables)} tables")

Examples

Example 1: Extract financial tables from an annual report

User request: "Extract all tables from this annual report PDF"

Actions:

Scan all pages with lattice flavor (financial reports typically have bordered tables)
Identify income statement, balance sheet, and cash flow tables by column headers
Clean numeric values (remove $, commas, parentheses for negatives)
Export each table to a separate CSV and combine into one Excel workbook

Output: "Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs."

Example 2: Extract a specific table from a research paper

User request: "Get the results table from page 8 of this paper"

Actions:

Target page 8 specifically: camelot.read_pdf("paper.pdf", pages="8")
If multiple tables on the page, show summaries and let the user pick
Clean the extracted table and handle any multi-line cells
Export as CSV

Output: A single CSV file with the results table, plus a preview of the first few rows printed to the console.

Example 3: Batch process multiple PDFs

User request: "Extract the summary table from each of these 20 monthly reports"

Actions:

import glob

results = []
for pdf_path in sorted(glob.glob("reports/*.pdf")):
    tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
    if tables:
        df = tables[0].df  # First table on first page
        df["source_file"] = pdf_path
        results.append(df)

combined = pd.concat(results, ignore_index=True)
combined.to_csv("all_summaries.csv", index=False)

Output: A single CSV combining the summary table from all 20 reports with a source_file column for traceability.

Guidelines

Always try lattice flavor first (bordered tables). Fall back to stream for borderless tables.
Check the accuracy score on each table. Below 80% indicates extraction issues that need manual review.
For scanned PDFs, run OCR first (e.g., ocrmypdf) before table extraction.
When camelot struggles, try pdfplumber as an alternative: page.extract_table(table_settings={...}).
Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.
For tables with merged cells, use forward-fill on the appropriate columns.
When extracting from multiple pages, watch for repeated header rows at page breaks.
Always preview the extracted data before exporting to catch alignment or parsing issues.
Report extraction quality metrics (accuracy, row/column count) so the user can verify correctness.

> related_skills --same-repo

> zustand

You are an expert in Zustand, the small, fast, and scalable state management library for React. You help developers manage global state without boilerplate using Zustand's hook-based stores, selectors for performance, middleware (persist, devtools, immer), computed values, and async actions — replacing Redux complexity with a simple, un-opinionated API in under 1KB.

> zod

You are an expert in Zod, the TypeScript-first schema declaration and validation library. You help developers define schemas that validate data at runtime AND infer TypeScript types at compile time — eliminating the need to write types and validators separately. Used for API input validation, form validation, environment variables, config files, and any data boundary.

> xero-accounting

Integrate with the Xero accounting API to sync invoices, expenses, bank transactions, and contacts — and generate financial reports like P&L and balance sheet. Use when: connecting apps to Xero, automating bookkeeping workflows, syncing accounting data, or pulling financial reports programmatically.

> windsurf-rules

Configure Windsurf AI coding assistant with .windsurfrules and workspace rules. Use when: customizing Windsurf for a project, setting AI coding standards, creating team-shared Windsurf configurations, or tuning Cascade AI behavior.

┌ stats

installs/wk0

░░░░░░░░░░

github stars76

██████████

first seenMar 17, 2026

└────────────

┌ repo

TerminalSkills/skills

by TerminalSkills

└────────────

┌ tags

#pdf

└────────────