> data-extractor
Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.
Data Extractor
Overview
Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.
Instructions
When a user asks you to extract data from a document, follow this process:
Step 1: Identify the document format and install dependencies
```shell
# Determine the file type
file document.pdf

# Install dependencies based on format
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl
```
Library selection by format:
- PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
- DOCX: python-docx
- HTML: beautifulsoup4 with lxml
- Excel: openpyxl or pandas
- Images: pytesseract (OCR) with Pillow
- JSON/XML: Python standard library
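The mapping above can be sketched as a small dispatcher that suggests a library from the file extension. The `EXTRACTOR_LIBS` dict and `suggest_library` name are illustrative, not part of any library:

```python
from pathlib import Path

# Illustrative mapping from file extension to the suggested extraction library
EXTRACTOR_LIBS = {
    ".pdf": "pdfplumber",
    ".docx": "python-docx",
    ".html": "beautifulsoup4",
    ".xlsx": "openpyxl",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
}

def suggest_library(path: str) -> str:
    """Return the suggested library for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    return EXTRACTOR_LIBS.get(ext, "standard library")

print(suggest_library("invoice.PDF"))  # pdfplumber
print(suggest_library("report.xml"))   # standard library
```

In practice the extension can lie (a "PDF" may be a scanned image), so the `file` command check above remains the authoritative first step.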
Step 2: Extract raw content
PDF extraction:
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f"--- Page {i+1} ---")
        print(text)
        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```
DOCX extraction:
```python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
```
HTML extraction:
```python
from bs4 import BeautifulSoup

with open("document.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)
```
Step 3: Parse and structure the data
Once you have raw text, extract the target fields:
Pattern-based extraction:
```python
import re
import json

text = "..."  # extracted text from Step 2

# Define patterns for common fields
patterns = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+[-/]?\w+)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        # Use the capture group when the pattern has one, else the full match
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))
```
Line-item extraction from tables:
```python
import pandas as pd

# table_data: a list of table rows from Step 2 (first row is the header)
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)

# Clean up column names and drop empty rows
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(how="all")
```
Step 4: Validate and clean the output
```python
from datetime import datetime

# Type conversion
extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization (adjust the format string to match the source document)
extracted["date"] = datetime.strptime(extracted["date"], "%m/%d/%Y").isoformat()

# Validate required fields
required = ["invoice_number", "date", "total"]
missing = [f for f in required if f not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")
```
Step 5: Output in the desired format
```python
# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)

# CSV output
df.to_csv("extracted_items.csv", index=False)

# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")
```
Examples
Example 1: Extract invoice data from a PDF
User request: "Extract the invoice details from this PDF"
Actions:
- Open the PDF with pdfplumber and extract text
- Use regex patterns to find invoice number, date, vendor, subtotal, tax, total
- Extract the line items table into a DataFrame
- Output a JSON file with header fields and a CSV with line items
Output:
```json
{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
```
Example 2: Extract contacts from a DOCX directory
User request: "Pull all names and email addresses from this company directory document"
Actions:
- Parse the DOCX file, iterate through paragraphs and tables
- Use regex to find email addresses and associated names
- Deduplicate and output as CSV
Output: A CSV file with columns: name, email, department, phone.
Example 3: Convert an HTML report to structured data
User request: "Extract the quarterly results table from this HTML page"
Actions:
- Parse the HTML with BeautifulSoup
- Find the target table by heading or class
- Extract headers and rows into a DataFrame
- Clean column names and convert numeric values
- Export as CSV and provide summary statistics
Output: A clean CSV with quarterly metrics and a summary of key figures.
Guidelines
- Always inspect the raw extracted text before writing parsers. Understanding the layout saves time.
- Use pdfplumber for most PDF extraction. Fall back to PyMuPDF for complex multi-column layouts.
- For scanned PDFs (image-based), use OCR with pytesseract before parsing.
- Validate extracted data types: convert strings to numbers, normalize dates.
- Report extraction confidence: note any fields that could not be found or seem incorrect.
- Handle multi-page documents by accumulating results across pages.
- For batch extraction (many documents of the same type), build a reusable extraction function and apply it across all files.
- Always preserve the original document alongside extracted data for verification.
- When patterns fail, fall back to positional extraction based on text layout.
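The batch-extraction guideline above can be sketched as one reusable function applied across documents. The field patterns and names here are illustrative placeholders, and plain strings stand in for per-file extracted text:

```python
import re

# Illustrative field patterns; in practice these come from inspecting sample documents
PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*([\w-]+)",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
}

def extract_fields(text: str) -> dict:
    """Apply every pattern to one document's text; missing fields map to None."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text, re.IGNORECASE)
        out[field] = m.group(1) if m else None
    return out

# Apply the same extractor to a batch of documents
docs = [
    "Invoice #: INV-001\nTotal: $1,200.00",
    "Invoice #: INV-002\nTotal: $85.50",
]
results = [extract_fields(t) for t in docs]
print(results[0])  # {'invoice_number': 'INV-001', 'total': '1,200.00'}
```

Keeping `None` for missing fields (rather than omitting the key) makes it easy to report extraction coverage across the whole batch afterwards.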