> data-extractor
Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.
Data Extractor
Overview
Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.
Instructions
When a user asks you to extract data from a document, follow this process:
Step 1: Identify the document format and install dependencies
```shell
# Determine the file type
file document.pdf

# Install dependencies based on format
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl
```
Library selection by format:
- PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
- DOCX: python-docx
- HTML: beautifulsoup4 with lxml
- Excel: openpyxl or pandas
- Images: pytesseract (OCR) with Pillow
- JSON/XML: Python standard library
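The mapping above can be sketched as a small dispatcher that suggests a library from the file extension. The `EXTRACTOR_LIBS` dict and `suggest_library` name are illustrative, not part of any library:

```python
from pathlib import Path

# Illustrative mapping from file extension to the suggested extraction library
EXTRACTOR_LIBS = {
    ".pdf": "pdfplumber",
    ".docx": "python-docx",
    ".html": "beautifulsoup4",
    ".xlsx": "openpyxl",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
}

def suggest_library(path: str) -> str:
    """Return the suggested library for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    return EXTRACTOR_LIBS.get(ext, "standard library")

print(suggest_library("invoice.PDF"))  # pdfplumber
print(suggest_library("report.xml"))   # standard library
```

In practice the extension can lie (a "PDF" may be a scanned image), so the `file` command check above remains the authoritative first step.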
Step 2: Extract raw content
PDF extraction:
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f"--- Page {i+1} ---")
        print(text)
        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```
DOCX extraction:
```python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
```
HTML extraction:
```python
from bs4 import BeautifulSoup

with open("document.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)
```
Step 3: Parse and structure the data
Once you have raw text, extract the target fields:
Pattern-based extraction:
```python
import re
import json

text = "..."  # extracted text from Step 2

# Define patterns for common fields
patterns = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+[-/]?\w+)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        # Use the capture group when the pattern has one, else the full match
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))
```
Line-item extraction from tables:
```python
import pandas as pd

# table_data: a list of table rows from Step 2 (first row is the header)
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)

# Clean up column names and drop empty rows
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(how="all")
```
Step 4: Validate and clean the output
```python
from datetime import datetime

# Type conversion
extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization (adjust the format string to match the source document)
extracted["date"] = datetime.strptime(extracted["date"], "%m/%d/%Y").isoformat()

# Validate required fields
required = ["invoice_number", "date", "total"]
missing = [f for f in required if f not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")
```
Step 5: Output in the desired format
```python
# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)

# CSV output
df.to_csv("extracted_items.csv", index=False)

# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")
```
Examples
Example 1: Extract invoice data from a PDF
User request: "Extract the invoice details from this PDF"
Actions:
- Open the PDF with pdfplumber and extract text
- Use regex patterns to find invoice number, date, vendor, subtotal, tax, total
- Extract the line items table into a DataFrame
- Output a JSON file with header fields and a CSV with line items
Output:
```json
{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
```
Example 2: Extract contacts from a DOCX directory
User request: "Pull all names and email addresses from this company directory document"
Actions:
- Parse the DOCX file, iterate through paragraphs and tables
- Use regex to find email addresses and associated names
- Deduplicate and output as CSV
Output: A CSV file with columns: name, email, department, phone.
Example 3: Convert an HTML report to structured data
User request: "Extract the quarterly results table from this HTML page"
Actions:
- Parse the HTML with BeautifulSoup
- Find the target table by heading or class
- Extract headers and rows into a DataFrame
- Clean column names and convert numeric values
- Export as CSV and provide summary statistics
Output: A clean CSV with quarterly metrics and a summary of key figures.
Guidelines
- Always inspect the raw extracted text before writing parsers. Understanding the layout saves time.
- Use pdfplumber for most PDF extraction. Fall back to PyMuPDF for complex multi-column layouts.
- For scanned PDFs (image-based), use OCR with pytesseract before parsing.
- Validate extracted data types: convert strings to numbers, normalize dates.
- Report extraction confidence: note any fields that could not be found or seem incorrect.
- Handle multi-page documents by accumulating results across pages.
- For batch extraction (many documents of the same type), build a reusable extraction function and apply it across all files.
- Always preserve the original document alongside extracted data for verification.
- When patterns fail, fall back to positional extraction based on text layout.
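The batch-extraction guideline above can be sketched as one reusable function applied across documents. The field patterns and names here are illustrative placeholders, and plain strings stand in for per-file extracted text:

```python
import re

# Illustrative field patterns; in practice these come from inspecting sample documents
PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*([\w-]+)",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
}

def extract_fields(text: str) -> dict:
    """Apply every pattern to one document's text; missing fields map to None."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text, re.IGNORECASE)
        out[field] = m.group(1) if m else None
    return out

# Apply the same extractor to a batch of documents
docs = [
    "Invoice #: INV-001\nTotal: $1,200.00",
    "Invoice #: INV-002\nTotal: $85.50",
]
results = [extract_fields(t) for t in docs]
print(results[0])  # {'invoice_number': 'INV-001', 'total': '1,200.00'}
```

Keeping `None` for missing fields (rather than omitting the key) makes it easy to report extraction coverage across the whole batch afterwards.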