> pdf-ocr
Extract text from scanned PDFs using optical character recognition. Use when a user asks to OCR a PDF, read a scanned document, extract text from an image PDF, digitize a scanned file, convert a scanned PDF to text, or read text from a photograph of a document. Supports multiple languages and handles low-quality scans.
curl "https://skillshub.wtf/TerminalSkills/skills/pdf-ocr?format=md"
PDF OCR
Overview
Extract readable text from scanned or image-based PDF documents using optical character recognition (OCR). This skill converts PDF pages to images, runs OCR to detect text, and outputs clean structured text. Handles multi-page documents, multiple languages, and low-quality scans with preprocessing.
Instructions
When a user asks to OCR a scanned PDF or extract text from an image-based PDF, follow these steps:
Step 1: Check if OCR is actually needed
First, attempt normal text extraction. If the PDF already contains selectable text, OCR is unnecessary:
import pdfplumber

def check_text_content(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # Sampling the first three pages is enough to detect a text layer
        for page in pdf.pages[:3]:
            text = page.extract_text()
            if text and len(text.strip()) > 50:
                return True  # Has extractable text, OCR not needed
    return False  # Image-only PDF, needs OCR
Step 2: Install and verify dependencies
Ensure the required tools are available:
# Install Tesseract OCR engine and Poppler (pdf2image needs Poppler to render pages)
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr poppler-utils
# macOS:
brew install tesseract poppler
# Install Python packages (pdfplumber is used by the text-layer check in Step 1)
pip install pytesseract pdf2image Pillow pdfplumber
# For additional languages:
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-jpn # Japanese
Step 3: Convert PDF pages to images
from pdf2image import convert_from_path

def pdf_to_images(pdf_path, dpi=300):
    # Returns one PIL image per page
    images = convert_from_path(pdf_path, dpi=dpi)
    return images
Use 300 DPI for standard documents. Increase to 400-600 DPI for small text or low-quality scans.
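Raising the DPI has a memory cost. The sketch below gives a rough estimate of the uncompressed size of one rendered page. It assumes a US Letter page (8.5 x 11 in) and 3 bytes per pixel (RGB); the function name is illustrative and not part of pdf2image.

```python
def rendered_page_bytes(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=3):
    """Approximate uncompressed size of one rendered page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (300, 400, 600):
    mib = rendered_page_bytes(dpi) / (1024 * 1024)
    print(f"{dpi} DPI: about {mib:.0f} MiB per page")
```

Doubling the DPI quadruples the pixel count, so a 600 DPI page needs roughly four times the memory of a 300 DPI page.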
Step 4: Preprocess images for better accuracy
Apply preprocessing to improve OCR quality:
from PIL import Image, ImageFilter, ImageEnhance

def preprocess_image(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Increase contrast
    enhancer = ImageEnhance.Contrast(gray)
    enhanced = enhancer.enhance(2.0)
    # Sharpen
    sharpened = enhanced.filter(ImageFilter.SHARPEN)
    # Binarize (threshold)
    threshold = 150
    binary = sharpened.point(lambda x: 255 if x > threshold else 0)
    return binary
Step 5: Run OCR on each page
import pytesseract

def ocr_pages(images, lang='eng'):
    results = []
    for i, image in enumerate(images):
        processed = preprocess_image(image)
        text = pytesseract.image_to_string(processed, lang=lang)
        results.append({
            "page": i + 1,
            "text": text.strip(),
            "confidence": get_confidence(processed, lang)
        })
    return results

def get_confidence(image, lang='eng'):
    data = pytesseract.image_to_data(image, lang=lang, output_type=pytesseract.Output.DICT)
    # Parse with float(): depending on the pytesseract/Tesseract version, conf values
    # may arrive as ints, floats, or numeric strings. Entries <= 0 are skipped,
    # which includes the -1 that Tesseract reports for non-word boxes.
    confidences = [float(c) for c in data['conf'] if float(c) > 0]
    return sum(confidences) / len(confidences) if confidences else 0
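Once each page has a confidence score, the pages that need manual review can be picked out automatically. The following is a minimal sketch: `summarize_confidence` and the 90% cutoff are assumptions chosen for illustration, not part of pytesseract, and the input is expected to have the same shape as the list returned by `ocr_pages`.

```python
def summarize_confidence(results, review_below=90.0):
    """Return the mean page confidence and the pages below the cutoff."""
    if not results:
        return {"average": 0.0, "review_pages": []}
    average = sum(r["confidence"] for r in results) / len(results)
    review_pages = [r["page"] for r in results if r["confidence"] < review_below]
    return {"average": average, "review_pages": review_pages}


# Hypothetical per-page scores, shaped like the output of ocr_pages().
sample = [
    {"page": 1, "text": "", "confidence": 96},
    {"page": 2, "text": "", "confidence": 94},
    {"page": 3, "text": "", "confidence": 91},
    {"page": 4, "text": "", "confidence": 95},
    {"page": 5, "text": "", "confidence": 88},
]
summary = summarize_confidence(sample)
print(f"Average confidence: {summary['average']:.1f}%")
print(f"Pages to review: {summary['review_pages']}")
```

The cutoff is a judgment call; tune it to how much manual checking the use case can tolerate.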
Step 6: Output the results
Combine and format the extracted text. Save as a text file or return directly:
def save_results(results, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        for page in results:
            # One header line per page, followed by that page's text
            f.write(f"--- Page {page['page']} (confidence: {page['confidence']:.0f}%) ---\n")
            f.write(page['text'] + "\n\n")
    return output_path
Examples
Example 1: OCR a scanned contract
User request: "Extract text from this scanned contract scan_contract.pdf"
Actions taken:
- Check for existing text layer - none found, OCR needed
- Convert 5 pages to images at 300 DPI
- Preprocess and run OCR in English
Output:
OCR completed for scan_contract.pdf (5 pages)
Page-by-page confidence:
Page 1: 96% confidence
Page 2: 94% confidence
Page 3: 91% confidence
Page 4: 95% confidence
Page 5: 88% confidence (lower quality scan detected)
Output saved to: scan_contract_text.txt (4,230 words extracted)
Note: Page 5 had lower image quality. Review that page for accuracy.
Example 2: OCR a multi-language document
User request: "Read this scanned document, it's in German: rechnung.pdf"
Actions taken:
- Verify tesseract-ocr-deu language pack is installed
- Convert pages to images at 300 DPI
- Run OCR with lang='deu'
Output:
OCR completed for rechnung.pdf (2 pages) using German language model
Page 1: 93% confidence
Page 2: 95% confidence
Extracted 812 words. Output saved to: rechnung_text.txt
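Example 2 uses a single language. When a document mixes languages, Tesseract accepts several codes joined with `+` in the `lang` argument. The helper below is a small sketch for building that string; `build_lang_argument` is a hypothetical name, and it does not check that the language packs are actually installed.

```python
def build_lang_argument(codes):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(build_lang_argument(["eng", "deu"]))
```

The resulting string can be passed as `lang=` to `pytesseract.image_to_string`.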
Example 3: Batch OCR multiple scanned files
User request: "OCR all the scanned PDFs in the ./receipts/ folder"
Actions taken:
- Find all PDF files in ./receipts/ (found 12 files)
- Check each for existing text layer
- Run OCR on the 10 files that need it
Output:
Batch OCR complete: 12 files processed
Already had text: 2 files (skipped)
OCR completed: 10 files
Average confidence: 92%
Output files saved to ./receipts/ocr_output/
receipt_001_text.txt (97% confidence)
receipt_002_text.txt (94% confidence)
...
receipt_010_text.txt (85% confidence - review recommended)
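The batch flow in Example 3 can be sketched as a small driver. This is an illustration under assumptions: `batch_ocr` is a hypothetical helper, the text-layer check and the OCR routine are passed in as callables so the loop itself stays independent of Tesseract, and the `ocr_output` folder and `_text.txt` suffix follow the naming shown in the examples.

```python
from pathlib import Path


def batch_ocr(folder, needs_ocr, run_ocr, output_dirname="ocr_output"):
    """OCR every PDF in `folder` that lacks a text layer.

    needs_ocr(path) -> bool and run_ocr(path) -> str are supplied by the caller.
    """
    folder = Path(folder)
    output_dir = folder / output_dirname
    output_dir.mkdir(exist_ok=True)

    summary = {"skipped": [], "processed": []}
    for pdf_path in sorted(folder.glob("*.pdf")):
        if not needs_ocr(pdf_path):
            # The PDF already has selectable text, so leave it alone
            summary["skipped"].append(pdf_path.name)
            continue
        text = run_ocr(pdf_path)
        out_path = output_dir / f"{pdf_path.stem}_text.txt"
        out_path.write_text(text, encoding="utf-8")
        summary["processed"].append(out_path.name)
    return summary
```

In practice, `needs_ocr` could wrap `check_text_content` from Step 1 (negating its result), and `run_ocr` could chain `pdf_to_images` and `ocr_pages` and join the page texts.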
Guidelines
- Always check for existing text content before running OCR. Many PDFs already have a text layer.
- Use 300 DPI as the default resolution. Increase for small fonts or poor quality scans.
- Report confidence scores per page so users know which pages may need manual review.
- For multi-language documents, specify the correct Tesseract language code. Multiple languages can be combined: lang='eng+deu'.
- Preprocess images before OCR: grayscale conversion, contrast enhancement, and binarization significantly improve accuracy.
- For rotated or skewed scans, apply deskewing before OCR using image rotation detection.
- Large PDFs should be processed page by page to manage memory usage.
- Common Tesseract language codes: eng (English), deu (German), fra (French), spa (Spanish), jpn (Japanese), chi_sim (Chinese Simplified), kor (Korean).
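The guideline on large PDFs can be sketched as follows: instead of converting every page at once, render a few pages at a time using the `first_page` and `last_page` parameters of pdf2image's `convert_from_path`. The helper names and the default chunk size of 10 are assumptions made for illustration.

```python
def page_chunks(total_pages, chunk_size=10):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def iter_page_images(pdf_path, dpi=300, chunk_size=10):
    """Yield (page_number, image) pairs, holding one chunk in memory at a time."""
    # Imported here so page_chunks stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_chunks(total_pages, chunk_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, image in enumerate(images):
            yield first + offset, image
```

Each yielded image can be preprocessed and passed to Tesseract before the next chunk is rendered, so peak memory depends on the chunk size rather than the document length.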