> crawl4ai
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
Crawl4AI — LLM-Friendly Web Crawler
Core Capabilities
Basic Crawling
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                word_count_threshold=10,  # Skip text blocks with fewer than 10 words
                cache_mode=CacheMode.ENABLED,
            ),
        )
        print(result.markdown)           # Clean markdown (LLM-ready)
        print(result.cleaned_html)       # Cleaned HTML
        print(result.media["images"])    # Extracted images
        print(result.links["internal"])  # Internal links
        print(result.links["external"])  # External links
        print(result.metadata)           # Title, description, keywords

asyncio.run(main())
```
LLM-Powered Structured Extraction
```python
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/keyboards",
            config=CrawlerRunConfig(extraction_strategy=extraction),
        )
        products = [Product(**p) for p in json.loads(result.extracted_content)]
        for p in products:
            print(f"{p.name}: ${p.price} — {p.rating}★")

asyncio.run(main())
```
CSS-Based Extraction (No LLM)
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

extraction = JsonCssExtractionStrategy(schema)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/all",
            config=CrawlerRunConfig(
                extraction_strategy=extraction,
                js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
                wait_for=".product-card:nth-child(20)",  # Wait for 20 items
            ),
        )
        products = json.loads(result.extracted_content)
        print(products)

asyncio.run(main())
```
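CSS extraction returns whatever text the selector matched, so fields like price usually need a normalization pass before use. A stand-in for `result.extracted_content` (the sample listings and `parse_price` helper are invented) sketches the cleanup:

```python
import json

# Stand-in for result.extracted_content — the JSON string the strategy returns
extracted_content = json.dumps([
    {"title": "Keychron K2", "price": "$79.99", "url": "/k2", "image": "/k2.jpg"},
    {"title": "HHKB Pro 2", "price": "$240.00", "url": "/hhkb", "image": "/hhkb.jpg"},
])

products = json.loads(extracted_content)

def parse_price(text: str) -> float:
    """CSS extraction yields raw text; strip currency symbols before casting."""
    return float(text.replace("$", "").replace(",", ""))

for p in products:
    p["price"] = parse_price(p["price"])
```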
Multi-Page Crawling
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Session-based: maintains cookies, auth state
        session_id = "docs-crawl"

        # Login first
        await crawler.arun(
            url="https://docs.example.com/login",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="""
                    document.querySelector('#email').value = 'user@example.com';
                    document.querySelector('#password').value = 'pass';
                    document.querySelector('form').submit();
                """,
                wait_for="#dashboard",
            ),
        )

        # Crawl authenticated pages in the same session
        urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(session_id=session_id),
            )
            # Save markdown for RAG indexing (save_to_knowledge_base is user-defined)
            save_to_knowledge_base(url, result.markdown)

asyncio.run(main())
```
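`save_to_knowledge_base` is left to the reader above; one plausible first step is splitting the markdown into paragraph-aligned chunks before embedding and indexing. A minimal sketch, assuming a simple character budget per chunk (the function name and chunk size are illustrative):

```python
def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Split markdown into paragraph-aligned chunks for a RAG index."""
    chunks: list[str] = []
    current = ""
    for para in markdown.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Synthetic document: ten ~132-character paragraphs
doc = "\n\n".join(f"Paragraph {i} " + "x" * 120 for i in range(10))
chunks = chunk_markdown(doc, max_chars=300)
```

Splitting on paragraph boundaries (rather than fixed offsets) keeps each chunk a self-contained unit of prose, which tends to retrieve better than mid-sentence cuts.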
Installation
```bash
pip install crawl4ai
crawl4ai-setup  # Install Playwright browsers
```
Best Practices
- Markdown output — use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
- CSS extraction — use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
- LLM extraction — use `LLMExtractionStrategy` for unstructured pages; a Pydantic schema ensures valid output
- JavaScript rendering — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
- Sessions — use `session_id` for multi-page crawls; maintains cookies and auth state across requests
- Caching — enable `CacheMode.ENABLED` during development to avoid re-crawling the same pages
- Wait conditions — use `wait_for` CSS selectors to ensure content is loaded before extraction
- Rate limiting — add delays between requests; respect robots.txt; be a good citizen
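The rate-limiting advice can be sketched independently of Crawl4AI: a small async throttle that enforces a minimum interval between consecutive requests (the class name and interval are illustrative, not part of the Crawl4AI API):

```python
import asyncio
import time

class RequestThrottle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last: float | None = None

    async def wait(self) -> None:
        # Sleep only as long as needed to honor the interval; first call is free
        if self._last is not None:
            remaining = self.min_interval_s - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()

async def demo() -> float:
    throttle = RequestThrottle(min_interval_s=0.1)
    start = time.monotonic()
    for _ in range(3):
        await throttle.wait()  # in real use, call crawler.arun(...) after this
    return time.monotonic() - start

elapsed = asyncio.run(demo())
```

Because the throttle sleeps inside the event loop, other coroutines keep running during the delay; three calls at a 0.1 s interval take at least 0.2 s in total.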