> crawl4ai
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
Crawl4AI — LLM-Friendly Web Crawler
Core Capabilities
Basic Crawling
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                word_count_threshold=10,  # Skip blocks with <10 words
                cache_mode=CacheMode.ENABLED,
            ),
        )
        print(result.markdown)           # Clean markdown (LLM-ready)
        print(result.cleaned_html)       # Cleaned HTML
        print(result.media["images"])    # Extracted images
        print(result.links["internal"])  # Internal links
        print(result.links["external"])  # External links
        print(result.metadata)           # Title, description, keywords

asyncio.run(main())
LLM-Powered Structured Extraction
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/keyboards",
            config=CrawlerRunConfig(extraction_strategy=extraction),
        )
    products = [Product(**p) for p in json.loads(result.extracted_content)]
    for p in products:
        print(f"{p.name}: ${p.price} — {p.rating}★")

asyncio.run(main())
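Because `extracted_content` is a JSON string that can be `None` or malformed when the LLM call fails, it is worth parsing defensively before instantiating models. A hedged sketch; `parse_extracted` and its `model_cls` parameter are illustrative names, not Crawl4AI API:

```python
import json

def parse_extracted(extracted_content, model_cls=None):
    """Parse result.extracted_content defensively.

    A failed extraction may leave the content None or non-JSON; model_cls
    is any callable that validates one record (e.g. a Pydantic model).
    """
    if not extracted_content:
        return []
    try:
        records = json.loads(extracted_content)
    except json.JSONDecodeError:
        return []
    if isinstance(records, dict):  # single object instead of a list
        records = [records]
    return [model_cls(**r) if model_cls else r for r in records]
```

In the example above this replaces the bare `json.loads` call: `products = parse_extracted(result.extracted_content, Product)`.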
CSS-Based Extraction (No LLM)
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}
extraction = JsonCssExtractionStrategy(schema)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/all",
            config=CrawlerRunConfig(
                extraction_strategy=extraction,
                js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
                wait_for=".product-card:nth-child(20)",  # Wait for 20 items
            ),
        )
    products = json.loads(result.extracted_content)

asyncio.run(main())
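The `href`/`src` attributes captured by an attribute field are often relative, so it helps to resolve them against the page URL before storing. A small stdlib sketch; `absolutize` and the `url_fields` default are assumptions tied to the schema above, not part of the library:

```python
from urllib.parse import urljoin

def absolutize(records, base_url, url_fields=("url", "image")):
    """Resolve relative URLs in extracted records against the page URL."""
    for r in records:
        for field in url_fields:
            if r.get(field):
                r[field] = urljoin(base_url, r[field])
    return records
```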
Multi-Page Crawling
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Session-based: maintains cookies and auth state across requests
        session_id = "docs-crawl"
        # Log in first
        await crawler.arun(
            url="https://docs.example.com/login",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="""
                    document.querySelector('#email').value = 'user@example.com';
                    document.querySelector('#password').value = 'pass';
                    document.querySelector('form').submit();
                """,
                wait_for="#dashboard",
            ),
        )
        # Crawl authenticated pages with the same session
        urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(session_id=session_id),
            )
            # Save markdown for RAG indexing (save_to_knowledge_base is your own function)
            save_to_knowledge_base(url, result.markdown)

asyncio.run(main())
Installation
pip install crawl4ai
crawl4ai-setup # Install Playwright browsers
Best Practices
- Markdown output — Use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
- CSS extraction — Use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
- LLM extraction — Use `LLMExtractionStrategy` for unstructured pages; a Pydantic schema ensures valid output
- JavaScript rendering — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
- Sessions — Use `session_id` for multi-page crawls; maintains cookies and auth state across requests
- Caching — Enable `CacheMode.ENABLED` during development to avoid re-crawling the same pages
- Wait conditions — Use `wait_for` CSS selectors to ensure content has loaded before extraction
- Rate limiting — Add delays between requests; respect robots.txt; be a good citizen
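For the rate-limiting point above, a simple pattern is a randomized sleep between sequential requests. A minimal sketch; `crawl_politely` and its `crawler_arun` parameter are illustrative names (pass any async callable, e.g. a small wrapper around `crawler.arun`):

```python
import asyncio
import random

async def crawl_politely(crawler_arun, urls, min_delay=1.0, max_delay=3.0):
    """Fetch URLs sequentially with a randomized delay between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:  # no delay before the first request
            await asyncio.sleep(random.uniform(min_delay, max_delay))
        results.append(await crawler_arun(url))
    return results
```

Randomizing the interval avoids a fixed request cadence; keep the delays well above zero for sites you don't control.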