> apache-spark
Process large-scale data with Apache Spark. Use when a user asks to process big data, run distributed computations, build ETL pipelines, perform data analysis at scale, or use PySpark for data engineering.
curl "https://skillshub.wtf/TerminalSkills/skills/apache-spark?format=md"Apache Spark
Overview
Apache Spark is the de facto standard engine for large-scale distributed data processing. It handles batch processing, streaming, SQL, machine learning, and graph processing, and PySpark provides a Python API. Spark runs on standalone clusters, YARN, Kubernetes, or managed services (Databricks, AWS EMR, GCP Dataproc).
Instructions
Step 1: PySpark Setup
pip install pyspark
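To sanity-check the install before touching a cluster, you can spin up a throwaway local-mode session; this is a minimal sketch, and the app name and sample row are arbitrary.
# Quick smoke test: local-mode session using all local cores
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "ok")], ["id", "status"]).show()
spark.stop()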
Step 2: DataFrame Operations
# etl/process.py — PySpark data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/raw/events/")

# Transform
processed = (df
    .filter(F.col("event_type").isin(["purchase", "signup"]))
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("revenue", F.col("amount") * F.col("quantity"))
    .groupBy("date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("revenue").alias("total_revenue"),
        F.countDistinct("user_id").alias("unique_users"),
    )
    .orderBy("date")
)

# Write results
processed.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/daily_metrics/")
Step 3: SQL Interface
# Register the DataFrame as a SQL temp view
df.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT
        date_trunc('month', timestamp) AS month,
        COUNT(DISTINCT user_id) AS monthly_active_users,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
    FROM events
    WHERE timestamp >= '2025-01-01'
    GROUP BY 1
    ORDER BY 1
""")
result.show()
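SQL and the DataFrame API go through the same Catalyst optimizer, so the query above can be written either way; as a sketch, here is the equivalent DataFrame version of the same monthly rollup.
# Equivalent monthly rollup using the DataFrame API instead of SQL
monthly = (df
    .filter(F.col("timestamp") >= "2025-01-01")
    .groupBy(F.date_trunc("month", "timestamp").alias("month"))
    .agg(
        F.countDistinct("user_id").alias("monthly_active_users"),
        F.sum(F.when(F.col("event_type") == "purchase", F.col("amount")).otherwise(0)).alias("revenue"),
    )
    .orderBy("month")
)
monthly.show()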
Step 4: Structured Streaming
# Real-time processing from Kafka (requires the spark-sql-kafka connector package, e.g. via spark-submit --packages)
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Schema of the JSON payload in the Kafka "value" field (same event fields as Step 2)
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")

query = parsed \
    .groupBy(F.window("timestamp", "5 minutes"), "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
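The console sink is only for debugging. For durable output you would typically add a watermark, switch to append mode, write to files, and set a checkpoint location; a rough sketch with assumed S3 paths:
# Hypothetical durable sink: windowed counts appended to Parquet with checkpointing
windowed = (parsed
    .withWatermark("timestamp", "10 minutes")    # bound state so late events are eventually dropped
    .groupBy(F.window("timestamp", "5 minutes"), "event_type")
    .count()
)

file_query = (windowed.writeStream
    .outputMode("append")                        # append mode needs the watermark above
    .format("parquet")
    .option("path", "s3://bucket/streaming/event_counts/")                   # assumed output path
    .option("checkpointLocation", "s3://bucket/checkpoints/event_counts/")   # assumed checkpoint path
    .trigger(processingTime="1 minute")
    .start()
)
file_query.awaitTermination()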
Guidelines
- Use DataFrames rather than RDDs for most work; they are optimized by the Catalyst query optimizer.
- Partitioning is critical for performance. In memory, repartition by a key that spreads work evenly; on disk, partition output by a low-cardinality column such as date, since high-cardinality keys produce huge numbers of small files (see the sketch after this list).
- For managed Spark, consider Databricks (easiest to get started with), AWS EMR, or GCP Dataproc.
- PySpark syntax mirrors pandas, but execution is distributed and lazy; think in columns and whole-DataFrame transformations, not row-by-row loops.
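A short sketch of the partitioning advice above (partition counts and paths are illustrative, not recommendations):
# Control partitioning explicitly
events = spark.read.parquet("s3://bucket/raw/events/")

# In memory: repartition before heavy shuffles; 200 is an arbitrary starting point
events = events.repartition(200, "user_id")

# On disk: partition by a low-cardinality column such as date, not a high-cardinality key
(events
    .withColumn("date", F.to_date("timestamp"))
    .write.mode("overwrite")
    .partitionBy("date")
    .parquet("s3://bucket/processed/events_by_date/"))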