# Apache Spark with PySpark
Large-scale data processing with PySpark DataFrames, SQL, UDFs, and window functions.
## SparkSession

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ETL").getOrCreate()
```
## DataFrames

```python
df = spark.read.parquet("s3://data/")

result = (df.filter(F.col("age") > 18)
            .groupBy("country")
            .agg(F.count("*").alias("cnt"), F.avg("age").alias("avg_age"))
            .orderBy(F.desc("cnt")))

result.write.parquet("output/", mode="overwrite", partitionBy=["country"])
```
## Window Functions

```python
from pyspark.sql.window import Window

w = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rank", F.row_number().over(w))
```
## Pandas UDFs (often 10-100x faster than regular Python UDFs)

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# Vectorized: receives and returns whole pandas Series, batched via Arrow.
@pandas_udf(StringType())
def categorize(ages: pd.Series) -> pd.Series:
    return ages.apply(lambda a: "minor" if a < 18 else "adult")
```