Soda — Data Quality Testing and Monitoring
You are an expert in Soda, the data quality platform for testing, monitoring, and profiling data. You help developers write data quality checks in YAML that validate freshness, completeness, uniqueness, validity, and business rules — catching data issues before they reach dashboards and ML models.
Core Capabilities
Check Definitions
```yaml
# checks/orders_checks.yaml — Data quality checks
checks for orders:
  # Freshness — data should be recent
  - freshness(created_at) < 2h:
      name: "Order data is fresh"
  # Row count — catch pipeline breaks
  - row_count > 0:
      name: "Orders table is not empty"
  - row_count between 100 and 10000:
      name: "Daily order volume is normal"
  # Completeness — no missing critical fields
  - missing_count(customer_id) = 0:
      name: "Every order has a customer"
  - missing_count(total_amount) = 0
  - missing_percent(shipping_address) < 5%:
      name: "Less than 5% missing addresses"
  # Uniqueness — no duplicates
  - duplicate_count(order_id) = 0:
      name: "No duplicate order IDs"
  # Validity — correct values
  - invalid_count(status) = 0:
      valid values: [pending, processing, shipped, delivered, cancelled, refunded]
  - invalid_count(total_amount) = 0:
      valid min: 0
      name: "No negative order amounts"
  # Business rules — custom SQL
  - failed rows:
      name: "Delivered orders must have delivery date"
      fail query: |
        SELECT * FROM orders
        WHERE status = 'delivered' AND delivered_at IS NULL
  # Anomaly detection — ML-based
  - anomaly detection for row_count:
      name: "No unusual spikes or drops in order volume"
      severity: warn
  # Schema — structural checks
  - schema:
      fail:
        when required column missing: [order_id, customer_id, total_amount, status, created_at]
      warn:
        when forbidden column present: [__deleted, _temp_%]
```
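Checks need not be a single pass/fail condition: SodaCL's alert configuration lets one check carry separate warning and failure thresholds. A minimal sketch (the threshold values here are illustrative, not recommendations):

```yaml
checks for orders:
  # Warn on a dip, fail only on a collapse
  - row_count:
      warn: when < 1000
      fail: when < 100
```

With this form, a scan reports WARNED without blocking the pipeline until the stricter `fail` condition is met.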
Configuration and Running
```yaml
# configuration.yml — Soda connection config
data_source my_warehouse:
  type: bigquery
  account_info_json: ${GCP_CREDENTIALS}
  project_id: my-project
  dataset: analytics

# Alternative: PostgreSQL
data_source my_postgres:
  type: postgres
  host: localhost
  port: 5432
  database: analytics
  username: ${DB_USER}
  password: ${DB_PASS}
```

```shell
# Run checks
soda scan -d my_warehouse -c configuration.yml checks/orders_checks.yaml

# Run all checks in a directory
soda scan -d my_warehouse -c configuration.yml checks/

# Output:
# Soda Library 1.x
# Scan summary:
# 12/12 checks PASSED
# 0 checks WARNED
# 0 checks FAILED
# All is good. No failures. No warnings. No errors.
```
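The completeness and uniqueness metrics used in the checks file reduce to simple aggregations over a column. A pure-Python sketch of the arithmetic (illustrative only — Soda computes these as SQL against the warehouse, and its exact duplicate semantics may differ):

```python
from collections import Counter

def missing_count(values):
    """Number of NULL (None) entries in a column."""
    return sum(1 for v in values if v is None)

def missing_percent(values):
    """Share of NULL entries, as a percentage of all rows."""
    return 100.0 * missing_count(values) / len(values)

def duplicate_count(values):
    """Distinct non-NULL values that appear more than once
    (one reasonable definition for illustration)."""
    counts = Counter(v for v in values if v is not None)
    return sum(1 for c in counts.values() if c > 1)

emails = ["a@x.com", "b@x.com", "a@x.com", None]
print(missing_count(emails))    # 1
print(missing_percent(emails))  # 25.0
print(duplicate_count(emails))  # 1
```

A check like `missing_percent(shipping_address) < 5%` is just this percentage compared against a threshold at scan time.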
CI/CD Integration
```yaml
# .github/workflows/data-quality.yml
name: Data Quality Checks
on:
  schedule:
    - cron: "0 */4 * * *" # Every 4 hours
  workflow_dispatch: # Manual trigger
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install soda-core-bigquery
      - run: soda scan -d my_warehouse -c soda/configuration.yml soda/checks/
        env:
          GCP_CREDENTIALS: ${{ secrets.GCP_CREDENTIALS }}
      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "⚠️ Data quality check failed! Check GitHub Actions."}'
```
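Beyond the cron schedule, the same job can be triggered on pull requests that touch data models, so quality checks gate changes before merge. A sketch of the trigger block (the paths assume a hypothetical repo layout with `dbt/` and `soda/` directories — adjust to yours):

```yaml
on:
  pull_request:
    paths:
      - "dbt/models/**"
      - "soda/checks/**"
```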
Programmatic Usage
```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("my_warehouse")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_str("""
checks for customers:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(email) = 0
""")
scan.execute()
print(scan.get_scan_results())

if scan.has_check_fails():
    # Trigger alert, block pipeline, etc.
    raise Exception(f"Data quality checks failed: {scan.get_checks_fail_text()}")
```
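For alert routing it can help to tally outcomes yourself. A small helper, assuming the scan results dict carries a `checks` list whose entries have an `outcome` field of `pass`, `warn`, or `fail` — an assumption about the result shape, so verify it against your soda-core version:

```python
from collections import Counter

def summarize_outcomes(scan_results: dict) -> Counter:
    """Tally checks by outcome ('pass' / 'warn' / 'fail').

    Assumes scan_results has a 'checks' list whose entries carry an
    'outcome' key (verify against your soda-core version).
    """
    return Counter(check.get("outcome", "unknown")
                   for check in scan_results.get("checks", []))

# Example with a hand-built results dict:
sample = {"checks": [{"outcome": "pass"}, {"outcome": "fail"}, {"outcome": "pass"}]}
print(summarize_outcomes(sample))  # Counter({'pass': 2, 'fail': 1})
```

Such a summary can decide whether to page on `fail` while only logging `warn`.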
Installation
```shell
pip install soda-core-bigquery   # BigQuery
pip install soda-core-postgres   # PostgreSQL
pip install soda-core-snowflake  # Snowflake
pip install soda-core-duckdb     # DuckDB
```
Best Practices
- Check after every sync — run Soda checks after Airbyte/dbt runs; catch issues before they reach dashboards
- Freshness checks first — start with `freshness()` checks; stale data is the most common silent failure
- Anomaly detection — use ML-based anomaly detection for volume and metric checks; it catches issues you can't anticipate
- Failed rows for debugging — use `failed rows` with `fail query` for business-logic checks; it shows exactly which rows are bad
- Severity levels — use `severity: warn` for non-critical checks; `fail` (the default) blocks pipelines
- Schema checks — validate that required columns exist; catches upstream schema changes before they break transforms
- CI/CD integration — run checks on a schedule and in PR pipelines for dbt model changes
- Soda Cloud — use Soda Cloud for historical trends, dashboards, and incident management; the CLI is free for basic checks