> data-anonymization

Anonymize datasets for analytics and ML while preserving utility — k-anonymity, l-diversity, and differential privacy. Use when sharing data with third parties, building analytics on user data, or achieving GDPR Article 4 anonymization (data no longer considered personal).

fetch
$curl "https://skillshub.wtf/TerminalSkills/skills/data-anonymization?format=md"
SKILL.mddata-anonymization

Data Anonymization

Overview

Anonymized data is not "personal data" under GDPR (Article 4(1)) and therefore falls outside GDPR's scope — enabling broader data sharing, analytics, and ML without consent requirements. However, anonymization must be irreversible and withstand re-identification attacks.

Critical distinction (GDPR):

  • Anonymization: Irreversible — data can never be linked to an individual. Falls outside GDPR.
  • Pseudonymization: Reversible — data can be re-linked with additional info. Still personal data under GDPR.

Anonymization Techniques

TechniqueDescriptionPrivacy LevelUtility Loss
GeneralizationReplace specific values with ranges (age 34 → 30-40)MediumLow
SuppressionRemove records or fields that are too uniqueMediumMedium
Noise additionAdd random noise to numeric valuesHighLow
Data swappingSwap attribute values between recordsMediumLow
AggregationGroup records and report statisticsHighHigh
k-AnonymityEach record indistinguishable from k-1 othersMediumMedium
Differential PrivacyMathematical guarantee with privacy budget εVery HighVariable

k-Anonymity

Every record must be indistinguishable from at least k-1 other records on quasi-identifiers (attributes that could help identify individuals: age, ZIP, gender, etc.).

import pandas as pd
import numpy as np

def generalize_age(age: int, bucket_size: int = 10) -> str:
    """Generalize age to decade bucket."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"

def generalize_zip(zip_code: str, precision: int = 3) -> str:
    """Generalize ZIP to first N digits."""
    return zip_code[:precision] + '*' * (len(zip_code) - precision)

def apply_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> pd.DataFrame:
    """
    Apply k-anonymity by generalizing quasi-identifiers and suppressing 
    groups smaller than k.
    
    Args:
        df: Input dataframe with sensitive data
        quasi_identifiers: List of column names that are quasi-identifiers
        k: Minimum group size
    
    Returns:
        k-anonymized dataframe
    """
    result = df.copy()
    
    # Generalize quasi-identifiers
    if 'age' in quasi_identifiers:
        result['age'] = result['age'].apply(lambda x: generalize_age(x, 10))
    if 'zip_code' in quasi_identifiers:
        result['zip_code'] = result['zip_code'].apply(lambda x: generalize_zip(x, 3))
    if 'gender' in quasi_identifiers:
        pass  # Gender is already categorical — keep as-is for k-anonymity
    
    # Count group sizes
    group_sizes = result.groupby(quasi_identifiers).size().reset_index(name='count')
    result = result.merge(group_sizes, on=quasi_identifiers)
    
    # Suppress groups smaller than k
    before_count = len(result)
    result = result[result['count'] >= k].drop('count', axis=1)
    after_count = len(result)
    
    suppressed = before_count - after_count
    print(f"k-anonymity (k={k}): suppressed {suppressed}/{before_count} records ({suppressed/before_count*100:.1f}%)")
    
    return result

# Example usage
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank'],
    'age': [34, 28, 45, 51, 33, 29, 46, 52],
    'zip_code': ['10001', '10002', '10001', '10003', '10001', '10002', '10001', '10003'],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Asthma', 'Diabetes', 'Hypertension', 'Diabetes', 'Asthma']
})

# Remove direct identifiers first
df_anon = df.drop('name', axis=1)

# Apply k-anonymity with k=2
quasi_ids = ['age', 'zip_code', 'gender']
result = apply_k_anonymity(df_anon, quasi_ids, k=2)
print(result)

l-Diversity

Extends k-anonymity by requiring each equivalence class to have at least l "well-represented" sensitive attribute values (preventing attribute disclosure attacks).

def check_l_diversity(df: pd.DataFrame, quasi_identifiers: list, 
                       sensitive_attr: str, l: int = 2) -> bool:
    """Check if dataset satisfies l-diversity."""
    groups = df.groupby(quasi_identifiers)[sensitive_attr]
    
    for name, group in groups:
        unique_values = group.nunique()
        if unique_values < l:
            print(f"Group {name} has only {unique_values} unique '{sensitive_attr}' values (need {l})")
            return False
    
    print(f"Dataset satisfies {l}-diversity for '{sensitive_attr}'")
    return True

# Check if our anonymized dataset satisfies 2-diversity
check_l_diversity(result, quasi_ids, 'diagnosis', l=2)

Differential Privacy

Differential privacy (DP) provides a mathematical guarantee: the output of an analysis reveals minimal information about any individual. Controlled by privacy budget ε (epsilon) — smaller ε = stronger privacy but less accurate results.

# Using Google's diffprivlib (Python)
import diffprivlib as dp
import numpy as np

# Generate sample data
np.random.seed(42)
ages = np.random.randint(18, 80, size=1000)
salaries = np.random.normal(65000, 20000, size=1000)

# DP mean — add Laplace noise calibrated to sensitivity / epsilon
epsilon = 1.0  # Privacy budget (lower = more private)

def dp_mean(values: np.ndarray, epsilon: float, sensitivity: float) -> float:
    """Compute differentially private mean using Laplace mechanism."""
    true_mean = np.mean(values)
    # Laplace noise scale = sensitivity / epsilon
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_mean + noise

def dp_count(values: np.ndarray, epsilon: float) -> int:
    """Compute differentially private count."""
    true_count = len(values)
    noise = np.random.laplace(0, 1.0 / epsilon)  # sensitivity = 1 for counting
    return max(0, int(true_count + noise))

# Using diffprivlib tools
dp_mean_age = dp.tools.mean(ages, epsilon=epsilon, bounds=(18, 80))
dp_std_salary = dp.tools.std(salaries, epsilon=epsilon, bounds=(0, 200000))

print(f"True mean age: {np.mean(ages):.1f} | DP mean age: {dp_mean_age:.1f}")
print(f"True std salary: {np.std(salaries):.0f} | DP std salary: {dp_std_salary:.0f}")

# DP histogram
def dp_histogram(values: np.ndarray, bins: list, epsilon: float) -> dict:
    """Compute differentially private histogram."""
    counts, edges = np.histogram(values, bins=bins)
    # Add Laplace noise to each bin count
    noisy_counts = counts + np.random.laplace(0, 1.0/epsilon, size=len(counts))
    noisy_counts = np.maximum(0, noisy_counts).astype(int)  # Clip to non-negative
    
    return {
        f"{int(edges[i])}-{int(edges[i+1])}": int(noisy_counts[i])
        for i in range(len(noisy_counts))
    }

age_histogram = dp_histogram(ages, bins=[18, 30, 40, 50, 60, 70, 80], epsilon=epsilon)
print(f"DP Age distribution: {age_histogram}")

Privacy Budget Management

class PrivacyBudget:
    """Track cumulative privacy budget consumption across queries."""
    
    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent_epsilon = 0.0
        self.query_log = []
    
    def consume(self, epsilon: float, query_name: str) -> bool:
        """Consume epsilon from budget. Returns True if query is allowed."""
        if self.spent_epsilon + epsilon > self.total_epsilon:
            print(f"❌ Budget exhausted. Spent: {self.spent_epsilon}, Requested: {epsilon}, Total: {self.total_epsilon}")
            return False
        
        self.spent_epsilon += epsilon
        self.query_log.append({"query": query_name, "epsilon": epsilon, "cumulative": self.spent_epsilon})
        print(f"✅ {query_name}: ε={epsilon} (cumulative: {self.spent_epsilon:.2f}/{self.total_epsilon})")
        return True
    
    def remaining(self) -> float:
        return self.total_epsilon - self.spent_epsilon

# Usage
budget = PrivacyBudget(total_epsilon=5.0)

if budget.consume(1.0, "mean_age_query"):
    result = dp_mean(ages, epsilon=1.0, sensitivity=62)  # max-min range / n

if budget.consume(2.0, "histogram_salary"):
    hist = dp_histogram(salaries, bins=[0, 30000, 50000, 80000, 200000], epsilon=2.0)

Tools Overview

ToolLanguageStrengths
ARXJava (GUI + API)Full k-anonymity, l-diversity, t-closeness
pycanonPythonCheck k-anonymity, l-diversity, t-closeness
diffprivlibPythonIBM's DP library, sklearn compatible
Google DPC++/Go/Java/PythonProduction-ready, open source
Apple DPSwiftShuffling model DP
OpenDPPython/RustAcademic, expressive DP framework

GDPR Compliance Note

Per GDPR Article 4 and Recital 26, data is considered anonymous (and outside GDPR scope) when:

  1. Singling out: Impossible to isolate one individual
  2. Linkability: Impossible to link records relating to the same individual
  3. Inference: Impossible to deduce information about an individual

k-anonymity alone is often insufficient for GDPR anonymization standard — combine with l-diversity and/or differential privacy, and conduct a re-identification risk assessment.

Compliance Checklist

  • Direct identifiers removed (name, email, SSN, etc.)
  • Quasi-identifiers identified (age, ZIP, gender, etc.)
  • k-anonymity applied (k ≥ 5 recommended)
  • l-diversity verified for sensitive attributes
  • Re-identification risk assessment documented
  • Differential privacy applied for published statistics
  • Privacy budget tracked and documented
  • Data sharing agreement specifies anonymization standard
  • Anonymization process documented and reproducible

┌ stats

installs/wk0
░░░░░░░░░░
github stars21
████░░░░░░
first seenMar 23, 2026
└────────────

┌ repo

TerminalSkills/skills
by TerminalSkills
└────────────