> scikit-learn
Assists with building, evaluating, and deploying machine learning models using scikit-learn. Use when performing data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, or building pipelines for classification, regression, and clustering tasks. Trigger words: sklearn, scikit-learn, machine learning, classification, regression, pipeline, cross-validation.
Overview
Scikit-learn is a Python machine learning library that provides a consistent API for the full ML workflow: data preprocessing (scaling, encoding, imputation), model selection (classification, regression, clustering), hyperparameter tuning (grid search, randomized search), cross-validation, and pipeline construction. It supports serialization via joblib for production deployment.
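The consistent API mentioned above can be shown with a minimal sketch: every estimator exposes `fit()` and a task-appropriate `predict()`/`score()`. This example uses the bundled iris dataset so no external data is assumed.

```python
# Minimal sketch of scikit-learn's consistent estimator API,
# using the bundled iris dataset (no external data needed).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)             # every estimator exposes fit()
accuracy = clf.score(X_test, y_test)  # and a task-appropriate score()
print(f"test accuracy: {accuracy:.3f}")
```

The same `fit`/`predict` pattern carries over to regressors, clusterers, and transformers, which is what makes pipelines composable.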
Instructions
- When preprocessing data, use `ColumnTransformer` to apply different transformers to numeric and categorical columns (`StandardScaler`, `OneHotEncoder`, `SimpleImputer`), always within a `Pipeline` to prevent data leakage.
- When choosing models, start with fast baselines (`LogisticRegression`, `RandomForestClassifier`) and use `HistGradientBoostingClassifier` for the best tabular performance, since it handles missing values natively and is faster than `GradientBoostingClassifier`.
- When evaluating, use `cross_val_score` with 5-fold CV instead of a single train/test split, and use `classification_report()` instead of accuracy alone, since accuracy is misleading on imbalanced datasets.
- When tuning hyperparameters, use `RandomizedSearchCV` when the search space exceeds 100 combinations (faster than exhaustive `GridSearchCV`), and use `StratifiedKFold` or `TimeSeriesSplit` as appropriate.
- When building pipelines, chain preprocessing and model steps with `Pipeline` to ensure transformers fit only on training data, then serialize the full pipeline with `joblib.dump()` for deployment.
- When selecting features, use `permutation_importance()` for model-agnostic measurement, `SelectKBest` for statistical filtering, or `feature_importances_` from tree-based models.
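The leakage-safe preprocessing pattern above can be sketched as follows. The column names (`age`, `income`, `plan`) and the tiny DataFrame are illustrative assumptions, not part of any real dataset.

```python
# Sketch: ColumnTransformer inside a Pipeline, so transformers are
# fit only on training data (no leakage). Column names are assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]
categorical_features = ["plan"]

preprocess = ColumnTransformer([
    # Numeric columns: impute missing values, then scale.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Categorical columns: one-hot encode, tolerating unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data with missing values to exercise the imputer.
df = pd.DataFrame({
    "age": [25, 40, None, 31],
    "income": [40_000, 85_000, 52_000, None],
    "plan": ["basic", "pro", "basic", "pro"],
})
y = [0, 1, 0, 1]
model.fit(df, y)
print(model.predict(df))
```

Because preprocessing lives inside the `Pipeline`, calling `cross_val_score(model, df, y)` would refit the imputer and scaler on each training fold only.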
Examples
Example 1: Build a customer churn prediction pipeline
User request: "Create a model to predict which customers will churn"
Actions:
- Build a `ColumnTransformer` with `StandardScaler` for numeric features and `OneHotEncoder` for categorical features
- Create a `Pipeline` with the transformer and `HistGradientBoostingClassifier`
- Tune hyperparameters with `RandomizedSearchCV` using `StratifiedKFold`
- Evaluate with `classification_report()`, focusing on recall for the churn class
Output: A tuned churn prediction pipeline with preprocessing, model, and evaluation metrics.
Example 2: Cluster customers into segments
User request: "Segment customers based on purchasing behavior"
Actions:
- Preprocess features with `StandardScaler` in a pipeline
- Use `KMeans` with silhouette score analysis to determine the optimal cluster count
- Run `PCA` for dimensionality reduction and visualization
- Profile clusters with `groupby` on original features to interpret segments
Output: Customer segments with labeled profiles and a visual cluster map.
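The scale-then-cluster-then-profile steps above can be sketched on synthetic data; the feature names (`orders_per_month`, `avg_order_value`) and the candidate range for k are illustrative assumptions.

```python
# Sketch: scale features, pick k by silhouette score, profile segments.
# Feature names and the k range (2-5) are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "orders_per_month": rng.gamma(2.0, 2.0, 300),
    "avg_order_value": rng.normal(60, 20, 300),
})
X = StandardScaler().fit_transform(df)

# Try several k and keep the one with the best silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

df["segment"] = KMeans(
    n_clusters=best_k, n_init=10, random_state=0
).fit_predict(X)
profile = df.groupby("segment").mean()  # interpret on the original scale
print(profile)
```

Profiling with `groupby` on the unscaled columns keeps the segment summaries in business units (orders, currency) rather than standardized z-scores.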
Guidelines
- Always use `Pipeline` to prevent data leakage by fitting transformers only on training data.
- Use `ColumnTransformer` for mixed data types: numeric scaling and categorical encoding in one object.
- Use `HistGradientBoostingClassifier` over `GradientBoostingClassifier`, since it is faster and handles missing values natively.
- Use `cross_val_score` with 5-fold CV rather than a single train/test split, since single splits are noisy.
- Use `RandomizedSearchCV` when the search space exceeds 100 combinations.
- Use `classification_report()`, not just accuracy, which is misleading on imbalanced datasets.
- Serialize the full pipeline with `joblib`, not just the model, since deployment needs preprocessing too.
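The last guideline can be sketched as a round trip: dump the whole pipeline (scaler and model together) and verify the restored copy predicts identically. The file path here is a throwaway temp-directory assumption.

```python
# Sketch: serialize the full pipeline with joblib so deployment
# reproduces the exact preprocessing. Temp path is an assumption.
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model_pipeline.joblib")
joblib.dump(pipe, path)   # persists scaler AND model together

restored = joblib.load(path)
same = bool((restored.predict(X) == pipe.predict(X)).all())
print("round-trip predictions match:", same)
```

Dumping only `pipe[-1]` (the bare classifier) would silently drop the fitted scaler, which is exactly the deployment bug this guideline warns about.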
> related_skills --same-repo
> zustand
You are an expert in Zustand, the small, fast, and scalable state management library for React. You help developers manage global state without boilerplate using Zustand's hook-based stores, selectors for performance, middleware (persist, devtools, immer), computed values, and async actions — replacing Redux complexity with a simple, un-opinionated API in under 1KB.
> zod
You are an expert in Zod, the TypeScript-first schema declaration and validation library. You help developers define schemas that validate data at runtime AND infer TypeScript types at compile time — eliminating the need to write types and validators separately. Used for API input validation, form validation, environment variables, config files, and any data boundary.
> xero-accounting
Integrate with the Xero accounting API to sync invoices, expenses, bank transactions, and contacts — and generate financial reports like P&L and balance sheet. Use when: connecting apps to Xero, automating bookkeeping workflows, syncing accounting data, or pulling financial reports programmatically.
> windsurf-rules
Configure Windsurf AI coding assistant with .windsurfrules and workspace rules. Use when: customizing Windsurf for a project, setting AI coding standards, creating team-shared Windsurf configurations, or tuning Cascade AI behavior.