Machine Learning Algorithms Explained: An End-to-End Pipeline Guide
A comprehensive guide to machine learning models in scikit-learn, with beginner-friendly explanations and analogies.
1. Supervised Learning
Concept: We know the target labels – like having the answer key while practicing.
1.1 Linear Models (with regularization options)
from sklearn.linear_model import (
LinearRegression, Ridge, Lasso, ElasticNet,
LogisticRegression, SGDClassifier, SGDRegressor,
Perceptron, PassiveAggressiveClassifier,
BayesianRidge, ARDRegression
)
LinearRegression()
Ridge()
Lasso()
ElasticNet()
LogisticRegression()
SGDClassifier()
SGDRegressor()
Perceptron()
PassiveAggressiveClassifier()
BayesianRidge()
ARDRegression()
📦 Key Concepts:
- Classification → predict categories (e.g., spam vs. not spam)
- Regression → predict numbers (e.g., house prices)
- Regularization → Ridge, Lasso, ElasticNet prevent overfitting
💡 Use Cases:
- Predicting house prices (LinearRegression, Ridge, Lasso)
- Email spam detection (LogisticRegression)
- Online learning with streaming data (SGDClassifier, PassiveAggressiveClassifier)
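For example, here is a minimal sketch (using a synthetic dataset from make_regression; the alpha values are arbitrary illustrative choices) of how the regularized variants compare to plain LinearRegression:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression vs. two regularized variants
for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, "R^2:", round(model.score(X_test, y_test), 3))

The alpha parameter controls the strength of the regularization penalty: larger values shrink the coefficients more aggressively.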
1.2 Trees & Ensembles
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
RandomForestClassifier, RandomForestRegressor,
ExtraTreesClassifier, ExtraTreesRegressor,
GradientBoostingClassifier, GradientBoostingRegressor,
HistGradientBoostingClassifier, HistGradientBoostingRegressor,
AdaBoostClassifier, AdaBoostRegressor,
BaggingClassifier, BaggingRegressor
)
What are Trees?
Analogy: Think of a flowchart for making decisions. Like when choosing what to wear: “Is it raining? → If yes, take umbrella → If no, check temperature → If cold, wear jacket…”
A Decision Tree asks a series of yes/no questions about your data to reach a final decision. Each question splits the data into smaller groups until you get a clear answer.
What are Ensembles?
Analogy: Instead of asking one expert for advice, you ask multiple experts and take the majority vote. Like asking 10 friends “Should I watch this movie?” and going with what most of them say.
Ensembles combine many decision trees together:
- Random Forest = Ask 100 different tree “experts,” each looking at random parts of the data
  - Training: All trees work in parallel (like 100 chefs cooking different dishes simultaneously)
  - Fast training because trees don’t wait for each other
- Gradient Boosting = Trees learn from each other’s mistakes, getting better step by step
  - Training: Trees work sequentially (like students in a relay race, each learning from the previous one)
  - Slower training but often more accurate because each tree fixes the previous tree’s errors
📦 Key Concepts:
- Work for both classification & regression
- No explicit regularization term; overfitting is controlled with parameters such as max_depth and n_estimators
💡 Use Cases:
- Customer churn prediction (RandomForestClassifier)
- Sales forecasting (GradientBoostingRegressor)
- Fast, interpretable rules (DecisionTreeClassifier)
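To make the parallel vs. sequential idea concrete, here is a minimal sketch on a toy dataset (make_classification; the estimator counts and depths are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging-style ensemble: trees are trained independently (in parallel)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# Boosting-style ensemble: trees are trained sequentially, each correcting the last
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

for name, model in [("RandomForest", rf), ("GradientBoosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "accuracy:", round(scores.mean(), 3))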
1.3 Support Vector Machines
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
What is SVM?
Analogy: Imagine you’re in the school playground during recess. The basketball kids always hang out on one side, and the soccer kids hang out on the other side.
Your job is to draw a chalk line down the middle to separate them. But you want to make the widest possible gap so neither group accidentally crosses into the other’s territory. You find the safest spot where there’s the most space between the closest basketball kid and the closest soccer kid.
That’s what SVM does - it finds the best line with the biggest safety zone on both sides!
Why is SVM Special?
Analogy: Sometimes the playground is messy - basketball kids and soccer kids are all mixed up, and you can’t draw a straight line to separate them.
But SVM has a magic trick! It’s like climbing up to the school roof to look down. From up high, you can suddenly see a clear way to separate the groups that you couldn’t see from ground level.
SVM can “look” at your data from different angles (called “kernels”) to find patterns that aren’t obvious at first glance.
Simple Examples:
- Straight line separation → Like dividing boys and girls in class with a straight line
- Curved separation → Like drawing a circle around all the 7th graders vs 8th graders
📦 Key Concepts:
- SVC/LinearSVC → Sorting things into groups (cats vs dogs, spam vs real emails)
- SVR/LinearSVR → Predicting numbers (like guessing someone’s height based on their shoe size)
- Parameter C → How strict you want to be (strict teacher vs lenient teacher)
💡 Use Cases:
- Recognizing handwritten numbers → Telling apart 0, 1, 2, 3, etc.
- Email spam detection → Separating junk mail from important emails
- Photo recognition → Identifying cats vs dogs in pictures
- Medical diagnosis → Helping doctors spot diseases from test results
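A minimal sketch of the “different angles” idea on toy data (make_moons; the kernels and C value are illustrative choices). SVMs are sensitive to feature scale, so a scaler is included:

from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

for kernel in ["linear", "rbf"]:
    # C controls how "strict" the margin is; the kernel chooses the viewing angle
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(model, X, y, cv=5)
    print(kernel, "accuracy:", round(scores.mean(), 3))

On this kind of data the RBF (curved) kernel usually outperforms the straight-line kernel, which is exactly the “climb to the roof” trick described above.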
1.4 Nearest Neighbors
from sklearn.neighbors import (
KNeighborsClassifier, KNeighborsRegressor,
RadiusNeighborsClassifier, RadiusNeighborsRegressor
)
What is Nearest Neighbors? Analogy: Imagine you’re new to school and want to know what’s cool to wear. You look at the 5 kids sitting closest to you and see what they’re wearing. If 3 out of 5 are wearing sneakers, you decide sneakers are the way to go!
That’s exactly how k-Nearest Neighbors works - it looks at the k closest examples in your data and makes decisions based on what the majority of neighbors are doing.
How Does It Work? Two Simple Steps:
- Find the closest neighbors → Like finding the 5 students with the most similar test scores to yours
- Copy what they do → If most got an A, predict you’ll get an A too
Key Differences:
- KNeighbors → You choose exactly k neighbors (like always asking exactly 5 friends)
- RadiusNeighbors → You choose everyone within a distance (like asking all friends within 10 seats of you)
📦 Key Concepts:
- Classification → What category? (Will this student like pizza or burgers?)
- Regression → What number? (How tall will this student be?)
- No training needed → Just memorizes all examples (like keeping a phone book)
- Parameter k → How many neighbors to ask (more neighbors = safer but less specific)
💡 Real-World Use Cases:
- Movie recommendations → “People who liked Spider-Man also liked Batman”
- Housing prices → “Houses in your neighborhood sold for $300k, so yours might too”
- Medical diagnosis → “Patients with similar symptoms usually have this condition”
- E-commerce → “Customers who bought this also bought that”
- Image recognition → “This photo looks most like pictures labeled ‘cat’”
- Fraud detection → “This transaction looks like previous fraudulent ones”
🎯 When to Use:
- Small datasets → Fast and simple
- Irregular patterns → No assumptions about data shape
- Local patterns matter → Nearby examples are most important
- Quick prototyping → No training time needed
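A minimal sketch (iris dataset; the k values are arbitrary) showing how changing k changes how many “neighbors you ask”:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Classic iris dataset: predict flower species from four measurements
X, y = load_iris(return_X_y=True)

for k in [1, 5, 15]:
    # k = number of neighbors consulted before "voting"
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: accuracy {scores.mean():.3f}")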
1.5 Naive Bayes
from sklearn.naive_bayes import (
GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB, ComplementNB
)
What is Naive Bayes?
Analogy: Imagine you’re a detective trying to guess if someone is a basketball player. You notice they’re tall, wearing sneakers, and have a sports bag. You think:
- “90% of tall people I know play basketball”
- “80% of people wearing sneakers play sports”
- “70% of people with sports bags play basketball”
Naive Bayes combines all these probabilities to make a smart guess. It’s called “naive” because it assumes each clue is independent (height doesn’t affect shoe choice), which isn’t always true, but it works surprisingly well!
How Does It Work?
Simple Steps:
- Learn patterns → “In spam emails, ‘FREE’ appears 80% of the time”
- Calculate probabilities → “This email has ‘FREE’ and ‘URGENT’, what are the chances it’s spam?”
- Pick the most likely → “95% chance spam, 5% chance normal email → It’s spam!”
Different Types:
- GaussianNB → For normal numbers (height, weight, temperature)
- MultinomialNB → For counting things (word frequency in emails)
- BernoulliNB → For yes/no features (does email contain “FREE”?)
- CategoricalNB → For categories (color: red/blue/green)
- ComplementNB → Special version of Multinomial for imbalanced data
📦 Key Concepts:
- Classification only → Sorting things into categories
- Probabilistic → Gives you confidence scores (“85% sure this is spam”)
- Super fast → Trains in seconds, predicts instantly
- Handles incomplete evidence in theory → probabilities can be combined from whichever clues are available (note that scikit-learn’s implementations still expect complete input data)
- Works with small data → Doesn’t need thousands of examples
💡 Real-World Use Cases:
- Email spam filtering → Classic use case, still widely used today
- Text sentiment analysis → “Is this movie review positive or negative?”
- News categorization → Sports, Politics, Entertainment, etc.
- Medical diagnosis → Based on symptoms and test results
- Real-time recommendations → Fast enough for live website suggestions
- Social media monitoring → Classify tweets as positive/negative/neutral
- Document classification → Sort legal documents, research papers
- Language detection → “Is this text English, Spanish, or French?”
🎯 Why Choose Naive Bayes:
- Lightning fast → Train and predict in milliseconds
- Simple to understand → Easy to explain to non-technical people
- Works with small data → Good baseline even with limited examples
- Handles text beautifully → Excellent for any word-based analysis
- Probability scores → Not just predictions, but confidence levels
- Robust baseline → Often surprisingly competitive with complex models
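A minimal text-classification sketch; the tiny spam/not-spam messages below are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = spam, 0 = not spam
texts = [
    "FREE prize claim now", "URGENT free offer", "meeting at noon",
    "lunch tomorrow?", "WIN cash FREE entry", "project update attached",
]
labels = [1, 1, 0, 0, 1, 0]

# CountVectorizer turns text into word counts; MultinomialNB models those counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free cash offer"]))                 # likely predicted as spam
print(model.predict_proba(["see you at the meeting"]))    # confidence scores per class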
1.6 Gaussian Processes & Neural Nets
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
What are These Advanced Models?
Neural Networks (MLP)
Analogy: Think of your brain making a decision. You have layers of neurons - some take in information (eyes see a dog), middle layers process it (recognize fur, tail, ears), and final layers decide (yep, it’s a dog!).
Multi-Layer Perceptron (MLP) works the same way:
- Input layer → Takes your data (age, income, etc.)
- Hidden layers → Like brain neurons, find complex patterns
- Output layer → Makes final decision or prediction
Why Neural Networks Are Special:
- Can learn ANY pattern → Like a super-flexible curve that bends to fit your data perfectly
- Deep learning lite → Baby version of the AI that powers ChatGPT and image recognition
- Self-adjusting → Automatically finds the best way to combine features
Gaussian Processes
Analogy: Imagine you’re drawing a map of unknown territory. You have a few known points, but need to guess what’s between them. Gaussian Processes draw the smoothest possible lines between known points, plus give you confidence intervals (how sure you are about each guess).
Why Gaussian Processes Are Special:
- Uncertainty quantification → Tells you “I’m 90% confident the answer is between 5-7”
- Works with tiny datasets → Can make good guesses with just 10-20 examples
- No overfitting → Built-in protection against memorizing noise
📦 Key Concepts:
- Both classification & regression → Sort categories or predict numbers
- Non-linear patterns → Can handle curves, waves, and complex relationships
- Neural Net regularization → the alpha parameter adds an L2 penalty that prevents overfitting
- Probabilistic predictions → Gaussian Processes give confidence scores
💡 Real-World Use Cases:
Neural Networks (MLP):
- Medical diagnosis → Combining symptoms, test results, and patient history
- Financial prediction → Stock prices, credit scoring, fraud detection
- Marketing optimization → Customer lifetime value, ad targeting
- Engineering → Quality control, sensor data analysis
- Gaming AI → Simple game agents and decision-making
- Pattern recognition → Basic image/audio classification
Gaussian Processes:
- Scientific experiments → Optimize experiments with minimal trials
- Robotics → Safe path planning with uncertainty
- Environmental modeling → Climate data with sparse measurements
- Drug discovery → Molecular property prediction with limited data
- A/B testing → Website optimization with uncertainty bounds
- Sensor networks → Interpolate readings between sensor locations
🎯 When to Use:
Choose Neural Networks (MLP) when:
- Complex, non-linear patterns → Data has weird curves and interactions
- Medium-to-large datasets → You have thousands of examples
- High accuracy needed → Willing to trade interpretability for performance
- Multiple features interact → Features depend on each other in complex ways
Choose Gaussian Processes when:
- Small datasets → You only have 10-100 examples
- Uncertainty matters → Need to know “how sure” you are
- Smooth functions → Expecting gradual changes, not sudden jumps
- Scientific applications → Need mathematically principled uncertainty
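A minimal sketch contrasting the two on a small synthetic 1-D problem (the layer sizes, alpha, and default GP kernel are illustrative choices). Note how the Gaussian Process returns an uncertainty estimate alongside its prediction:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# A small, noisy 1-D problem (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# MLP: flexible function approximator; alpha is its L2 regularization strength
mlp = MLPRegressor(hidden_layer_sizes=(50, 50), alpha=1e-3, max_iter=5000, random_state=42)
mlp.fit(X, y)

# GP: returns a prediction plus a standard deviation ("how sure am I?")
gp = GaussianProcessRegressor(random_state=42)
gp.fit(X, y)

X_new = np.array([[2.5]])
mean, std = gp.predict(X_new, return_std=True)
print("MLP prediction:", mlp.predict(X_new))
print(f"GP prediction: {mean[0]:.2f} ± {std[0]:.2f}")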
2. Unsupervised Learning
👉 Analogy: Imagine you walk into a new city where you don’t know the neighborhoods. Nobody tells you which area is “rich” or “student-friendly” — you just explore and find patterns yourself. That’s what unsupervised learning does: it groups or simplifies data without labels.
2.1 Clustering
from sklearn.cluster import (
KMeans, MiniBatchKMeans, DBSCAN, OPTICS,
AgglomerativeClustering, SpectralClustering,
Birch, MeanShift
)
📦 Analogy: Imagine you walk into a grocery store and observe shoppers. You notice some people always buy fruits, some buy only snacks, and some buy vegetables. Without anyone telling you, you can cluster shoppers into groups based on their behavior.
📊 Common Algorithms:
- KMeans: Like putting customers into k shopping baskets (you decide k beforehand)
- DBSCAN: Like spotting groups of friends at a party — dense groups are clusters, lonely people are “outliers”
- AgglomerativeClustering: Like building a family tree of similarities, merging closest relatives step by step
💡 Use Cases:
- Group customers by shopping behavior (KMeans)
- Fraud detection and anomaly identification (DBSCAN)
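A minimal sketch on synthetic data (make_blobs; the eps value and cluster count are illustrative): KMeans needs k up front, while DBSCAN discovers dense groups itself and labels lonely points as -1:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans: you choose k beforehand
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN: finds dense groups on its own and marks outliers as -1
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print("KMeans cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN labels found:", set(dbscan_labels))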
2.2 Dimensionality Reduction
from sklearn.decomposition import (
PCA, IncrementalPCA, KernelPCA, TruncatedSVD,
NMF, FactorAnalysis, FastICA
)
📦 Analogy: Imagine you have a huge recipe book written in many languages. Instead of keeping all words, you summarize into main themes: “sweet,” “spicy,” “vegan.” You keep fewer dimensions but still capture the essential meaning.
📊 Common Algorithms:
- PCA: Like compressing a photo — fewer pixels, but you still recognize the picture
- NMF: Like summarizing movie reviews into topics: “acting,” “visuals,” “story”
- FactorAnalysis: Like finding hidden factors behind test scores (math ability, reading skills, memory)
💡 Use Cases:
- Reduce image dimensions before classification (PCA)
- Topic discovery in text documents (NMF)
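A minimal PCA sketch on the built-in digits dataset (the 90% variance threshold is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images = 64 features per sample
X, _ = load_digits(return_X_y=True)

# Keep enough components to explain roughly 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance explained:", round(pca.explained_variance_ratio_.sum(), 3))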
2.3 Manifold Learning
📦 Analogy: Imagine a crumpled sheet of paper. In 3-D it looks complicated, but once you unfold it you see it was really just a flat 2-D sheet. Manifold learning “unfolds” high-dimensional data to reveal the simpler shape hiding inside it (scikit-learn’s sklearn.manifold module, e.g. TSNE, Isomap, LocallyLinearEmbedding).
📊 Algorithms & Uses:
- TSNE: Like arranging photos on a table so that similar ones end up next to each other (great for 2-D visualization)
- Isomap: Like measuring distance by walking along the paper’s surface instead of cutting straight through the air
- LocallyLinearEmbedding: Like reconstructing the sheet from many small, locally flat patches
💡 Use cases:
- Visualizing high-dimensional data in 2-D (TSNE)
- Nonlinear dimensionality reduction when PCA’s straight-line projections aren’t enough (Isomap)
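A minimal sketch using TSNE on the digits dataset (default settings; intended only to show the shape of the output, which you would normally scatter-plot):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional images into 2 dimensions for visualization
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

print("Embedded shape:", X_2d.shape)  # (n_samples, 2)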
2.4 Density Estimation
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.neighbors import KernelDensity
📦 Analogy: Imagine you’re at a party with no seating chart. People naturally form groups: some around the food, some near the music. Density estimation finds these dense spots and describes them mathematically.
📊 Common Algorithms:
- GaussianMixture: Like assuming guests belong to overlapping groups (music lovers, food lovers, social butterflies)
- KernelDensity: Like putting a flashlight on the floor — brighter spots indicate more people (higher density)
💡 Use Cases:
- Anomaly detection in sensor data (GaussianMixture)
- Estimate probability distributions of house prices (KernelDensity)
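A minimal sketch on synthetic “party” data (the group locations and bandwidth are made up): the fitted models score how likely a point is, which is the basis of density-based anomaly detection:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

# Two dense "groups at the party" (illustrative only)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# GaussianMixture: assume overlapping Gaussian groups
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
print("Log-likelihood of a typical point:", gmm.score_samples([[0, 0]]))
print("Log-likelihood of an outlier:", gmm.score_samples([[20, 20]]))

# KernelDensity: the "flashlight" view of where the data is dense
kde = KernelDensity(bandwidth=0.5).fit(X)
print("KDE log-density at (0, 0):", kde.score_samples([[0, 0]]))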
2.5 Unsupervised Learning Summary
- Clustering → “Grouping people without name tags”
- Dimensionality Reduction → “Summarizing a big book into key themes”
- Manifold Learning → “Unfolding crumpled paper to see the true shape”
- Density Estimation → “Finding hotspots at a party”
3. Data Preprocessing
📦 Concept: Like “kitchen prep” before cooking — you clean, cut, and season ingredients so the final dish (model) works perfectly.
# Common preprocessing imports
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest
3.1 Scaling & Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Analogy: Imagine comparing athletes: one runs in seconds, one jumps in meters, one lifts in kilograms. To compare them fairly, you rescale everything to the same units.
What it does:
- StandardScaler: Makes features have mean=0, std=1 (like z-scores)
- MinMaxScaler: Rescales features to the [0,1] range
💡 Use Case:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["age", "income"]])
Normalizes age & income so neither dominates the model.
3.2 Encoding & Imputation
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
Analogy:
- Encoding: Like ordering coffee in another country — you point at icons instead of words. Categorical values become numbers.
- Imputation: If someone skips a survey question, you fill in the blank with the most common or average answer.
What it does:
- OneHotEncoder: Turns categories into binary flags (red → [1,0,0], blue → [0,1,0], green → [0,0,1])
- SimpleImputer: Fills missing values with the mean, median, or most frequent value
💡 Use Case:
# Encode categorical data
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]])
# Fill missing values
imputer = SimpleImputer(strategy="mean")
df["age"] = imputer.fit_transform(df[["age"]])
3.3 Feature Generation & Selection
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest
Analogy:
- Feature generation: Like adding “secret ingredients” — flour and water can create bread, a completely new feature
- Feature selection: Like a talent show — only the best performers make it to the final
📊 What it does:
- PolynomialFeatures: Creates squared/cubic terms (e.g., area = length²)
- SelectKBest: Keeps only the most useful features based on statistical tests
💡 Use Case:
# Generate polynomial features
poly = PolynomialFeatures(degree=2)
df_poly = poly.fit_transform(df[["size"]])
# Select best features
selector = SelectKBest(k=5)
df_selected = selector.fit_transform(df.drop("target", axis=1), df["target"])
4. Pipelines & Workflows
4.1 Pipeline & ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
Restaurant Kitchen Analogy:
Think of a restaurant kitchen:
- Vegetables (numeric features like age, income) → washing + chopping station (scaling)
- Spices (categorical features like color, city) → grinding station (encoding)
- Finally, all ingredients are combined and sent to the main chef (the model)
The ColumnTransformer is the kitchen manager: it decides which ingredient goes to which station.
The Pipeline is the conveyor belt that connects preprocessing → model.
🔧 Pipeline Workflow: Building Your ML Assembly Line
🏭 Assembly Line Analogy: Think of a car factory assembly line:
- Station 1: Paint numeric parts (scaling)
- Station 2: Assemble categorical parts (encoding)
- Station 3: Quality control (the ML model)
- Conveyor belt: Pipeline that moves everything smoothly from start to finish
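Putting the kitchen/assembly-line analogy into code, here is a minimal sketch (the column names age, income, color, city and the training data are assumptions for illustration):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical column names: adjust to your own data
numeric_features = ["age", "income"]
categorical_features = ["color", "city"]

# The "kitchen manager": routes each column to the right station
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# The "conveyor belt": preprocessing then the model, bundled as one object
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# pipe.fit(X_train, y_train); pipe.predict(X_new)  # assumes you already have these splits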
4.2 End-to-End Machine Learning Pipeline
Step 1. Data Loading & Splitting
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# imblearn's Pipeline is needed so resampling steps (SMOTE, undersampling) can live inside a pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import joblib
from copy import deepcopy
# Load and explore data
df = pd.read_csv("customer_data.csv")
print(f"Data shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts()}")
# Stratified split (keeps same proportion of target classes)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Identify feature types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")
Step 2. Preprocessing Setup
# Create preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine preprocessing steps
base_preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
Step 3. Base Model Configuration
# Base models to experiment with
base_models = {
    'LogisticRegression': LogisticRegression(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42, n_estimators=100),
}
Step 4. Strategy Definition
# Create multiple strategies (preprocessing + resampling + model variations)
strategies = {}
for model_name, model in base_models.items():
    # Strategy 1: Basic preprocessing + model
    strategies[f"{model_name}_basic"] = Pipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 2: Preprocessing + undersampling + model
    # (resampling steps require imblearn's Pipeline, imported above as ImbPipeline)
    strategies[f"{model_name}_undersample"] = ImbPipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('undersampler', RandomUnderSampler(random_state=42)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 3: Preprocessing + SMOTE oversampling + model
    strategies[f"{model_name}_smote"] = ImbPipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('oversampler', SMOTE(random_state=42)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 4: Class weight balanced
    if hasattr(model, 'class_weight'):
        balanced_model = deepcopy(model)
        balanced_model.class_weight = 'balanced'
        strategies[f"{model_name}_balanced"] = Pipeline([
            ('preprocessor', deepcopy(base_preprocessor)),
            ('classifier', balanced_model)
        ])
Step 5. Cross-Validation Comparison
# Compare all strategies using cross-validation
cv_scores = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Cross-validation results:")
print("-" * 50)
for strategy_name, pipeline in strategies.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
    cv_scores[strategy_name] = {
        'mean_score': scores.mean(),
        'std_score': scores.std(),
        'scores': scores
    }
    print(f"{strategy_name:25} | ROC-AUC: {scores.mean():.4f} (±{scores.std():.4f})")
# Rank strategies by performance
ranked_strategies = sorted(cv_scores.items(), key=lambda x: x[1]['mean_score'], reverse=True)
best_strategy_name = ranked_strategies[0][0]
print(f"\nBest strategy: {best_strategy_name}")
Step 6. Final Model Training 🏆
# Train the best strategy on full training data
best_pipeline = strategies[best_strategy_name]
best_pipeline.fit(X_train, y_train)
print(f"Final model trained: {best_strategy_name}")
Step 7. Test Set Evaluation 📊
# Evaluate on held-out test set
y_pred = best_pipeline.predict(X_test)
y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]
# Comprehensive evaluation
test_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test Set ROC-AUC: {test_auc:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Step 8. Model Persistence 💾
# Save complete pipeline for deployment
model_filename = f"best_model_{best_strategy_name}.pkl"
joblib.dump(best_pipeline, model_filename)
# Save preprocessor separately for flexibility
preprocessor_filename = "preprocessor.pkl"
joblib.dump(base_preprocessor.fit(X_train), preprocessor_filename)
print(f"Model saved: {model_filename}")
print(f"Preprocessor saved: {preprocessor_filename}")
Step 9. Deployment Demo 🚀
# Load saved model
loaded_pipeline = joblib.load(model_filename)

# Production-ready prediction function
def predict_customer_churn(customer_data):
    """
    Predict customer churn probability

    Args:
        customer_data: dict or DataFrame with customer features

    Returns:
        dict with prediction and probability
    """
    if isinstance(customer_data, dict):
        customer_data = pd.DataFrame([customer_data])
    prediction = loaded_pipeline.predict(customer_data)[0]
    probability = loaded_pipeline.predict_proba(customer_data)[0, 1]
    return {
        'prediction': 'Churn' if prediction == 1 else 'No Churn',
        'churn_probability': probability,
        'confidence': 'High' if abs(probability - 0.5) > 0.3 else 'Low'
    }

# Example usage
sample_customer = {
    'age': 35,
    'income': 65000,
    'gender': 'Female',
    'city': 'Chicago'
}
result = predict_customer_churn(sample_customer)
print(f"Prediction: {result}")

# Batch prediction example
new_customers = pd.DataFrame([
    {'age': 28, 'income': 45000, 'gender': 'Male', 'city': 'NYC'},
    {'age': 42, 'income': 85000, 'gender': 'Female', 'city': 'Boston'}
])
batch_predictions = loaded_pipeline.predict_proba(new_customers)[:, 1]
print(f"Batch churn probabilities: {batch_predictions}")
The ColumnTransformer is the factory manager that assigns work to different stations.
The Pipeline is the assembly line that connects all processing steps seamlessly.
4.3 Saving & Loading Models
Option 1: Save the Complete Pipeline (Recommended)
Since the Pipeline already contains both preprocessing steps and the model, you don’t need to save them separately. This makes deployment super easy.
import joblib
# Save the complete pipeline (preprocessor + model)
joblib.dump(pipe, "model_pipeline.pkl")
# Later: load it back
loaded_pipe = joblib.load("model_pipeline.pkl")
# Use directly for prediction
predictions = loaded_pipe.predict(new_data)
📦 Why save the complete pipeline?
- Deployment ready: Everything bundled together for production use
- No rebuild needed: Just load and predict on new data
- Consistency guaranteed: Same preprocessing applied automatically
💡 Use Case:
A company builds a customer churn prediction model. Saving the complete pipeline means you can load it next month and immediately call .predict() on new customers.
Option 2: Save Only the Preprocessor
Sometimes you want to reuse the same preprocessing steps for different models or experiments.
# Save only the preprocessor
joblib.dump(preprocessor, "preprocessor.pkl")
# Load it back
loaded_preprocessor = joblib.load("preprocessor.pkl")
# Transform new data with the same steps
X_transformed = loaded_preprocessor.transform(new_data)
📦 Why save only the preprocessor?
- Experimentation: Try multiple models with identical preprocessing
- Modular approach: Separate data preparation from model training
- Flexibility: Mix and match preprocessors with different algorithms
💡 Use Case:
You preprocess survey data (scale ages, encode gender). Save the preprocessor once, then train both RandomForest and GradientBoosting models using the exact same transformed features.
🍽️ Recipe Analogy:
- Complete Pipeline = Save the recipe + cooking steps + final dish → serve immediately anytime
- Preprocessor Only = Save the ingredient prep steps → use for different dishes (different models)
🎯 Best Practice:
- For deployment/production → Save the complete pipeline
- For experimentation → Save the preprocessor separately
5. Jupyter Notebook Example
Link to the Jupyter Notebook
https://github.com/linsnotes/ml-notebooks/blob/main/ml_pipeline_imbalance.ipynb