Machine Learning Algorithms Explained: An End-to-End Pipeline Guide
A comprehensive guide to machine learning models in scikit-learn, with beginner-friendly explanations and analogies.
1. Supervised Learning
Concept: We know the target labels – like having the answer key while practicing.
1.1 Linear Models (with regularization options)
from sklearn.linear_model import (
LinearRegression, Ridge, Lasso, ElasticNet,
LogisticRegression, SGDClassifier, SGDRegressor,
Perceptron, PassiveAggressiveClassifier,
BayesianRidge, ARDRegression
)
LinearRegression()
Ridge()
Lasso()
ElasticNet()
LogisticRegression()
SGDClassifier()
SGDRegressor()
Perceptron()
PassiveAggressiveClassifier()
BayesianRidge()
ARDRegression()
📦 Key Concepts:
- Classification → predict categories (e.g., spam vs. not spam)
- Regression → predict numbers (e.g., house prices)
- Regularization → Ridge, Lasso, ElasticNet prevent overfitting
💡 Use Cases:
- Predicting house prices (LinearRegression, Ridge, Lasso)
- Email spam detection (LogisticRegression)
- Online learning with streaming data (SGDClassifier, PassiveAggressiveClassifier)
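For example, here is a minimal sketch (using a synthetic dataset from make_regression; the alpha values are arbitrary illustrative choices) of how the regularized variants compare to plain LinearRegression:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression vs. two regularized variants
for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, "R^2:", round(model.score(X_test, y_test), 3))

The alpha parameter controls the strength of the regularization penalty: larger values shrink the coefficients more aggressively.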
1.2 Trees & Ensembles
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
RandomForestClassifier, RandomForestRegressor,
ExtraTreesClassifier, ExtraTreesRegressor,
GradientBoostingClassifier, GradientBoostingRegressor,
HistGradientBoostingClassifier, HistGradientBoostingRegressor,
AdaBoostClassifier, AdaBoostRegressor,
BaggingClassifier, BaggingRegressor
)
What are Trees?
Analogy: Think of a flowchart for making decisions. Like when choosing what to wear: “Is it raining? → If yes, take umbrella → If no, check temperature → If cold, wear jacket…”
A Decision Tree asks a series of yes/no questions about your data to reach a final decision. Each question splits the data into smaller groups until you get a clear answer.
What are Ensembles?
Analogy: Instead of asking one expert for advice, you ask multiple experts and take the majority vote. Like asking 10 friends “Should I watch this movie?” and going with what most of them say.
Ensembles combine many decision trees together:
- Random Forest = Ask 100 different tree “experts,” each looking at random parts of the data
  - Training: All trees work in parallel (like 100 chefs cooking different dishes simultaneously)
  - Fast training because trees don’t wait for each other
- Gradient Boosting = Trees learn from each other’s mistakes, getting better step by step
  - Training: Trees work sequentially (like students in a relay race, each learning from the previous one)
  - Slower training but often more accurate because each tree fixes the previous tree’s errors
📦 Key Concepts:
- Work for both classification & regression
- No explicit regularization term; overfitting is controlled with parameters such as max_depth and n_estimators
💡 Use Cases:
- Customer churn prediction (RandomForestClassifier)
- Sales forecasting (GradientBoostingRegressor)
- Fast, interpretable rules (DecisionTreeClassifier)
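To make the parallel vs. sequential idea concrete, here is a minimal sketch on a toy dataset (make_classification; the estimator counts and depths are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging-style ensemble: trees are trained independently (in parallel)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# Boosting-style ensemble: trees are trained sequentially, each correcting the last
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

for name, model in [("RandomForest", rf), ("GradientBoosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "accuracy:", round(scores.mean(), 3))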
1.3 Support Vector Machines
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
What is SVM?
Analogy: Imagine you’re in the school playground during recess. The basketball kids always hang out on one side, and the soccer kids hang out on the other side.
Your job is to draw a chalk line down the middle to separate them. But you want to make the widest possible gap so neither group accidentally crosses into the other’s territory. You find the safest spot where there’s the most space between the closest basketball kid and the closest soccer kid.
That’s what SVM does - it finds the best line with the biggest safety zone on both sides!
Why is SVM Special?
Analogy: Sometimes the playground is messy - basketball kids and soccer kids are all mixed up, and you can’t draw a straight line to separate them.
But SVM has a magic trick! It’s like climbing up to the school roof to look down. From up high, you can suddenly see a clear way to separate the groups that you couldn’t see from ground level.
SVM can “look” at your data from different angles (called “kernels”) to find patterns that aren’t obvious at first glance.
Simple Examples:
- Straight line separation → Like dividing boys and girls in class with a straight line
- Curved separation → Like drawing a circle around all the 7th graders vs 8th graders
📦 Key Concepts:
- SVC/LinearSVC → Sorting things into groups (cats vs dogs, spam vs real emails)
- SVR/LinearSVR → Predicting numbers (like guessing someone’s height based on their shoe size)
- Parameter C → How strict you want to be (strict teacher vs lenient teacher)
💡 Use Cases:
- Recognizing handwritten numbers → Telling apart 0, 1, 2, 3, etc.
- Email spam detection → Separating junk mail from important emails
- Photo recognition → Identifying cats vs dogs in pictures
- Medical diagnosis → Helping doctors spot diseases from test results
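A minimal sketch of the “different angles” idea on toy data (make_moons; the kernels and C value are illustrative choices). SVMs are sensitive to feature scale, so a scaler is included:

from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

for kernel in ["linear", "rbf"]:
    # C controls how "strict" the margin is; the kernel chooses the viewing angle
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(model, X, y, cv=5)
    print(kernel, "accuracy:", round(scores.mean(), 3))

On this kind of data the RBF (curved) kernel usually outperforms the straight-line kernel, which is exactly the “climb to the roof” trick described above.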
1.4 Nearest Neighbors
from sklearn.neighbors import (
KNeighborsClassifier, KNeighborsRegressor,
RadiusNeighborsClassifier, RadiusNeighborsRegressor
)
What is Nearest Neighbors? Analogy: Imagine you’re new to school and want to know what’s cool to wear. You look at the 5 kids sitting closest to you and see what they’re wearing. If 3 out of 5 are wearing sneakers, you decide sneakers are the way to go!
That’s exactly how k-Nearest Neighbors works - it looks at the k closest examples in your data and makes decisions based on what the majority of neighbors are doing.
How Does It Work? Two Simple Steps:
- Find the closest neighbors → Like finding the 5 students with the most similar test scores to yours
- Copy what they do → If most got an A, predict you’ll get an A too
Key Differences:
- KNeighbors → You choose exactly k neighbors (like always asking exactly 5 friends)
- RadiusNeighbors → You choose everyone within a distance (like asking all friends within 10 seats of you)
📦 Key Concepts:
- Classification → What category? (Will this student like pizza or burgers?)
- Regression → What number? (How tall will this student be?)
- No training needed → Just memorizes all examples (like keeping a phone book)
- Parameter k → How many neighbors to ask (more neighbors = safer but less specific)
💡 Real-World Use Cases:
- Movie recommendations → “People who liked Spider-Man also liked Batman”
- Housing prices → “Houses in your neighborhood sold for $300k, so yours might too”
- Medical diagnosis → “Patients with similar symptoms usually have this condition”
- E-commerce → “Customers who bought this also bought that”
- Image recognition → “This photo looks most like pictures labeled ‘cat’”
- Fraud detection → “This transaction looks like previous fraudulent ones”
🎯 When to Use:
- Small datasets → Fast and simple
- Irregular patterns → No assumptions about data shape
- Local patterns matter → Nearby examples are most important
- Quick prototyping → No training time needed
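A minimal sketch (iris dataset; the k values are arbitrary) showing how changing k changes how many “neighbors you ask”:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Classic iris dataset: predict flower species from four measurements
X, y = load_iris(return_X_y=True)

for k in [1, 5, 15]:
    # k = number of neighbors consulted before "voting"
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: accuracy {scores.mean():.3f}")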
1.5 Naive Bayes
from sklearn.naive_bayes import (
GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB, ComplementNB
)
What is Naive Bayes?
Analogy: Imagine you’re a detective trying to guess if someone is a basketball player. You notice they’re tall, wearing sneakers, and have a sports bag. You think:
- “90% of tall people I know play basketball”
- “80% of people wearing sneakers play sports”
- “70% of people with sports bags play basketball”
Naive Bayes combines all these probabilities to make a smart guess. It’s called “naive” because it assumes each clue is independent (height doesn’t affect shoe choice), which isn’t always true, but it works surprisingly well!
How Does It Work?
Simple Steps:
- Learn patterns → “In spam emails, ‘FREE’ appears 80% of the time”
- Calculate probabilities → “This email has ‘FREE’ and ‘URGENT’, what are the chances it’s spam?”
- Pick the most likely → “95% chance spam, 5% chance normal email → It’s spam!”
Different Types:
- GaussianNB → For normal numbers (height, weight, temperature)
- MultinomialNB → For counting things (word frequency in emails)
- BernoulliNB → For yes/no features (does email contain “FREE”?)
- CategoricalNB → For categories (color: red/blue/green)
- ComplementNB → Special version of Multinomial for imbalanced data
📦 Key Concepts:
- Classification only → Sorting things into categories
- Probabilistic → Gives you confidence scores (“85% sure this is spam”)
- Super fast → Trains in seconds, predicts instantly
- Handles incomplete evidence in theory → probabilities can be combined from whichever clues are available (note that scikit-learn’s implementations still expect complete input data)
- Works with small data → Doesn’t need thousands of examples
💡 Real-World Use Cases:
- Email spam filtering → Classic use case, still widely used today
- Text sentiment analysis → “Is this movie review positive or negative?”
- News categorization → Sports, Politics, Entertainment, etc.
- Medical diagnosis → Based on symptoms and test results
- Real-time recommendations → Fast enough for live website suggestions
- Social media monitoring → Classify tweets as positive/negative/neutral
- Document classification → Sort legal documents, research papers
- Language detection → “Is this text English, Spanish, or French?”
🎯 Why Choose Naive Bayes:
- Lightning fast → Train and predict in milliseconds
- Simple to understand → Easy to explain to non-technical people
- Works with small data → Good baseline even with limited examples
- Handles text beautifully → Excellent for any word-based analysis
- Probability scores → Not just predictions, but confidence levels
- Robust baseline → Often surprisingly competitive with complex models
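A minimal text-classification sketch; the tiny spam/not-spam messages below are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = spam, 0 = not spam
texts = [
    "FREE prize claim now", "URGENT free offer", "meeting at noon",
    "lunch tomorrow?", "WIN cash FREE entry", "project update attached",
]
labels = [1, 1, 0, 0, 1, 0]

# CountVectorizer turns text into word counts; MultinomialNB models those counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free cash offer"]))                 # likely predicted as spam
print(model.predict_proba(["see you at the meeting"]))    # confidence scores per class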
1.6 Gaussian Processes & Neural Nets
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
What are These Advanced Models?
Neural Networks (MLP)
Analogy: Think of your brain making a decision. You have layers of neurons - some take in information (eyes see a dog), middle layers process it (recognize fur, tail, ears), and final layers decide (yep, it’s a dog!).
Multi-Layer Perceptron (MLP) works the same way:
- Input layer → Takes your data (age, income, etc.)
- Hidden layers → Like brain neurons, find complex patterns
- Output layer → Makes final decision or prediction
Why Neural Networks Are Special:
- Can learn ANY pattern → Like a super-flexible curve that bends to fit your data perfectly
- Deep learning lite → Baby version of the AI that powers ChatGPT and image recognition
- Self-adjusting → Automatically finds the best way to combine features
Gaussian Processes
Analogy: Imagine you’re drawing a map of unknown territory. You have a few known points, but need to guess what’s between them. Gaussian Processes draw the smoothest possible lines between known points, plus give you confidence intervals (how sure you are about each guess).
Why Gaussian Processes Are Special:
- Uncertainty quantification → Tells you “I’m 90% confident the answer is between 5-7”
- Works with tiny datasets → Can make good guesses with just 10-20 examples
- No overfitting → Built-in protection against memorizing noise
📦 Key Concepts:
- Both classification & regression → Sort categories or predict numbers
- Non-linear patterns → Can handle curves, waves, and complex relationships
- Neural Net regularization → the alpha parameter adds an L2 penalty that prevents overfitting
- Probabilistic predictions → Gaussian Processes give confidence scores
💡 Real-World Use Cases:
Neural Networks (MLP):
- Medical diagnosis → Combining symptoms, test results, and patient history
- Financial prediction → Stock prices, credit scoring, fraud detection
- Marketing optimization → Customer lifetime value, ad targeting
- Engineering → Quality control, sensor data analysis
- Gaming AI → Simple game agents and decision-making
- Pattern recognition → Basic image/audio classification
Gaussian Processes:
- Scientific experiments → Optimize experiments with minimal trials
- Robotics → Safe path planning with uncertainty
- Environmental modeling → Climate data with sparse measurements
- Drug discovery → Molecular property prediction with limited data
- A/B testing → Website optimization with uncertainty bounds
- Sensor networks → Interpolate readings between sensor locations
🎯 When to Use:
Choose Neural Networks (MLP) when:
- Complex, non-linear patterns → Data has weird curves and interactions
- Medium-to-large datasets → You have thousands of examples
- High accuracy needed → Willing to trade interpretability for performance
- Multiple features interact → Features depend on each other in complex ways
Choose Gaussian Processes when:
- Small datasets → You only have 10-100 examples
- Uncertainty matters → Need to know “how sure” you are
- Smooth functions → Expecting gradual changes, not sudden jumps
- Scientific applications → Need mathematically principled uncertainty
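A minimal sketch contrasting the two on a small synthetic 1-D problem (the layer sizes, alpha, and default GP kernel are illustrative choices). Note how the Gaussian Process returns an uncertainty estimate alongside its prediction:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# A small, noisy 1-D problem (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# MLP: flexible function approximator; alpha is its L2 regularization strength
mlp = MLPRegressor(hidden_layer_sizes=(50, 50), alpha=1e-3, max_iter=5000, random_state=42)
mlp.fit(X, y)

# GP: returns a prediction plus a standard deviation ("how sure am I?")
gp = GaussianProcessRegressor(random_state=42)
gp.fit(X, y)

X_new = np.array([[2.5]])
mean, std = gp.predict(X_new, return_std=True)
print("MLP prediction:", mlp.predict(X_new))
print(f"GP prediction: {mean[0]:.2f} ± {std[0]:.2f}")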
2. Unsupervised Learning
👉 Analogy: Imagine you walk into a new city where you don’t know the neighborhoods. Nobody tells you which area is “rich” or “student-friendly” — you just explore and find patterns yourself. That’s what unsupervised learning does: it groups or simplifies data without labels.
2.1 Clustering
from sklearn.cluster import (
KMeans, MiniBatchKMeans, DBSCAN, OPTICS,
AgglomerativeClustering, SpectralClustering,
Birch, MeanShift
)
📦 Analogy: Imagine you walk into a grocery store and observe shoppers. You notice some people always buy fruits, some buy only snacks, and some buy vegetables. Without anyone telling you, you can cluster shoppers into groups based on their behavior.
📊 Common Algorithms:
- KMeans: Like putting customers into k shopping baskets (you decide k beforehand)
- DBSCAN: Like spotting groups of friends at a party — dense groups are clusters, lonely people are “outliers”
- AgglomerativeClustering: Like building a family tree of similarities, merging closest relatives step by step
💡 Use Cases:
- Group customers by shopping behavior (KMeans)
- Fraud detection and anomaly identification (DBSCAN)
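A minimal sketch on synthetic data (make_blobs; the eps value and cluster count are illustrative): KMeans needs k up front, while DBSCAN discovers dense groups itself and labels lonely points as -1:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans: you choose k beforehand
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN: finds dense groups on its own and marks outliers as -1
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print("KMeans cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN labels found:", set(dbscan_labels))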
2.2 Dimensionality Reduction
from sklearn.decomposition import (
PCA, IncrementalPCA, KernelPCA, TruncatedSVD,
NMF, FactorAnalysis, FastICA
)
📦 Analogy: Imagine you have a huge recipe book written in many languages. Instead of keeping all words, you summarize into main themes: “sweet,” “spicy,” “vegan.” You keep fewer dimensions but still capture the essential meaning.
📊 Common Algorithms:
- PCA: Like compressing a photo — fewer pixels, but you still recognize the picture
- NMF: Like summarizing movie reviews into topics: “acting,” “visuals,” “story”
- FactorAnalysis: Like finding hidden factors behind test scores (math ability, reading skills, memory)
💡 Use Cases:
- Reduce image dimensions before classification (PCA)
- Topic discovery in text documents (NMF)
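A minimal PCA sketch on the built-in digits dataset (the 90% variance threshold is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images = 64 features per sample
X, _ = load_digits(return_X_y=True)

# Keep enough components to explain roughly 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance explained:", round(pca.explained_variance_ratio_.sum(), 3))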
2.3 Manifold Learning
📦 Analogy: Imagine a crumpled sheet of paper. In 3-D it looks complicated, but once you unfold it you see it was really just a flat 2-D sheet. Manifold learning “unfolds” high-dimensional data to reveal the simpler shape hiding inside it (scikit-learn’s sklearn.manifold module, e.g. TSNE, Isomap, LocallyLinearEmbedding).
📊 Algorithms & Uses:
- TSNE: Like arranging photos on a table so that similar ones end up next to each other (great for 2-D visualization)
- Isomap: Like measuring distance by walking along the paper’s surface instead of cutting straight through the air
- LocallyLinearEmbedding: Like reconstructing the sheet from many small, locally flat patches
💡 Use cases:
- Visualizing high-dimensional data in 2-D (TSNE)
- Nonlinear dimensionality reduction when PCA’s straight-line projections aren’t enough (Isomap)
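A minimal sketch using TSNE on the digits dataset (default settings; intended only to show the shape of the output, which you would normally scatter-plot):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional images into 2 dimensions for visualization
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

print("Embedded shape:", X_2d.shape)  # (n_samples, 2)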
2.4 Density Estimation
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.neighbors import KernelDensity
📦 Analogy: Imagine you’re at a party with no seating chart. People naturally form groups: some around the food, some near the music. Density estimation finds these dense spots and describes them mathematically.
📊 Common Algorithms:
- GaussianMixture: Like assuming guests belong to overlapping groups (music lovers, food lovers, social butterflies)
- KernelDensity: Like putting a flashlight on the floor — brighter spots indicate more people (higher density)
💡 Use Cases:
- Anomaly detection in sensor data (GaussianMixture)
- Estimate probability distributions of house prices (KernelDensity)
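A minimal sketch on synthetic “party” data (the group locations and bandwidth are made up): the fitted models score how likely a point is, which is the basis of density-based anomaly detection:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

# Two dense "groups at the party" (illustrative only)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# GaussianMixture: assume overlapping Gaussian groups
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
print("Log-likelihood of a typical point:", gmm.score_samples([[0, 0]]))
print("Log-likelihood of an outlier:", gmm.score_samples([[20, 20]]))

# KernelDensity: the "flashlight" view of where the data is dense
kde = KernelDensity(bandwidth=0.5).fit(X)
print("KDE log-density at (0, 0):", kde.score_samples([[0, 0]]))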
2.5 Unsupervised Learning Summary
- Clustering → “Grouping people without name tags”
- Dimensionality Reduction → “Summarizing a big book into key themes”
- Manifold Learning → “Unfolding crumpled paper to see the true shape”
- Density Estimation → “Finding hotspots at a party”
3. Data Preprocessing
📦 Concept: Like “kitchen prep” before cooking — you clean, cut, and season ingredients so the final dish (model) works perfectly.
# Common preprocessing imports
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest
3.1 Scaling & Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Analogy: Imagine comparing athletes: one runs in seconds, one jumps in meters, one lifts in kilograms. To compare them fairly, you rescale everything to the same units.
What it does:
- StandardScaler: Makes features have mean=0, std=1 (like z-scores)
- MinMaxScaler: Rescales features to the [0,1] range
💡 Use Case:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["age", "income"]])
Normalizes age & income so neither dominates the model.
3.2 Encoding & Imputation
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
Analogy:
- Encoding: Like ordering coffee in another country — you point at icons instead of words. Categorical values become numbers.
- Imputation: If someone skips a survey question, you fill in the blank with the most common or average answer.
What it does:
- OneHotEncoder: Turns categories into binary flags (red → [1,0,0], blue → [0,1,0], green → [0,0,1])
- SimpleImputer: Fills missing values with the mean, median, or most frequent value
💡 Use Case:
# Encode categorical data
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]])
# Fill missing values
imputer = SimpleImputer(strategy="mean")
df["age"] = imputer.fit_transform(df[["age"]])
3.3 Feature Generation & Selection
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest
Analogy:
- Feature generation: Like adding “secret ingredients” — flour and water can create bread, a completely new feature
- Feature selection: Like a talent show — only the best performers make it to the final
📊 What it does:
- PolynomialFeatures: Creates squared/cubic terms (e.g., area = length²)
- SelectKBest: Keeps only the most useful features based on statistical tests
💡 Use Case:
# Generate polynomial features
poly = PolynomialFeatures(degree=2)
df_poly = poly.fit_transform(df[["size"]])
# Select best features
selector = SelectKBest(k=5)
df_selected = selector.fit_transform(df.drop("target", axis=1), df["target"])
4. Pipelines & Workflows
4.1 Pipeline & ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
Restaurant Kitchen Analogy:
Think of a restaurant kitchen:
- Vegetables (numeric features like age, income) → washing + chopping station (scaling)
- Spices (categorical features like color, city) → grinding station (encoding)
- Finally, all ingredients are combined and sent to the main chef (the model)
The ColumnTransformer is the kitchen manager: it decides which ingredient goes to which station.
The Pipeline is the conveyor belt that connects preprocessing → model.
🔧 Pipeline Workflow: Building Your ML Assembly Line
🏭 Assembly Line Analogy: Think of a car factory assembly line:
- Station 1: Paint numeric parts (scaling)
- Station 2: Assemble categorical parts (encoding)
- Station 3: Quality control (the ML model)
- Conveyor belt: Pipeline that moves everything smoothly from start to finish
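Putting the kitchen/assembly-line analogy into code, here is a minimal sketch (the column names age, income, color, city and the training data are assumptions for illustration):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical column names: adjust to your own data
numeric_features = ["age", "income"]
categorical_features = ["color", "city"]

# The "kitchen manager": routes each column to the right station
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# The "conveyor belt": preprocessing then the model, bundled as one object
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# pipe.fit(X_train, y_train); pipe.predict(X_new)  # assumes you already have these splits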
4.2 End-to-End Machine Learning Pipeline
Step 1. Data Loading & Splitting
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# imblearn's Pipeline is needed so resampling steps (SMOTE, undersampling) can live inside a pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import joblib
from copy import deepcopy
# Load and explore data
df = pd.read_csv("customer_data.csv")
print(f"Data shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts()}")
# Stratified split (keeps same proportion of target classes)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Identify feature types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")
Step 2. Preprocessing Setup
# Create preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine preprocessing steps
base_preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
Step 3. Base Model Configuration
# Base models to experiment with
base_models = {
    'LogisticRegression': LogisticRegression(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42, n_estimators=100),
}
Step 4. Strategy Definition
# Create multiple strategies (preprocessing + resampling + model variations)
strategies = {}
for model_name, model in base_models.items():
    # Strategy 1: Basic preprocessing + model
    strategies[f"{model_name}_basic"] = Pipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 2: Preprocessing + undersampling + model
    # (resampling steps require imblearn's Pipeline, imported above as ImbPipeline)
    strategies[f"{model_name}_undersample"] = ImbPipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('undersampler', RandomUnderSampler(random_state=42)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 3: Preprocessing + SMOTE oversampling + model
    strategies[f"{model_name}_smote"] = ImbPipeline([
        ('preprocessor', deepcopy(base_preprocessor)),
        ('oversampler', SMOTE(random_state=42)),
        ('classifier', deepcopy(model))
    ])
    # Strategy 4: Class weight balanced
    if hasattr(model, 'class_weight'):
        balanced_model = deepcopy(model)
        balanced_model.class_weight = 'balanced'
        strategies[f"{model_name}_balanced"] = Pipeline([
            ('preprocessor', deepcopy(base_preprocessor)),
            ('classifier', balanced_model)
        ])
Step 5. Cross-Validation Comparison
# Compare all strategies using cross-validation
cv_scores = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Cross-validation results:")
print("-" * 50)
for strategy_name, pipeline in strategies.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
    cv_scores[strategy_name] = {
        'mean_score': scores.mean(),
        'std_score': scores.std(),
        'scores': scores
    }
    print(f"{strategy_name:25} | ROC-AUC: {scores.mean():.4f} (±{scores.std():.4f})")
# Rank strategies by performance
ranked_strategies = sorted(cv_scores.items(), key=lambda x: x[1]['mean_score'], reverse=True)
best_strategy_name = ranked_strategies[0][0]
print(f"\nBest strategy: {best_strategy_name}")
Step 6. Final Model Training 🏆
# Train the best strategy on full training data
best_pipeline = strategies[best_strategy_name]
best_pipeline.fit(X_train, y_train)
print(f"Final model trained: {best_strategy_name}")
Step 7. Test Set Evaluation 📊
# Evaluate on held-out test set
y_pred = best_pipeline.predict(X_test)
y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]
# Comprehensive evaluation
test_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test Set ROC-AUC: {test_auc:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Step 8. Model Persistence 💾
# Save complete pipeline for deployment
model_filename = f"best_model_{best_strategy_name}.pkl"
joblib.dump(best_pipeline, model_filename)
# Save preprocessor separately for flexibility
preprocessor_filename = "preprocessor.pkl"
joblib.dump(base_preprocessor.fit(X_train), preprocessor_filename)
print(f"Model saved: {model_filename}")
print(f"Preprocessor saved: {preprocessor_filename}")
Step 9. Deployment Demo 🚀
# Load saved model
loaded_pipeline = joblib.load(model_filename)

# Production-ready prediction function
def predict_customer_churn(customer_data):
    """
    Predict customer churn probability

    Args:
        customer_data: dict or DataFrame with customer features

    Returns:
        dict with prediction and probability
    """
    if isinstance(customer_data, dict):
        customer_data = pd.DataFrame([customer_data])
    prediction = loaded_pipeline.predict(customer_data)[0]
    probability = loaded_pipeline.predict_proba(customer_data)[0, 1]
    return {
        'prediction': 'Churn' if prediction == 1 else 'No Churn',
        'churn_probability': probability,
        'confidence': 'High' if abs(probability - 0.5) > 0.3 else 'Low'
    }

# Example usage
sample_customer = {
    'age': 35,
    'income': 65000,
    'gender': 'Female',
    'city': 'Chicago'
}
result = predict_customer_churn(sample_customer)
print(f"Prediction: {result}")

# Batch prediction example
new_customers = pd.DataFrame([
    {'age': 28, 'income': 45000, 'gender': 'Male', 'city': 'NYC'},
    {'age': 42, 'income': 85000, 'gender': 'Female', 'city': 'Boston'}
])
batch_predictions = loaded_pipeline.predict_proba(new_customers)[:, 1]
print(f"Batch churn probabilities: {batch_predictions}")
The ColumnTransformer is the factory manager that assigns work to different stations.
The Pipeline is the assembly line that connects all processing steps seamlessly.
4.3 Saving & Loading Models
Option 1: Save the Complete Pipeline (Recommended)
Since the Pipeline already contains both preprocessing steps and the model, you don’t need to save them separately. This makes deployment super easy.
import joblib
# Save the complete pipeline (preprocessor + model)
joblib.dump(pipe, "model_pipeline.pkl")
# Later: load it back
loaded_pipe = joblib.load("model_pipeline.pkl")
# Use directly for prediction
predictions = loaded_pipe.predict(new_data)
📦 Why save the complete pipeline?
- Deployment ready: Everything bundled together for production use
- No rebuild needed: Just load and predict on new data
- Consistency guaranteed: Same preprocessing applied automatically
💡 Use Case:
A company builds a customer churn prediction model. Saving the complete pipeline means you can load it next month and immediately call .predict() on new customers.
Option 2: Save Only the Preprocessor
Sometimes you want to reuse the same preprocessing steps for different models or experiments.
# Save only the preprocessor
joblib.dump(preprocessor, "preprocessor.pkl")
# Load it back
loaded_preprocessor = joblib.load("preprocessor.pkl")
# Transform new data with the same steps
X_transformed = loaded_preprocessor.transform(new_data)
📦 Why save only the preprocessor?
- Experimentation: Try multiple models with identical preprocessing
- Modular approach: Separate data preparation from model training
- Flexibility: Mix and match preprocessors with different algorithms
💡 Use Case:
You preprocess survey data (scale ages, encode gender). Save the preprocessor once, then train both RandomForest and GradientBoosting models using the exact same transformed features.
🍽️ Recipe Analogy:
- Complete Pipeline = Save the recipe + cooking steps + final dish → serve immediately anytime
- Preprocessor Only = Save the ingredient prep steps → use for different dishes (different models)
🎯 Best Practice:
- For deployment/production → Save the complete pipeline
- For experimentation → Save the preprocessor separately
5. Jupyter Notebook Example
Link to the Jupyter Notebook
https://github.com/linsnotes/ml-notebooks/blob/main/ml_pipeline_imbalance.ipynb