Putting It Together · Beginner · 10 min · 13 / 13

ML in Practice

Walk through the real machine learning workflow — from messy data to a deployed model — and learn the essential tools, pitfalls, and next steps every practitioner needs.

You've learned what ML is, studied a handful of algorithms, and maybe even trained a model or two. But real-world ML doesn't start at "run the algorithm". It starts with a messy spreadsheet, a vague business question, and a lot of detective work. In practice, the algorithm itself is often the smallest part of the job. Knowing the full workflow — every stage from problem definition to production monitoring — is what separates someone who read about ML from someone who actually ships it.

#The Seven Stages — A Walkthrough

Think of an ML project as a pipeline: (1) Define the problem — write one crisp sentence about what you're predicting and how you'll measure success. (2) Gather and clean data — raw data is almost never ready to use. (3) Split the data — reserve a test set before touching anything else. (4) Pick a model — start simple; earn the right to complexity later. (5) Train — run the learning algorithm on your training set. (6) Evaluate — measure performance on unseen data. (7) Tune, deploy, and monitor — improve, ship, and keep watching.

Each stage has its own traps. Let's walk through them.

#Stages 1 & 2 — Define the Problem and Clean the Data

Common data problems include missing values, wrong data types (a number stored as text like "$4,500"), outliers from data entry errors, duplicate rows, and imbalanced classes (99% of examples are not-fraud, making it easy to look accurate while learning nothing). Here's a minimal cleaning pass:

A minimal data-cleaning pass — real projects do far more, but the logic is identical.
# Simulate a tiny raw dataset with problems
raw = [
    {"age": 25, "income": 50000, "churned": 0},
    {"age": None, "income": 62000, "churned": 1},  # missing age
    {"age": 999, "income": 48000, "churned": 0},  # outlier age
    {"age": 31, "income": 55000, "churned": 0},
    {"age": 31, "income": 55000, "churned": 0},  # duplicate
]

clean = []
seen = set()
for row in raw:
    key = (row["age"], row["income"], row["churned"])
    if row["age"] is None: continue      # drop missing
    if row["age"] > 120:   continue      # drop outlier
    if key in seen:        continue      # drop duplicate
    seen.add(key)
    clean.append(row)

print(f"Raw rows: {len(raw)}, Clean rows: {len(clean)}")
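The pass above only drops rows. The wrong-type problem mentioned earlier (a number stored as text like "$4,500") usually calls for conversion instead. A minimal sketch, assuming US-style formatting with a dollar sign and comma thousands separators; `parse_money` is a hypothetical helper, not part of any library:

```python
def parse_money(text):
    """Convert a string such as "$4,500" to a float; return None if it can't be parsed."""
    if text is None:
        return None
    # Strip the currency symbol, thousands separators, and surrounding spaces.
    cleaned = str(text).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # leave unparseable values for the missing-value step

raw_incomes = ["$4,500", "62000", " $1,250.50 ", "n/a", None]  # made-up examples
parsed = [parse_money(value) for value in raw_incomes]
print(parsed)
```

Values that come back as None can then be handled by the same missing-value rule as any other gap.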
Think of it like

Data is the Ingredient, the Model is the Recipe

A world-class chef with rotten vegetables still produces a bad dish. No matter how clever your algorithm, garbage data produces a garbage model. This is so common in ML that practitioners have a saying: "garbage in, garbage out." Spending 70% of your project time on data collection and cleaning is completely normal — and worth it.

Common mistake

Stage 3 — Split First (Data Leakage Is the Silent Killer)

Once your data is clean, the very first thing you do is set aside a test set (typically 20%) that the model will never see during training — your unbiased final exam.
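One way to sketch the 80/20 split in plain Python is shown here. `split_rows` is a hypothetical helper written for this lesson, and the fixed seed is an assumption added so the split is reproducible:

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = [{"id": i, "churned": i % 2} for i in range(10)]  # made-up rows
train, test = split_rows(rows)
print(f"Train rows: {len(train)}, Test rows: {len(test)}")
```

Shuffling before slicing matters when the rows arrive sorted (by date or by label, say); otherwise the test set may not resemble the training set.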

Data leakage happens when information from the test set sneaks into training. Classic mistake: computing the average salary across all rows (including test rows) and using it to build a feature, such as each row's salary relative to that average. The model has effectively "peeked" at test data, so its measured accuracy looks great but falls apart in production. The fix: always split first, then engineer features using only training-set statistics.
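The difference between the leaky and the safe version can be sketched with a few made-up salaries. The "salary minus average" feature is an illustrative assumption, not a prescribed recipe:

```python
train_salaries = [40000, 52000, 61000, 47000]  # made-up training rows
test_salaries = [150000, 58000]                # made-up test rows

# Leaky: the mean is computed over training AND test rows.
all_salaries = train_salaries + test_salaries
leaky_mean = sum(all_salaries) / len(all_salaries)

# Safe: the mean comes from the training rows only...
train_mean = sum(train_salaries) / len(train_salaries)

# ...and that same training-set statistic is reused for the test rows.
train_feature = [s - train_mean for s in train_salaries]
test_feature = [s - train_mean for s in test_salaries]

print(f"Leaky mean: {leaky_mean:.0f}, train-only mean: {train_mean:.0f}")
```

The two means differ because one unusually high test salary pulls the leaky average up, which is exactly the information the model should not have had.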

#Stages 4, 5 & 6 — Pick, Train, and Evaluate

Picking a model is less about finding the perfect algorithm and more about starting sensibly. For classification, start with Logistic Regression or a Decision Tree. For regression, start with Linear Regression. Upgrade to Random Forest or Gradient Boosting only after benchmarking the simple baseline.
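Benchmarking needs a reference point. The sketch below uses an even simpler one than the models named above: a majority-class predictor that always outputs the most common training label. That choice, the labels, and the `candidate_preds` list (standing in for the output of whatever model is being tested) are all assumptions made for illustration:

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most common training label; the baseline predicts it for every row."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

train_labels = [0, 0, 1, 0, 1, 0, 0, 1]
test_labels = [0, 1, 0, 0, 1, 0]
candidate_preds = [0, 1, 0, 0, 0, 0]  # hypothetical predictions from a model under test

baseline_label = majority_baseline(train_labels)
baseline_preds = [baseline_label] * len(test_labels)

baseline_acc = accuracy(test_labels, baseline_preds)
candidate_acc = accuracy(test_labels, candidate_preds)
print(f"Baseline accuracy:  {baseline_acc:.2f}")
print(f"Candidate accuracy: {candidate_acc:.2f}")
```

A candidate that cannot beat this kind of baseline has not learned anything useful, however sophisticated it is.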

Training is running the algorithm on your training set. In real projects a single scikit-learn call (the estimator's fit method) usually handles this. Evaluation is measuring performance on the test set. The right metric matters:

  • Accuracy: fraction correct. Misleading when classes are imbalanced.
  • Precision / Recall: for classification, how often is the model right when it says positive (precision), and how many of the true positives does it catch (recall)?
  • Mean Absolute Error (MAE): for regression, the average absolute gap between prediction and reality.

Computing three evaluation metrics from scratch — the formulas are simpler than they look.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]  # one false negative, one false positive

correct   = sum(t == p for t, p in zip(y_true, y_pred))
accuracy  = correct / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall    = tp / (tp + fn) if (tp + fn) > 0 else 0

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
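The snippet above covers the classification metrics. For regression, MAE can be sketched the same way; the numbers here are made up for illustration:

```python
actual = [200.0, 150.0, 320.0, 275.0]     # made-up true values
predicted = [210.0, 140.0, 300.0, 275.0]  # made-up model predictions

# Average of the absolute differences between prediction and reality.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(f"MAE: {mae:.2f}")
```

MAE is in the same units as the target, so a value of 10 here means predictions are off by 10 units on average.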

#Stage 7 — Tune, Deploy, and Monitor

Tuning means adjusting hyperparameters: settings you choose before training, like how deep a decision tree grows. You try different values, score each on a held-out validation set (carved out of the training data, never the test set), and keep the best. Deployment puts the model somewhere real users can call it: a REST API, a nightly batch job, or a function in a mobile app.
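The tuning loop can be sketched in plain Python. To keep it short, this example swaps in a different hyperparameter than tree depth: the number of neighbours k in a tiny one-feature nearest-neighbour classifier. The classifier, the data, and the candidate values are all assumptions chosen for illustration:

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_accuracy(train, data, k):
    """Fraction of (feature, label) pairs in data that are predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Made-up (feature, label) pairs. The validation set is separate from the
# training rows, and the test set is not touched at all during tuning.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.6, 1), (3.0, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
validation = [(1.2, 0), (2.5, 0), (3.3, 1), (4.2, 1)]

# Try each candidate value and keep the one that scores best on validation.
scores = {k: knn_accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Only after the best value is chosen does the test set come out, once, for the final score.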

Monitoring is the step most beginners forget. The world changes — customer behaviour in January can look nothing like July. A model that was 92% accurate at launch can silently drift to 70% over time (model drift). Good practitioners watch performance dashboards and retrain when accuracy slips. The workflow is a cycle, not a straight line.
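A monitoring check can be as simple as comparing recent accuracy against the launch figure. The sketch below assumes true labels eventually arrive for each batch of predictions; `needs_retraining`, the 0.15 tolerance, and the weekly batches are hypothetical choices for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def needs_retraining(batches, baseline, tolerance=0.15):
    """Return the index of the first batch whose accuracy falls more than
    `tolerance` below `baseline`, or None if every batch is acceptable."""
    for i, (y_true, y_pred) in enumerate(batches):
        if accuracy(y_true, y_pred) < baseline - tolerance:
            return i
    return None

# Made-up weekly batches of (true labels, model predictions).
weekly_batches = [
    ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]),  # 9 of 10 correct
    ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]),  # 8 of 10 correct
    ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]),  # 6 of 10 correct
]

week = needs_retraining(weekly_batches, baseline=0.92)
print("First week needing a retrain (0-based index):", week)
```

In practice the threshold depends on how costly errors are; the point is that the check runs continuously rather than once at launch.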

The essential toolkit: NumPy (fast arrays), pandas (spreadsheet-style data wrangling), scikit-learn (plug-and-play algorithms and metrics), matplotlib/seaborn (visualisation), and Jupyter Notebooks (interactive exploration). For next steps, Kaggle's beginner competitions (house prices, Titanic) give you real data and a community to learn from.

Quick check

A data scientist builds a churn-prediction model. Before splitting the data, she computes the average account balance across the entire dataset and adds it as a feature. What problem does this introduce?

Key takeaways

  • The ML workflow has seven stages — problem definition, data cleaning, splitting, model selection, training, evaluation, and deployment/monitoring. The algorithm is just one piece.
  • Always split your data before any feature engineering to prevent data leakage — information sneaking from the test set into training will make performance look artificially good.
  • Pick the simplest model first. Earn the right to use a complex one by benchmarking against a simple baseline.
  • Choose your evaluation metric carefully — accuracy alone is misleading on imbalanced datasets; also check precision, recall, or MAE depending on your problem.
  • Deployment is not the finish line — models drift over time as the world changes, so monitoring and periodic retraining are part of the job.
Practice challenges
Predict the output #1

This is a minimal data-cleaning pass over a tiny raw dataset with problems. What does it print?

raw = [
    {"age": 40, "income": 70000},   # valid
    {"age": None, "income": 52000}, # missing age
    {"age": 150, "income": 60000},  # outlier age
    {"age": 33, "income": 55000},   # valid
    {"age": 33, "income": 55000},   # duplicate
]

clean = []
seen = set()
for row in raw:
    if row["age"] is None: continue
    if row["age"] > 120:   continue
    key = (row["age"], row["income"])
    if key in seen:        continue
    seen.add(key)
    clean.append(row)

print(f"Raw rows: {len(raw)}, Clean rows: {len(clean)}")
Predict the output #2

A fraud model is evaluated on an imbalanced test set (only 1 of 10 cases is fraud). The model lazily predicts 0 (not-fraud) for everything. What does it print?

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts 'not fraud'

correct  = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.2f}")
Fill in the blank #3

Complete the precision calculation. Precision asks: when the model says positive, how often is it right? Fill in the denominator.

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (____) if (tp + fp) > 0 else 0
Reorder the lines #4

Put the seven stages of the ML workflow into the correct order, from the start of a project to keeping it healthy in production.

  • Evaluate on the untouched test set using the right metric
  • Tune, deploy, and monitor for drift over time
  • Train the model on the training set
  • Gather and clean the data: handle missing values, outliers, and duplicates
  • Define the problem: one crisp sentence on what you predict and how you measure success
  • Pick a model: start simple, like Logistic Regression or a Decision Tree
  • Split the data: set aside a test set before touching anything else
Your turn
Practice exercise

Write a function evaluate(y_true, y_pred) that takes two lists of 0/1 labels and returns a dictionary with four keys: "accuracy", "precision", "recall", and "f1". The F1 score is the harmonic mean of precision and recall: 2 * precision * recall / (precision + recall). Handle the edge case where precision + recall is 0 by returning 0.0 for F1. Test your function on the sample data provided.
