ML in Practice
Walk through the real machine learning workflow — from messy data to a deployed model — and learn the essential tools, pitfalls, and next steps every practitioner needs.
You've learned what ML is, studied a handful of algorithms, and maybe even trained a model or two. But real-world ML doesn't start at "run the algorithm". It starts with a messy spreadsheet, a vague business question, and a lot of detective work. In practice, the algorithm itself is often the smallest part of the job. Knowing the full workflow — every stage from problem definition to production monitoring — is what separates someone who read about ML from someone who actually ships it.
#The Seven Stages — A Walkthrough
Think of an ML project as a pipeline: (1) Define the problem — write one crisp sentence about what you're predicting and how you'll measure success. (2) Gather and clean data — raw data is almost never ready to use. (3) Split the data — reserve a test set before touching anything else. (4) Pick a model — start simple; earn the right to complexity later. (5) Train — run the learning algorithm on your training set. (6) Evaluate — measure performance on unseen data. (7) Tune, deploy, and monitor — improve, ship, and keep watching.
Each stage has its own traps. Let's walk through them.
#Stages 1 & 2 — Define the Problem and Clean the Data
Common data problems include missing values, wrong data types (a number stored as text like "$4,500"), outliers from data entry errors, duplicate rows, and imbalanced classes (99% of examples are not-fraud, making it easy to look accurate while learning nothing). Here's a minimal cleaning pass:
# Simulate a tiny raw dataset with problems
raw = [
    {"age": 25, "income": 50000, "churned": 0},
    {"age": None, "income": 62000, "churned": 1},  # missing age
    {"age": 999, "income": 48000, "churned": 0},   # outlier age
    {"age": 31, "income": 55000, "churned": 0},
    {"age": 31, "income": 55000, "churned": 0},    # duplicate
]
clean = []
seen = set()
for row in raw:
    key = (row["age"], row["income"], row["churned"])
    if row["age"] is None: continue  # drop missing
    if row["age"] > 120: continue    # drop outlier
    if key in seen: continue         # drop duplicate
    seen.add(key)
    clean.append(row)
print(f"Raw rows: {len(raw)}, Clean rows: {len(clean)}")

Data is the Ingredient, the Model is the Recipe
A world-class chef with rotten vegetables still produces a bad dish. No matter how clever your algorithm, garbage data produces a garbage model. This is so common in ML that practitioners have a saying: "garbage in, garbage out." Spending 70% of your project time on data collection and cleaning is completely normal — and worth it.
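The cleaning pass above covers missing values, outliers, and duplicates, but not the wrong-data-type problem (a number stored as text like "$4,500"). One possible way to handle it is sketched below. The helper name `parse_money` and the sample values are hypothetical, chosen only for illustration, and the sketch assumes US-style formatting with a dollar sign and comma thousands separators.

```python
def parse_money(value):
    """Convert values like "$4,500" to a float; return None if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return None
    # Strip the currency symbol and thousands separators before converting.
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

# Hypothetical raw income column mixing text and numbers.
incomes = ["$4,500", "62000", 48000, "n/a"]
parsed = [parse_money(v) for v in incomes]
print(parsed)  # [4500.0, 62000.0, 48000.0, None]
```

Returning `None` for unparseable entries keeps the decision about what to do with them (drop or fill in) in the same place as the missing-value handling.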
#Stage 3 — Split First (Data Leakage Is the Silent Killer)
Once your data is clean, the very first thing you do is set aside a test set (typically 20%) that the model will never see during training — your unbiased final exam.
Data leakage happens when information from the test set sneaks into training. Classic mistake: computing the average salary across all rows (including test rows) and using it as a feature. The model has effectively "peeked" at test data — its measured accuracy looks great but falls apart in production. The fix: always split first, then engineer features using only training-set statistics.
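A minimal sketch of "split first, then engineer features" might look like this. The salary figures, the fixed random seed, and the choice of feature (salary minus the training mean) are assumptions made for illustration; the point is only that the mean is computed from training rows and then reused unchanged on the test rows.

```python
import random

# Hypothetical salary column for ten rows.
salaries = [42000, 55000, 61000, 38000, 72000, 49000, 58000, 66000, 45000, 53000]

# 1. Shuffle and split FIRST (80% train, 20% test).
rng = random.Random(0)  # fixed seed so the split is reproducible
shuffled = salaries[:]
rng.shuffle(shuffled)
split_at = int(len(shuffled) * 0.8)
train, test = shuffled[:split_at], shuffled[split_at:]

# 2. Compute the statistic from training rows only.
train_mean = sum(train) / len(train)

# 3. Apply the SAME training-set statistic to both sets.
train_features = [s - train_mean for s in train]
test_features = [s - train_mean for s in test]

print(f"Train rows: {len(train)}, Test rows: {len(test)}")
print(f"Mean used for features (train only): {train_mean:.2f}")
```

The leaky version would compute the mean over `salaries` before splitting, which lets the test rows influence a value the model is trained on.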
#Stages 4, 5 & 6 — Pick, Train, and Evaluate
Picking a model is less about finding the perfect algorithm and more about starting sensibly. For classification, start with Logistic Regression or a Decision Tree. For regression, start with Linear Regression. Upgrade to Random Forest or Gradient Boosting only after benchmarking the simple baseline.
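Before even the simple models named above, it can help to record what a trivial predictor scores, so that every later model has a floor to beat. The sketch below uses a majority-class predictor, which is a swapped-in stand-in and not Logistic Regression or a Decision Tree; the labels are made up for illustration.

```python
from collections import Counter

def majority_class(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

# Hypothetical labels: 0 = stayed, 1 = churned.
train_labels = [0, 0, 0, 1, 0, 0, 1, 0]
test_labels = [0, 1, 0, 0]

# The "model" ignores features and always predicts the majority class.
constant_prediction = majority_class(train_labels)
baseline_preds = [constant_prediction for _ in test_labels]

baseline_acc = sum(t == p for t, p in zip(test_labels, baseline_preds)) / len(test_labels)
print(f"Baseline accuracy: {baseline_acc:.2f}")  # Baseline accuracy: 0.75
```

A more complex model that cannot clearly beat this number has not yet earned its extra complexity.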
Training is running the algorithm on your training set; in real projects one scikit-learn line handles this. Evaluation is measuring performance on the test set. The right metric matters:
- Accuracy: the fraction of predictions that are correct. Misleading when classes are imbalanced.
- Precision / Recall: for classification, how often is the model right when it says positive, and how many of the actual positives does it catch?
- Mean Absolute Error (MAE): for regression, the average gap between prediction and reality.
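For the regression metric, Mean Absolute Error can be computed by hand in a few lines. This is a small sketch; the function name and the sample numbers are illustrative assumptions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical house prices in thousands.
actual = [200, 150, 320, 275]
predicted = [210, 140, 300, 280]

mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.2f}")  # MAE: 11.25
```

The classification metrics can be computed by hand in the same spirit: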
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1] # one false negative, one false positive
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

#Stage 7 — Tune, Deploy, and Monitor
Tuning means adjusting hyperparameters: settings you choose before training, like how deep a decision tree grows. You try different values on a held-out validation set (a slice carved out of the training data, kept separate from the test set) and keep the best. Deployment puts the model somewhere real users can call it: a REST API, a nightly batch job, or a function in a mobile app.
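The tuning loop can be sketched in pure Python. To keep it short, this example tunes `k` in a tiny hand-written k-nearest-neighbours classifier instead of tree depth; the classifier, the one-feature data, and the candidate values are all assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, dataset, k):
    """Fraction of points in dataset that k-NN (fitted on train) labels correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in dataset)
    return correct / len(dataset)

# Hypothetical (feature, label) pairs; two training labels are deliberately noisy.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 0), (8.0, 1)]
validation = [(1.2, 0), (2.6, 0), (2.9, 0), (6.2, 1), (7.4, 1), (7.8, 1)]

# Try each candidate hyperparameter on the validation set and keep the best.
scores = {}
best_k, best_score = None, -1.0
for k in [1, 3, 5]:  # odd values avoid tied votes with two classes
    scores[k] = accuracy(train, validation, k)
    if scores[k] > best_score:  # strict ">" keeps the smallest k on ties
        best_k, best_score = k, scores[k]

print(f"Validation accuracy per k: {scores}")
print(f"Best k: {best_k}")
```

The test set plays no part in this loop. It is scored once, after `best_k` has been chosen, so the final number stays an honest estimate.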
Monitoring is the step most beginners forget. The world changes — customer behaviour in January can look nothing like July. A model that was 92% accurate at launch can silently drift to 70% over time (model drift). Good practitioners watch performance dashboards and retrain when accuracy slips. The workflow is a cycle, not a straight line.
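One simple way to watch for drift is to track accuracy over a rolling window of recent predictions once the true outcomes arrive. The class below is a hypothetical sketch: the name `AccuracyMonitor`, the window size, and the alert threshold are illustrative assumptions, not a standard API.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions and flags drops."""

    def __init__(self, window_size=100, alert_below=0.80):
        self.outcomes = deque(maxlen=window_size)  # old results fall off automatically
        self.alert_below = alert_below

    def record(self, y_true, y_pred):
        """Store whether one prediction matched the eventual true label."""
        self.outcomes.append(y_true == y_pred)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True only when the window is full and accuracy is under the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.alert_below

monitor = AccuracyMonitor(window_size=5, alert_below=0.8)

for _ in range(5):          # at launch: every prediction is correct
    monitor.record(1, 1)
healthy = monitor.needs_retraining()

for _ in range(3):          # later: the world changes and predictions start missing
    monitor.record(1, 0)
drifted = monitor.needs_retraining()

print(f"At launch -> retrain? {healthy}")
print(f"After drift -> accuracy {monitor.accuracy():.2f}, retrain? {drifted}")
```

Waiting for a full window before alerting avoids false alarms from the first handful of predictions.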
The essential toolkit: NumPy (fast arrays), pandas (spreadsheet-style data wrangling), scikit-learn (plug-and-play algorithms and metrics), matplotlib/seaborn (visualisation), and Jupyter Notebooks (interactive exploration). For next steps, Kaggle's beginner competitions (house prices, Titanic) give you real data and a community to learn from.
A data scientist builds a churn-prediction model. Before splitting the data, she computes the average account balance across the entire dataset and adds it as a feature. What problem does this introduce?
Key takeaways
- The ML workflow has seven stages — problem definition, data cleaning, splitting, model selection, training, evaluation, and deployment/monitoring. The algorithm is just one piece.
- Always split your data before any feature engineering to prevent data leakage — information sneaking from the test set into training will make performance look artificially good.
- Pick the simplest model first. Earn the right to use a complex one by benchmarking against a simple baseline.
- Choose your evaluation metric carefully — accuracy alone is misleading on imbalanced datasets; also check precision, recall, or MAE depending on your problem.
- Deployment is not the finish line — models drift over time as the world changes, so monitoring and periodic retraining are part of the job.
This is a minimal data-cleaning pass over a tiny raw dataset with problems. What does it print?
raw = [
    {"age": 40, "income": 70000},    # valid
    {"age": None, "income": 52000},  # missing age
    {"age": 150, "income": 60000},   # outlier age
    {"age": 33, "income": 55000},    # valid
    {"age": 33, "income": 55000},    # duplicate
]
clean = []
seen = set()
for row in raw:
    if row["age"] is None: continue
    if row["age"] > 120: continue
    key = (row["age"], row["income"])
    if key in seen: continue
    seen.add(key)
    clean.append(row)
print(f"Raw rows: {len(raw)}, Clean rows: {len(clean)}")

A fraud model is evaluated on an imbalanced test set (only 1 of 10 cases is fraud). The model lazily predicts 0 (not-fraud) for everything. What does it print?
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # always predicts 'not fraud'
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.2f}")

Complete the precision calculation. Precision asks: when the model says positive, how often is it right? Fill in the denominator.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (____) if (tp + fp) > 0 else 0
Put the seven stages of the ML workflow into the correct order, from the start of a project to keeping it healthy in production.
Evaluate on the untouched test set using the right metric
Tune, deploy, and monitor for drift over time
Train the model on the training set
Gather and clean the data — handle missing values, outliers, and duplicates
Define the problem — one crisp sentence on what you predict and how you measure success
Pick a model — start simple, like Logistic Regression or a Decision Tree
Split the data — set aside a test set before touching anything else
Write a function evaluate(y_true, y_pred) that takes two lists of 0/1 labels and returns a dictionary with four keys: "accuracy", "precision", "recall", and "f1". The F1 score is the harmonic mean of precision and recall: 2 * precision * recall / (precision + recall). Handle the edge case where precision + recall is 0 by returning 0.0 for F1. Test your function on the sample data provided.