Data, Features & Labels

Learn how machine learning sees the world: rows as examples, columns as features, and one special column called the label that the model tries to predict.

Imagine you are a doctor trying to predict whether a patient has a disease. You look at their age, blood pressure, cholesterol level, and whether they smoke. Each of these measurements tells you something useful. A machine learning model works exactly the same way — you hand it a pile of measurements, tell it what the right answer was for each past patient, and it learns patterns that help it predict future patients.

Before any algorithm runs, before any math happens, you need data in the right shape. Getting that shape right is more than half the battle.

#The Spreadsheet View of Data

Picture your data as a classic spreadsheet. Each row is one example — one patient, one email, one house sale, one day of weather. Rows are also called samples or observations.

Each column (except the one special column described next) is a feature: a measurable property of that example. Features are the inputs you feed the model. The collection of all feature values for a single row is called a feature vector.

The one special column is the label (also called the target or y). This is the answer you already know for past examples, and the thing you want the model to predict for future ones. Features are usually called X; the label is called y.

Think of it like

The Flash Cards Analogy

Think of each row as a flash card. The front of the card shows all the features — age: 45, smoker: yes, cholesterol: 220. The back of the card shows the label — disease: yes.

You study thousands of cards (training). Then someone hands you a new card with only the front filled in and asks you to guess the back. That's exactly what a trained model does.

#Building Feature Rows in Python

You don't need any library to understand features. A feature vector is just a list of numbers. A dataset is just a list of those lists. Let's build one from scratch to make it concrete.

A dataset is just lists of lists — no library required.

# Each row = one house. Features: [size_sqft, num_rooms, age_years]
# Label (y): price in thousands

X = [
    [1200, 3, 10],
    [850,  2, 25],
    [2100, 5, 3 ],
    [500,  1, 40],
]

y = [250, 180, 420, 95]

print("Number of samples:", len(X))
print("Number of features:", len(X[0]))
print("Feature vector for house 0:", X[0])
print(f"Label for house 0: ${y[0]}k")

You can also use dictionaries to keep track of which feature is which — especially helpful while exploring your data. In real projects, libraries like pandas turn this into a DataFrame, but the idea is identical.

Separating features from the label is the first real data step in every ML project.

# Same data as named dicts (easier to read during exploration)
samples = [
    {"size_sqft": 1200, "num_rooms": 3, "age_years": 10, "price_k": 250},
    {"size_sqft": 850,  "num_rooms": 2, "age_years": 25, "price_k": 180},
]

for s in samples:
    features = {k: v for k, v in s.items() if k != "price_k"}
    label    = s["price_k"]
    print("X:", features, "-> y:", label)
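
Most algorithms expect plain numeric rows rather than dicts, so at some point the named form has to be flattened back into X and y. A minimal sketch, assuming a fixed feature order; the names FEATURE_NAMES, LABEL_NAME, and to_xy are illustrative and do not come from any library:

```python
# Hypothetical helper: flatten named samples into X (list of lists) and y (list).
FEATURE_NAMES = ["size_sqft", "num_rooms", "age_years"]  # fixed column order
LABEL_NAME = "price_k"

def to_xy(samples, feature_names=FEATURE_NAMES, label_name=LABEL_NAME):
    # One feature vector per sample, always in the same column order
    X = [[s[name] for name in feature_names] for s in samples]
    # One label per sample, in the same row order as X
    y = [s[label_name] for s in samples]
    return X, y

samples = [
    {"size_sqft": 1200, "num_rooms": 3, "age_years": 10, "price_k": 250},
    {"size_sqft": 850,  "num_rooms": 2, "age_years": 25, "price_k": 180},
]

X, y = to_xy(samples)
print(X)  # [[1200, 3, 10], [850, 2, 25]]
print(y)  # [250, 180]
```

Listing the feature names explicitly keeps the column order stable even if the dicts were built with keys in a different order.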

#Train, Validation, and Test Sets

Here is a crucial question: once the model has learned from your data, how do you know if it actually learned something general, or just memorized your specific examples?

The answer is to hide some data from the model during training and only use it later to measure performance. We split our dataset into three buckets:

Training set (~70-80%) — the examples the model learns from.
Validation set (~10-15%) — examples we check during development to tune settings (called hyperparameters).
Test set (~10-15%) — examples locked away until the very end, to give an honest final score.

The test set is your blind exam. You never look at it until you are completely done building. If you peek and adjust your model, the score becomes dishonest.

A simple pure-Python split. In real projects, scikit-learn's train_test_split (in sklearn.model_selection) performs a two-way split in one line; calling it twice yields train, validation, and test sets.

import random

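def train_val_test_split(data, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Shuffle a copy of data, then slice it into train, val, and test lists."""
    rng = random.Random(seed)   # seeded generator, so the split is reproducible
    shuffled = data[:]          # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_ratio)
    n_val  = int(n * val_ratio)
    test   = shuffled[:n_test]
    val    = shuffled[n_test:n_test + n_val]
    train  = shuffled[n_test + n_val:]   # everything left over
    return train, val, test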

all_data = list(range(100))  # pretend 100 samples
train, val, test = train_val_test_split(all_data)
print(f"Train: {len(train)}  Val: {len(val)}  Test: {len(test)}")
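
The function above splits a single list. When features and labels live in separate lists, as X and y did earlier, both need to be reordered with the same permutation so that each row keeps its own label. One way to sketch that is to shuffle a list of row indices; split_xy is a hypothetical helper, and it produces only a two-way train/test split to stay short:

```python
import random

def split_xy(X, y, test_ratio=0.25, seed=42):
    """Hypothetical helper: shuffle row indices once, then slice X and y alike."""
    rng = random.Random(seed)          # local generator, leaves global state alone
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Use the same indices for features and labels so pairs stay aligned
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

X = [[1200, 3, 10], [850, 2, 25], [2100, 5, 3], [500, 1, 40]]
y = [250, 180, 420, 95]

X_train, y_train, X_test, y_test = split_xy(X, y)
print(len(X_train), len(X_test))  # 3 1
```

Shuffling X and y independently would silently pair houses with the wrong prices, which is why the indices are shuffled instead of the lists themselves.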

Common mistake

The Data Leakage Trap

Never use test-set data to make any decision during training — not even to choose how many features to use. If information from the test set 'leaks' into your training process, your final accuracy score will be falsely optimistic. In the real world your model will perform worse than you thought.

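Also: shuffle before splitting. If the data is stored in sorted order (for example by label or by date), an unshuffled slice can put all the rare examples in the test set, leaving none to train on. The exception is time-ordered data, where the order itself carries information and a chronological split is the usual choice.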
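
To see why shuffling matters, consider labels stored in sorted order, which can happen when a file is exported grouped by class. The sketch below uses made-up labels (90 zeros followed by 10 ones) and compares an unshuffled slice with a shuffled one:

```python
import random

# Made-up labels: 90 negatives followed by 10 positives (sorted by class).
labels = [0] * 90 + [1] * 10

# Without shuffling, the last 20% of the list contains every positive example.
cut = int(len(labels) * 0.8)
train_sorted, test_sorted = labels[:cut], labels[cut:]
print("Unshuffled -> positives in train:", sum(train_sorted),
      "| in test:", sum(test_sorted))

# After shuffling, positives usually land in both sets
# (the exact counts depend on the seed).
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
train_shuf, test_shuf = shuffled[:cut], shuffled[cut:]
print("Shuffled   -> positives in train:", sum(train_shuf),
      "| in test:", sum(test_shuf))
```

In the unshuffled case the training set holds no positive examples at all, so a model trained on it would never see the class it is later tested on.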

#Garbage In, Garbage Out

No algorithm, no matter how clever, can rescue a bad dataset. This is such a well-known truth in ML that it has its own name: GIGO — garbage in, garbage out.

Common data quality problems:

Missing values: a sensor didn't record, a form field was left blank.
Wrong labels: a human annotator made an error.
Biased collection: you only collected data from one city, but you want to predict for the whole country.
Irrelevant features: including a column like 'customer ID' that has nothing to do with the outcome but can confuse some models.

Spending time cleaning and understanding your data is never wasted time.
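
Some of these problems can be spotted with a quick scan before any modelling. A rough sketch, assuming missing values are stored as None and using made-up rows; column_report is a hypothetical helper, not a library function:

```python
def column_report(rows):
    """Hypothetical helper: per-column count of missing (None) and distinct values."""
    report = {}
    for name in rows[0]:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        report[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

# Made-up rows for illustration only.
rows = [
    {"customer_id": 101, "age": 34,   "city": "Lyon", "churned": 0},
    {"customer_id": 102, "age": None, "city": "Lyon", "churned": 1},
    {"customer_id": 103, "age": 51,   "city": "Lyon", "churned": 0},
]

for name, stats in column_report(rows).items():
    print(name, stats)
```

Here age has one missing value, customer_id is distinct on every row (a sign it is an identifier rather than a feature), and city takes a single value, which hints at biased collection. Wrong labels are harder to detect automatically and usually need a manual review.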

#Feature Scaling: Why Size Matters

Look back at our house features: [size_sqft, num_rooms, age_years] = [1200, 3, 10]. The first feature is hundreds of times larger than the others just because of the units we chose. Many algorithms (especially those that measure 'distance' between points, or use gradient descent) can be thrown off by this imbalance — they might treat size as far more important than rooms simply because its numbers are bigger.
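
A small calculation makes the imbalance visible. The sketch below takes two of the houses from earlier and measures how much each feature contributes to the squared Euclidean distance between them (Euclidean distance is used here only as a familiar example of a distance measure):

```python
# Two houses from the earlier dataset: [size_sqft, num_rooms, age_years]
house_a = [1200, 3, 10]
house_b = [850, 2, 25]
names = ["size_sqft", "num_rooms", "age_years"]

# Squared difference per feature; their sum is the squared Euclidean distance.
sq_diffs = [(a - b) ** 2 for a, b in zip(house_a, house_b)]
total = sum(sq_diffs)

for name, d in zip(names, sq_diffs):
    print(f"{name:>10}: {d:>7} ({d / total:.1%} of the total)")
```

Size accounts for about 99.8% of the squared distance, so the number of rooms and the age of the house barely register, even though they may matter just as much for the price.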

Feature scaling fixes this by transforming every feature into a comparable range. The two most common methods are:

Min-Max normalization — squeezes values into [0, 1]. Formula in words: subtract the minimum, then divide by (max minus min).
Standardization (Z-score) — shifts values so they have mean 0 and standard deviation 1. Formula in words: subtract the mean, then divide by the standard deviation.

Neither method changes which data point is largest or smallest — it only rescales the axis.
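
Written as formulas, with $x$ a raw value and $x'$ the scaled one (the symbols are a notational choice made here, not taken from the lesson):

```latex
x'_{\text{min-max}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\qquad
x'_{\text{z-score}} = \frac{x - \mu}{\sigma}
```

Here $\mu$ is the mean of the feature and $\sigma$ its standard deviation. As a worked example, the house sizes [1200, 850, 2100, 500] have a minimum of 500 and a maximum of 2100, so min-max scaling maps 1200 to (1200 - 500) / (2100 - 500) = 700 / 1600 = 0.4375.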

Both methods put features on a level playing field without losing information.

def min_max_scale(values):
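    mn, mx = min(values), max(values)
    if mx == mn:  # all values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - mn) / (mx - mn) for v in values]

def z_score_scale(values):
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divides by n, not n - 1)
    std  = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:  # all values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]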

sizes = [1200, 850, 2100, 500]
print("Original:    ", sizes)
print("Min-Max:     ", [round(v, 2) for v in min_max_scale(sizes)])
print("Z-score:     ", [round(v, 2) for v in z_score_scale(sizes)])
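
The functions above scale a single list in isolation. Once the data is split, a common practice is to compute the scaling parameters on the training set only and reuse them on validation and test data, so that no information from held-out examples leaks into training. A minimal sketch with hypothetical helper names (fit_min_max, apply_min_max) and made-up test values:

```python
def fit_min_max(train_values):
    """Learn the scaling parameters (min and max) from training data only."""
    return min(train_values), max(train_values)

def apply_min_max(values, mn, mx):
    """Scale values with previously learned parameters."""
    if mx == mn:                      # constant feature: nothing to rescale
        return [0.0 for _ in values]
    return [(v - mn) / (mx - mn) for v in values]

train_sizes = [1200, 850, 2100, 500]
test_sizes = [1000, 2500]             # made-up unseen houses

mn, mx = fit_min_max(train_sizes)
print(apply_min_max(train_sizes, mn, mx))  # [0.4375, 0.21875, 1.0, 0.0]
print(apply_min_max(test_sizes, mn, mx))   # [0.3125, 1.25]
```

A test value can fall outside [0, 1] (here 1.25) when it lies beyond the range seen in training. That is expected: the test set is scaled with the training parameters, not its own.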

Tip

When Do You Need Scaling?

Always scale for: k-nearest neighbors, support vector machines, neural networks, linear/logistic regression with gradient descent.

Scaling doesn't matter for: decision trees and random forests — they split on thresholds, so the scale of a feature is irrelevant.

When in doubt, scale. It rarely hurts and often helps.

Quick check

You are building a model to predict apartment rent. Your dataset has columns: apartment_id, size_m2, floor, distance_to_metro_km, rent_euros. Which column is the label (y)?

Note

In Real Projects

Libraries like pandas give you DataFrames that make all of this easier to inspect and manipulate. scikit-learn provides train_test_split, MinMaxScaler, and StandardScaler out of the box. Under the hood they apply the same ideas you just wrote by hand, so now you know what's happening inside.

Key takeaways

Rows are samples/examples; columns are features (X); the one column you want to predict is the label (y).
A feature vector is simply the list of feature values for one sample.
Always split data into train, validation, and test sets — the test set is a sealed envelope you open only once.
Garbage in, garbage out: no algorithm compensates for bad, biased, or mislabeled data.
Feature scaling (min-max or z-score) puts all features on the same playing field, which matters for distance-based and gradient-based algorithms.

Practice challenges

Predict the output #1

This dataset holds house feature vectors. What does the code print?

X = [
    [1200, 3, 10],
    [850,  2, 25],
    [2100, 5, 3 ],
]

print("Number of samples:", len(X))
print("Number of features:", len(X[0]))

Predict the output #2

Min-max normalization squeezes values into the [0, 1] range. What does this print?

def min_max_scale(values):
    mn, mx = min(values), max(values)
    return [(v - mn) / (mx - mn) for v in values]

rooms = [3, 2, 5, 1]
print([round(v, 2) for v in min_max_scale(rooms)])

Fix the bug #3

This code separates features (X) from the label (y) for house-price data. It has a bug. What's wrong?

sample = {"size_sqft": 1200, "num_rooms": 3, "age_years": 10, "price_k": 250}

features = {k: v for k, v in sample.items()}
label    = sample["price_k"]

print("X:", features, "-> y:", label)

Fill in the blank #4

We are splitting a shuffled dataset into test, validation, and train slices. Fill in the slice that takes the FIRST n_test items as the test set.

random.shuffle(shuffled)
n = len(shuffled)
n_test = int(n * 0.15)
n_val  = int(n * 0.15)

test  = shuffled[____]
val   = shuffled[n_test:n_test + n_val]
train = shuffled[n_test + n_val:]

Reorder the lines #5

Put these steps in the correct order to prepare data before training a model.

Lock the test set away and train the model on the training set

Load the raw dataset and inspect it for missing or wrong values

Separate each row into a feature vector X and a label y

Shuffle the data, then split into train, validation, and test sets

Fit feature scaling on the training set and apply it

Your turn

Practice exercise

You are given a small dataset of students as a list of dicts. Each dict has keys: 'study_hours', 'sleep_hours', 'prev_score', and 'passed' (0 or 1).

Your tasks:

1. Separate the data into X (list of feature lists) and y (list of labels). Features are study_hours, sleep_hours, prev_score in that order.
2. Split X and y into 60% train and 40% test (no shuffling needed, just slice).
3. Apply min-max normalization to the 'study_hours' column of the training set and print the scaled values rounded to 2 decimal places.

students = [
    {"study_hours": 8,  "sleep_hours": 7, "prev_score": 72, "passed": 1},
    {"study_hours": 2,  "sleep_hours": 5, "prev_score": 45, "passed": 0},
    {"study_hours": 5,  "sleep_hours": 8, "prev_score": 60, "passed": 1},
    {"study_hours": 1,  "sleep_hours": 4, "prev_score": 38, "passed": 0},
    {"study_hours": 9,  "sleep_hours": 7, "prev_score": 85, "passed": 1},
    {"study_hours": 3,  "sleep_hours": 6, "prev_score": 50, "passed": 0},
    {"study_hours": 7,  "sleep_hours": 8, "prev_score": 78, "passed": 1},
    {"study_hours": 4,  "sleep_hours": 5, "prev_score": 55, "passed": 0},
    {"study_hours": 6,  "sleep_hours": 7, "prev_score": 65, "passed": 1},
    {"study_hours": 10, "sleep_hours": 8, "prev_score": 90, "passed": 1},
]

# Step 1: Build X and y
X = []
y = []
# your code here

# Step 2: Split into train / test (60 / 40 split)
# your code here

# Step 3: Min-max scale study_hours in the training set
# your code here

# Print results