Data, Features & Labels
Learn how machine learning sees the world: rows as examples, columns as features, and one special column called the label that the model tries to predict.
Imagine you are a doctor trying to predict whether a patient has a disease. You look at their age, blood pressure, cholesterol level, and whether they smoke. Each of these measurements tells you something useful. A machine learning model works in much the same way: you hand it a pile of measurements, tell it what the right answer was for each past patient, and it learns patterns that help it make predictions for future patients.
Before any algorithm runs, before any math happens, you need data in the right shape. Getting that shape right is more than half the battle.
#The Spreadsheet View of Data
Picture your data as a classic spreadsheet. Each row is one example — one patient, one email, one house sale, one day of weather. Rows are also called samples or observations.
Each column (except the last special one) is a feature — a measurable property of that example. Features are the inputs you feed the model. The collection of all feature values for a single row is called a feature vector.
The one special column is the label (also called the target or y). This is the answer you already know for past examples, and the thing you want the model to predict for future ones. Features are usually called X; the label is called y.
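As a rough sketch of this layout, the snippet below stores a tiny table with the label in the last column and slices it into X and y. The column choice and the patient values are made up for illustration.

```python
# Hypothetical patient table: the last column is the label.
# Columns: age, smoker (1 = yes, 0 = no), cholesterol, disease (1 = yes, 0 = no)
table = [
    [45, 1, 220, 1],
    [30, 0, 180, 0],
    [62, 1, 260, 1],
]

# Every column except the last is a feature; the last column is the label.
X = [row[:-1] for row in table]
y = [row[-1] for row in table]

print("Feature vector for patient 0:", X[0])
print("Label for patient 0:", y[0])
```

Each entry of X is one feature vector, and y holds the matching label at the same position.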
The Flash Cards Analogy
Think of each row as a flash card. The front of the card shows all the features — age: 45, smoker: yes, cholesterol: 220. The back of the card shows the label — disease: yes.
You study thousands of cards (training). Then someone hands you a new card with only the front filled in and asks you to guess the back. That's exactly what a trained model does.
#Building Feature Rows in Python
You don't need any library to understand features. A feature vector is just a list of numbers. A dataset is just a list of those lists. Let's build one from scratch to make it concrete.
# Each row = one house. Features: [size_sqft, num_rooms, age_years]
# Label (y): price in thousands
X = [
    [1200, 3, 10],
    [850, 2, 25],
    [2100, 5, 3],
    [500, 1, 40],
]
y = [250, 180, 420, 95]
print("Number of samples:", len(X))
print("Number of features:", len(X[0]))
print("Feature vector for house 0:", X[0])
print("Label for house 0: $", y[0], "k")
You can also use dictionaries to keep track of which feature is which. This is especially helpful while exploring your data. In real projects, libraries like pandas turn this into a DataFrame, but the idea is identical.
# Same data as named dicts (easier to read during exploration)
samples = [
    {"size_sqft": 1200, "num_rooms": 3, "age_years": 10, "price_k": 250},
    {"size_sqft": 850, "num_rooms": 2, "age_years": 25, "price_k": 180},
]
for s in samples:
    features = {k: v for k, v in s.items() if k != "price_k"}
    label = s["price_k"]
    print("X:", features, "-> y:", label)
#Train, Validation, and Test Sets
Here is a crucial question: once the model has learned from your data, how do you know if it actually learned something general, or just memorized your specific examples?
The answer is to hide some data from the model during training and only use it later to measure performance. We split our dataset into three buckets:
- Training set (~70-80%) — the examples the model learns from.
- Validation set (~10-15%) — examples we check during development to tune settings (called hyperparameters).
- Test set (~10-15%) — examples locked away until the very end, to give an honest final score.
The test set is your blind exam. You never look at it until you are completely done building. If you peek and adjust your model, the score becomes dishonest.
import random
def train_val_test_split(data, val_ratio=0.15, test_ratio=0.15, seed=42):
    # Fix the seed so the same split comes out on every run
    random.seed(seed)
    # Shuffle a copy so the caller's list is left untouched
    shuffled = data[:]
    random.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    # First slice is test, second is validation, the rest is training
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
all_data = list(range(100))  # pretend 100 samples
train, val, test = train_val_test_split(all_data)
print(f"Train: {len(train)} Val: {len(val)} Test: {len(test)}")
The Data Leakage Trap
Never use test-set data to make any decision during training — not even to choose how many features to use. If information from the test set 'leaks' into your training process, your final accuracy score will be falsely optimistic. In the real world your model will perform worse than you thought.
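Also, shuffle before splitting unless the order itself matters (as with time series). If the rows are sorted, say by date or by label, a plain slice can put all the rare examples in the test set and leave none for the model to learn from.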
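To make the shuffling point concrete, here is a minimal sketch that assumes a made-up dataset whose 20 rare positive labels happen to be sorted to the front. The helper name positives_in_test and the sizes are hypothetical.

```python
import random

# Made-up labels, sorted so the 20 rare positives come first.
labels = [1] * 20 + [0] * 80

def positives_in_test(items, n_test=20):
    """Count positives in a test set taken from the front of the list."""
    return sum(items[:n_test])

# Without shuffling, the test slice swallows every positive example,
# so the remaining training data contains none.
unshuffled_count = positives_in_test(labels)

# With shuffling, positives are spread across the whole list.
rng = random.Random(42)  # local generator so the result is repeatable
shuffled = labels[:]
rng.shuffle(shuffled)
shuffled_count = positives_in_test(shuffled)

print("Positives in test (no shuffle):", unshuffled_count)
print("Positives in test (shuffled):  ", shuffled_count)
```

In the unshuffled case the model would train on negatives only, which is the failure the warning above describes.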
#Garbage In, Garbage Out
No algorithm, no matter how clever, can rescue a bad dataset. This is such a well-known truth in ML that it has its own name: GIGO — garbage in, garbage out.
Common data quality problems:
- Missing values: a sensor didn't record, a form field was left blank.
- Wrong labels: a human annotator made an error.
- Biased collection: you only collected data from one city, but you want to predict for the whole country.
- Irrelevant features: including a column like 'customer ID' that has nothing to do with the outcome but can confuse some models.
Spending time cleaning and understanding your data is never wasted time.
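A few of these problems can be spotted with simple counts. The sketch below is one possible starting point, not a complete audit: the rows, column names, and helper functions are invented for illustration, None is assumed to mark a missing value, and checks like these cannot catch wrong labels.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"customer_id": 101, "age": 34, "city": "Lyon", "churned": 0},
    {"customer_id": 102, "age": None, "city": "Lyon", "churned": 1},
    {"customer_id": 103, "age": 51, "city": "Lyon", "churned": 0},
    {"customer_id": 104, "age": 29, "city": "Lyon", "churned": None},
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {key: 0 for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return counts

def unique_counts(rows):
    """Count distinct non-missing values per column."""
    return {
        key: len({row[key] for row in rows if row[key] is not None})
        for key in rows[0]
    }

missing = missing_counts(rows)
unique = unique_counts(rows)

for key in rows[0]:
    notes = []
    if missing[key] > 0:
        notes.append(f"{missing[key]} missing")
    if unique[key] == 1:
        notes.append("constant (possible biased collection)")
    if unique[key] == len(rows):
        notes.append("unique per row (looks like an ID)")
    print(key, "->", ", ".join(notes) or "ok")
```

Here the report flags customer_id as ID-like, city as constant, and age and churned as having one missing value each.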
#Feature Scaling: Why Size Matters
Look back at our house features: [size_sqft, num_rooms, age_years] = [1200, 3, 10]. The first feature is hundreds of times larger than the others just because of the units we chose. Many algorithms (especially those that measure 'distance' between points, or use gradient descent) can be thrown off by this imbalance — they might treat size as far more important than rooms simply because its numbers are bigger.
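A quick way to see the imbalance is to measure the distance between two houses. This sketch uses the first two houses from the earlier dataset and plain Euclidean distance, which is one common choice among several.

```python
import math

# Features: [size_sqft, num_rooms, age_years]
house_a = [1200, 3, 10]
house_b = [850, 2, 25]

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

total = euclidean(house_a, house_b)
size_gap = abs(house_a[0] - house_b[0])

print("Distance using all features:", round(total, 2))
print("Gap in size alone:          ", size_gap)
print("Size gap / total distance:  ", round(size_gap / total, 4))
```

The size gap alone accounts for almost the entire distance, so the differences in rooms and age barely register.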
Feature scaling fixes this by transforming every feature into a comparable range. The two most common methods are:
- Min-Max normalization — squeezes values into [0, 1]. Formula in words: subtract the minimum, then divide by (max minus min).
- Standardization (Z-score) — shifts values so they have mean 0 and standard deviation 1. Formula in words: subtract the mean, then divide by the standard deviation.
Neither method changes which data point is largest or smallest — it only rescales the axis.
def min_max_scale(values):
    # Assumes the values are not all identical (otherwise mx - mn is 0)
    mn, mx = min(values), max(values)
    return [(v - mn) / (mx - mn) for v in values]
def z_score_scale(values):
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divides by n); assumes it is not 0
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
sizes = [1200, 850, 2100, 500]
print("Original: ", sizes)
print("Min-Max: ", [round(v, 2) for v in min_max_scale(sizes)])
print("Z-score: ", [round(v, 2) for v in z_score_scale(sizes)])
When Do You Need Scaling?
Always scale for: k-nearest neighbors, support vector machines, neural networks, linear/logistic regression with gradient descent.
Scaling doesn't matter for: decision trees and random forests — they split on thresholds, so the scale of a feature is irrelevant.
When in doubt, scale. It rarely hurts and often helps.
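The claim about threshold splits can be checked directly. In this sketch, a single hand-picked threshold (1000 sq ft, an arbitrary value chosen for illustration) stands in for one split of a decision tree. It sends the same houses to the same side before and after min-max scaling.

```python
def min_max_scale(values):
    mn, mx = min(values), max(values)
    return [(v - mn) / (mx - mn) for v in values]

sizes = [1200, 850, 2100, 500]
scaled = min_max_scale(sizes)

# A tree-style split: which houses fall at or below the threshold?
raw_threshold = 1000
left_raw = [v <= raw_threshold for v in sizes]

# The same threshold expressed on the scaled axis.
scaled_threshold = (raw_threshold - min(sizes)) / (max(sizes) - min(sizes))
left_scaled = [v <= scaled_threshold for v in scaled]

print("Left branch (raw):   ", left_raw)
print("Left branch (scaled):", left_scaled)
```

Because scaling preserves the order of the values, a threshold that separates the raw values has an equivalent threshold on the scaled axis, and the partition does not change.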
You are building a model to predict apartment rent. Your dataset has columns: apartment_id, size_m2, floor, distance_to_metro_km, rent_euros. Which column is the label (y)?
In Real Projects
Libraries like pandas give you DataFrames that make all of this easier to inspect and manipulate. scikit-learn provides train_test_split, MinMaxScaler, and StandardScaler out of the box (train_test_split makes a single two-way split, so a three-way split takes two calls). They do essentially what you just wrote by hand, so now you know what's happening inside.
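scikit-learn's scalers separate learning the scaling parameters (fit) from applying them (transform). The plain-Python sketch below imitates that pattern to show why it matters for leakage. The function names fit_min_max and apply_min_max and the test values are hypothetical, not part of any library.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters from the training data only."""
    return min(train_values), max(train_values)

def apply_min_max(values, params):
    """Scale values with parameters learned earlier."""
    mn, mx = params
    return [(v - mn) / (mx - mn) for v in values]

train_sizes = [1200, 850, 2100, 500]
test_sizes = [1500, 2500]  # made-up unseen houses

params = fit_min_max(train_sizes)                  # fit on the training set only
train_scaled = apply_min_max(train_sizes, params)
test_scaled = apply_min_max(test_sizes, params)    # reuse the parameters, do not refit

print("Train:", train_scaled)
print("Test: ", test_scaled)
```

Since the minimum and maximum come from the training set alone, an unseen value can land outside [0, 1]. That is expected, and it is the price of keeping test-set information out of the training process.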
Key takeaways
- Rows are samples/examples; columns are features (X); the one column you want to predict is the label (y).
- A feature vector is simply the list of feature values for one sample.
- Always split data into train, validation, and test sets — the test set is a sealed envelope you open only once.
- Garbage in, garbage out: no algorithm compensates for bad, biased, or mislabeled data.
- Feature scaling (min-max or z-score) puts all features on the same playing field, which matters for distance-based and gradient-based algorithms.
This dataset holds house feature vectors. What does the code print?
X = [
    [1200, 3, 10],
    [850, 2, 25],
    [2100, 5, 3],
]
print("Number of samples:", len(X))
print("Number of features:", len(X[0]))
Min-max normalization squeezes values into the [0, 1] range. What does this print?
def min_max_scale(values):
    mn, mx = min(values), max(values)
    return [(v - mn) / (mx - mn) for v in values]
rooms = [3, 2, 5, 1]
print([round(v, 2) for v in min_max_scale(rooms)])
This code separates features (X) from the label (y) for house-price data. It has a bug. What's wrong?
sample = {"size_sqft": 1200, "num_rooms": 3, "age_years": 10, "price_k": 250}
features = {k: v for k, v in sample.items()}
label = sample["price_k"]
print("X:", features, "-> y:", label)
We are splitting a shuffled dataset into test, validation, and train slices. Fill in the slice that takes the FIRST n_test items as the test set.
random.shuffle(shuffled)
n = len(shuffled)
n_test = int(n * 0.15)
n_val = int(n * 0.15)
test = shuffled[]
val = shuffled[n_test:n_test + n_val]
train = shuffled[n_test + n_val:]
Put these steps in the correct order to prepare data before training a model.
Lock the test set away and train the model on the training set
Load the raw dataset and inspect it for missing or wrong values
Separate each row into a feature vector X and a label y
Shuffle the data, then split into train, validation, and test sets
Fit feature scaling on the training set and apply it
You are given a small dataset of students as a list of dicts. Each dict has keys: 'study_hours', 'sleep_hours', 'prev_score', and 'passed' (0 or 1).
Your tasks:
1. Separate the data into X (list of feature lists) and y (list of labels). Features are study_hours, sleep_hours, prev_score in that order.
2. Split X and y into 60% train and 40% test (no shuffling needed; just slice).
3. Apply min-max normalization to the 'study_hours' column of the training set and print the scaled values rounded to 2 decimal places.