How to Implement One-Hot Encoding for Labels in Python
One-hot encoding is a fundamental technique in machine learning preprocessing. It transforms categorical text data (like "Red", "Green", "Blue") into numerical vectors (like [1, 0, 0], [0, 1, 0]) that algorithms can process.
In this guide, you will learn how to manually implement a robust one-hot encoding function in Python without relying on external data science libraries.
Understanding One-Hot Encoding
Machine learning models typically require numerical input. If your dataset contains categorical labels, you cannot simply assign them random numbers (e.g., Cat=1, Dog=2, Bird=3) because the model might misinterpret the mathematical relationship (assuming Dog is "twice" the value of Cat).
One-Hot Encoding solves this by creating a binary vector for each category.
- Categories: ["Python", "Java", "C++"]
- "Python" becomes [1, 0, 0]
- "Java" becomes [0, 1, 0]
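The mapping above can be sketched in a few lines. This is a minimal illustration, assuming a fixed class list:

```python
classes = ["Python", "Java", "C++"]

def one_hot(label: str) -> list:
    # Build a binary vector: 1 at the label's position, 0 everywhere else
    return [1 if c == label else 0 for c in classes]

print(one_hot("Python"))  # [1, 0, 0]
print(one_hot("Java"))    # [0, 1, 0]
```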
Step 1: Implementing the Encoding Function
We will create a function that takes the universe of possible labels and the specific samples to encode, returning a list of binary vectors.
Logic:
- Initialize a zero vector with a length equal to the number of unique classes.
- Find the index of the current sample's label in the class list.
- Set that specific index to 1.
from typing import List

def label_process(labels: List[str], sample_y: List[str]) -> List[List[int]]:
    """
    Manually encodes a list of labels into one-hot vectors.
    """
    train_y = []
    for y in sample_y:
        # 1. Create a zero vector of the correct length
        #    E.g., if we have 3 classes, the vector is [0, 0, 0]
        vector = [0] * len(labels)

        # 2. Find the label's index and set that position to 1
        #    If y is "Java" and labels are ["Python", "Java"], the index is 1
        try:
            target_index = labels.index(y)
            vector[target_index] = 1
        except ValueError:
            # Unknown labels are kept as all-zero vectors
            print(f"Warning: Label '{y}' not found in known classes.")

        train_y.append(vector)
    return train_y
This manual implementation helps you understand the underlying logic. In production environments with large datasets, libraries like pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder are optimized for performance.
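One performance note on the pure-Python version: labels.index(y) scans the list on every sample, which is O(n) per lookup. A common optimization is to precompute a label-to-index dictionary once. The variant below (label_process_fast is a hypothetical name, not from the original) is a sketch of that idea:

```python
from typing import Dict, List

def label_process_fast(labels: List[str], sample_y: List[str]) -> List[List[int]]:
    # Precompute label -> index once; dict lookups are O(1) on average
    index: Dict[str, int] = {label: i for i, label in enumerate(labels)}
    train_y = []
    for y in sample_y:
        vector = [0] * len(labels)
        i = index.get(y)  # None for unknown labels
        if i is not None:
            vector[i] = 1
        train_y.append(vector)
    return train_y

print(label_process_fast(["a", "b"], ["b", "a", "c"]))  # [[0, 1], [1, 0], [0, 0]]
```

The behavior matches the original function, except that unknown labels are skipped silently instead of printing a warning.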
Step 2: Testing with Sample Data
Let's apply the function to a hypothetical classification task involving programming languages.
if __name__ == "__main__":
    # Define the universe of known classes (the vocabulary)
    unique_classes = ["Python", "Java", "Tensorflow", "Springboot", "Keras"]

    # Define the raw data samples to encode
    raw_data = [
        "Python",
        "Python",
        "Java",
        "Keras",
        "UnknownLib",  # Testing the unknown-label edge case
    ]

    print(f"Classes: {unique_classes}")

    # Perform the encoding
    encoded_data = label_process(unique_classes, raw_data)

    print("\nOne-Hot Encoded Output:")
    for raw, vector in zip(raw_data, encoded_data):
        print(f"'{raw}': \t{vector}")
Execution Output:
Classes: ['Python', 'Java', 'Tensorflow', 'Springboot', 'Keras']
Warning: Label 'UnknownLib' not found in known classes.
One-Hot Encoded Output:
'Python': [1, 0, 0, 0, 0]
'Python': [1, 0, 0, 0, 0]
'Java': [0, 1, 0, 0, 0]
'Keras': [0, 0, 0, 0, 1]
'UnknownLib': [0, 0, 0, 0, 0]
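Going the other way, recovering the label from a vector, is just a matter of finding the position of the 1. A small sketch (decode is an illustrative helper, not part of the original code):

```python
from typing import List, Optional

def decode(vector: List[int], labels: List[str]) -> Optional[str]:
    # An all-zero vector (unknown label) has no corresponding class
    if 1 not in vector:
        return None
    return labels[vector.index(1)]

classes = ["Python", "Java", "Tensorflow", "Springboot", "Keras"]
print(decode([0, 1, 0, 0, 0], classes))  # Java
print(decode([0, 0, 0, 0, 0], classes))  # None
```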
Conclusion
By implementing one-hot encoding manually, you master the concept of mapping categorical strings to sparse numerical vectors.
- Define your list of unique classes.
- Iterate through your data samples.
- Map the string to its index and toggle that bit to 1.