How to Implement One-Hot Encoding for Labels in Python

One-hot encoding is a fundamental technique in machine learning preprocessing. It transforms categorical text data (like "Red", "Green", "Blue") into numerical vectors (like [1, 0, 0], [0, 1, 0]) that algorithms can process.

In this guide, you will learn how to manually implement a robust one-hot encoding function in Python without relying on external data science libraries.

Understanding One-Hot Encoding

Machine learning models typically require numerical input. If your dataset contains categorical labels, you cannot simply assign them arbitrary integers (e.g., Cat=1, Dog=2, Bird=3), because the model would infer a spurious ordinal relationship from those values (treating Dog as "twice" Cat, or Bird as "greater than" Dog).

One-Hot Encoding solves this by creating a binary vector for each category; a short sketch of the mapping follows the list below.

  • Categories: ["Python", "Java", "C++"]
  • "Python" becomes [1, 0, 0]
  • "Java" becomes [0, 1, 0]
  • "C++" becomes [0, 0, 1]

Step 1: Implementing the Encoding Function

We will create a function that takes the universe of possible labels and the specific samples to encode, returning a list of binary vectors.

Logic:

  1. Initialize a zero-vector with a length equal to the number of unique classes.
  2. Find the index of the current sample's label in the class list.
  3. Set that specific index to 1.

from typing import List

def label_process(labels: List[str], sample_y: List[str]) -> List[List[int]]:
    """
    Manually encodes a list of labels into one-hot vectors.
    """
    train_y = []

    for y in sample_y:
        # 1. Create a zero vector of the correct length
        # E.g., if we have 3 classes, vector is [0, 0, 0]
        vector = [0] * len(labels)

        # 2. Find index and set to 1
        # If y is "Java" and labels are ["Python", "Java"], index is 1
        try:
            target_index = labels.index(y)
            vector[target_index] = 1
        except ValueError:
            # Handle unknown labels gracefully if necessary
            print(f"Warning: Label '{y}' not found in known classes.")

        train_y.append(vector)

    return train_y

Note: This manual implementation helps you understand the underlying logic. In production environments with large datasets, libraries like pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder are optimized for performance.
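
For comparison, here is a rough sketch of the same encoding using scikit-learn's OneHotEncoder. Setting handle_unknown="ignore" reproduces the all-zero rows for unseen labels; exact parameter defaults may differ between scikit-learn versions.

from sklearn.preprocessing import OneHotEncoder

unique_classes = ["Python", "Java", "Tensorflow", "Springboot", "Keras"]

# Fit on the known classes only; the encoder expects 2D input (one column per feature).
encoder = OneHotEncoder(categories=[unique_classes], handle_unknown="ignore")
encoder.fit([[c] for c in unique_classes])

# Unknown labels such as "UnknownLib" become all-zero rows.
encoded = encoder.transform([["Python"], ["Java"], ["UnknownLib"]]).toarray()
print(encoded)
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]]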

Step 2: Testing with Sample Data

Let's apply the function to a hypothetical classification task involving programming languages.

if __name__ == "__main__":
    # Define the universe of known classes (The Vocabulary)
    unique_classes = ["Python", "Java", "Tensorflow", "Springboot", "Keras"]

    # Define the raw data samples to encode
    raw_data = [
        "Python",
        "Python",
        "Java",
        "Keras",
        "UnknownLib"  # Testing edge case
    ]

    print(f"Classes: {unique_classes}")

    # ✅ Perform Encoding
    encoded_data = label_process(unique_classes, raw_data)

    print("\nOne-Hot Encoded Output:")
    for raw, vector in zip(raw_data, encoded_data):
        print(f"'{raw}': \t{vector}")

Execution Output:

Classes: ['Python', 'Java', 'Tensorflow', 'Springboot', 'Keras']
Warning: Label 'UnknownLib' not found in known classes.

One-Hot Encoded Output:
'Python': [1, 0, 0, 0, 0]
'Python': [1, 0, 0, 0, 0]
'Java': [0, 1, 0, 0, 0]
'Keras': [0, 0, 0, 0, 1]
'UnknownLib': [0, 0, 0, 0, 0]

Conclusion

By implementing one-hot encoding manually, you master the concept of mapping categorical strings to sparse numerical vectors. The recipe boils down to three steps (a condensed one-liner follows the list below).

  1. Define your list of unique classes.
  2. Iterate through your data samples.
  3. Map the string to its index and toggle that bit to 1.
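
For reference, the same three steps can be collapsed into a single comprehension. This one_hot helper is a hypothetical, condensed sketch of the logic above; note that it silently encodes unknown labels as all-zero vectors instead of printing a warning.

def one_hot(labels, sample_y):
    # 1 where the sample matches the class, 0 everywhere else.
    return [[1 if label == y else 0 for label in labels] for y in sample_y]

print(one_hot(["Python", "Java", "C++"], ["Java", "C++", "UnknownLib"]))
# [[0, 1, 0], [0, 0, 1], [0, 0, 0]]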