How to Create a Confusion Matrix in Python

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. In classification tasks, it compares the predicted labels against the actual labels (ground truth), breaking down results into true positives, true negatives, false positives, and false negatives.

This guide explains how to implement a confusion matrix function from scratch using Python and NumPy, processing raw probability predictions into a readable grid.

Understanding the Inputs

To build a matrix, we need three pieces of data:

  1. Labels: The names of the classes (e.g., ["Cat", "Dog", "Bird"]).
  2. Predictions (preds): A list of probability vectors. For example, [0.1, 0.8, 0.1] implies the model is 80% sure it's the second class ("Dog").
  3. Ground Truth: The actual correct labels for the data.

Our goal is to map the highest probability in the prediction to a row, and the actual label to a column, then increment the counter at that intersection.

Step 1: Setting Up the Environment

We rely on the numpy library to efficiently find the index of the maximum value in a prediction list (the class the model "chose").

pip install numpy

Start your file confusion_matrix.py with the necessary imports:

import numpy as np
from typing import List
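
Before writing the full function, here is a quick look at how the two lookups work for a single sample, using the example probability vector from above (the truth label here is a hypothetical value, chosen only for illustration):

import numpy as np

labels = ["Cat", "Dog", "Bird"]
pred_probs = [0.1, 0.8, 0.1]   # model output for one sample
truth_label = "Cat"            # hypothetical ground truth for that sample

pred_index = np.argmax(pred_probs)       # 1 -> the model "chose" "Dog" (row)
truth_index = labels.index(truth_label)  # 0 -> the actual class "Cat" (column)
# For this sample we would increment matrix[1][0].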

Step 2: Implementing the Function

The core logic requires iterating through every prediction, converting the probability distribution into a concrete class index, and matching it against the truth.

The Algorithm

def confusion_matrix(
    labels: List[str],
    preds: List[List[float]],
    ground_truth: List[str]
) -> List[List[int]]:
    """
    Computes the confusion matrix.
    Rows = Predicted Class
    Columns = True Class
    """
    # 1. Initialize an N x N matrix with zeros,
    #    where N is the number of unique labels
    num_classes = len(labels)
    matrix = [[0 for _ in range(num_classes)] for _ in range(num_classes)]

    # 2. Iterate through predictions and truth simultaneously
    for pred_probs, truth_label in zip(preds, ground_truth):

        # Find the index of the highest probability:
        # this is the class the model 'predicted'
        pred_index = np.argmax(pred_probs)

        # Find the index of the actual label:
        # this is where the sample 'should' have landed
        try:
            truth_index = labels.index(truth_label)
        except ValueError:
            continue  # Skip labels that are not in our definitions

        # 3. Increment the specific cell:
        #    matrix[Row (Predicted)][Column (Actual)]
        matrix[pred_index][truth_index] += 1

    return matrix
Note: In this implementation, rows represent the predicted classes and columns represent the true classes. Some libraries (such as Scikit-Learn) use the transposed layout (rows = true, columns = predicted). Always check the documentation or implementation details.
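
If you ever need to compare this output against a library that uses the other orientation, transposing the matrix converts between the two conventions. A minimal sketch (the helper name to_sklearn_orientation is ours):

import numpy as np

def to_sklearn_orientation(matrix):
    # Swap rows and columns: (predicted, true) -> (true, predicted)
    return np.array(matrix).T.tolist()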

Step 3: Testing with Example Data

To ensure our logic holds, we test with a small dataset where we can manually verify the results.

Test Scenario

  • Labels: Python (0), Java (1), C++ (2)
  • Data: 5 samples.
  • Logic: If the model predicts "Python" but the truth is "Java", we add 1 to Row 0, Column 1.
if __name__ == "__main__":
    # Define Classes
    labels = ["Python", "Java", "C++"]

    # Probabilities (Model Output)
    preds = [
        [0.66, 0.22, 0.11],  # Max is index 0 (Python)
        [0.34, 0.05, 0.60],  # Max is index 2 (C++)
        [0.47, 0.26, 0.26],  # Max is index 0 (Python)
        [0.76, 0.15, 0.08],  # Max is index 0 (Python)
        [0.05, 0.95, 0.00],  # Max is index 1 (Java)
    ]

    # Actual Values
    # 1. Truth: Python. Pred: Python. (Correct)
    # 2. Truth: C++.    Pred: C++.    (Correct)
    # 3. Truth: Java.   Pred: Python. (Miss)
    # 4. Truth: C++.    Pred: Python. (Miss)
    # 5. Truth: Java.   Pred: Java.   (Correct)
    ground_truth = ["Python", "C++", "Java", "C++", "Java"]

    # Calculate
    result = confusion_matrix(labels, preds, ground_truth)

    print("Confusion Matrix:")
    for row in result:
        print(row)

Output:

Confusion Matrix:
[1, 1, 1]
[0, 1, 0]
[0, 0, 1]
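
If the bare lists are hard to read, you can attach the class names when printing. This is a small formatting sketch of our own (it assumes labels and result from the test script above), not part of the core algorithm:

# Optional: label the rows and columns for readability
header = " " * 12 + "".join(f"{name:>8}" for name in labels)
print(header)
for name, row in zip(labels, result):
    print(f"Pred {name:<7}" + "".join(f"{count:>8}" for count in row))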

Interpreting the Output

  • Row 0 (Predicted Python): [1, 1, 1]
    • 1 instance was actually Python (Correct).
    • 1 instance was actually Java (Incorrect).
    • 1 instance was actually C++ (Incorrect).
    • Insight: The model is over-predicting "Python".
  • Row 1 (Predicted Java): [0, 1, 0]
    • 1 instance was actually Java (Correct).
  • Row 2 (Predicted C++): [0, 0, 1]
    • 1 instance was actually C++ (Correct).
Tip: A perfect model would produce a purely diagonal matrix (non-zero values only from the top-left to the bottom-right), meaning every prediction matched the truth.

Conclusion

By implementing the confusion matrix manually, you gain insight into how classification accuracy is derived.

  1. np.argmax converts probabilities to a hard class prediction.
  2. zip allows you to process predictions and truth in parallel.
  3. Matrix indexing (matrix[pred][truth]) accumulates the results for analysis.
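
As a final check, overall accuracy falls straight out of the matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total number of samples. A minimal sketch (the helper name accuracy_from_matrix is ours), applied to the result computed above:

from typing import List

def accuracy_from_matrix(matrix: List[List[int]]) -> float:
    # Correct predictions lie on the diagonal; divide by all predictions.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

print(accuracy_from_matrix(result))  # (1 + 1 + 1) / 5 = 0.6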