
Python NumPy: How to Find Unique Rows in a NumPy Array

When working with 2D data in NumPy, whether it's tabular datasets, coordinate lists, or matrix operations, you'll often need to remove duplicate rows and keep only the unique ones. This is analogous to the "Remove Duplicates" feature in spreadsheet applications.

For example:

  • Input: [[1, 2], [3, 4], [1, 2], [5, 6]]
  • Output: [[1, 2], [3, 4], [5, 6]]

This guide covers several approaches to finding unique rows, from the simplest built-in method to more advanced techniques that offer additional control over performance and ordering.

Using np.unique() with axis=0

The simplest and most readable way to find unique rows is np.unique() with axis=0. This tells NumPy to treat each entire row as a single unit when checking for duplicates.

import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows = np.unique(arr, axis=0)
print(unique_rows)

Output:

[[1 2]
 [3 4]
 [5 6]]

Note: np.unique() compares rows element by element, removes duplicates, and returns the remaining rows in sorted order.

np.unique() sorts the output: the result is always sorted lexicographically, which means the original order of rows is not preserved. If maintaining insertion order matters, see the section on preserving original row order below.

Getting Additional Information

np.unique() can return useful metadata alongside the unique rows:

import numpy as np

arr = np.array([
    [3, 4],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows, indices, inverse, counts = np.unique(
    arr, axis=0, return_index=True, return_inverse=True, return_counts=True
)

print("Unique rows:\n", unique_rows)
print("First occurrence indices:", indices)
print("Inverse mapping:", inverse)
print("Counts:", counts)

Output:

Unique rows:
 [[1 2]
 [3 4]
 [5 6]]
First occurrence indices: [1 0 4]
Inverse mapping: [1 0 1 0 2]
Counts: [2 2 1]
  • return_index=True: indices of the first occurrence of each unique row
  • return_inverse=True: indices to reconstruct the original array from the unique rows
  • return_counts=True: how many times each unique row appears
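The inverse mapping is particularly handy: indexing the unique rows with it reconstructs the original array, which is the basis of many label-encoding tricks. A quick sketch (the .ravel() call is a defensive touch, since the shape of the returned inverse has varied across NumPy versions when axis is used):

```python
import numpy as np

arr = np.array([
    [3, 4],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows, inverse = np.unique(arr, axis=0, return_inverse=True)

# Indexing the unique rows with the inverse mapping rebuilds the input;
# ravel() keeps this robust across NumPy versions with different inverse shapes
reconstructed = unique_rows[inverse.ravel()]
print(np.array_equal(reconstructed, arr))  # True
```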

Preserving Original Row Order

Since np.unique() sorts results, you need an extra step if you want to keep the rows in their original order of first appearance.

import numpy as np

arr = np.array([
    [5, 6],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

# Get unique rows and the index of each first occurrence
_, first_indices = np.unique(arr, axis=0, return_index=True)

# Sort the indices to restore original order
unique_rows = arr[np.sort(first_indices)]
print(unique_rows)

Output:

[[5 6]
 [1 2]
 [3 4]]

By sorting the first_indices, we retrieve unique rows in the same order they first appeared in the original array.

Tip: This pattern - using return_index=True and then sorting the indices - is a reliable way to get order-preserving unique rows with NumPy.
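If you use this pattern often, it can be wrapped in a small helper function (a hypothetical convenience wrapper, not part of NumPy itself):

```python
import numpy as np

def unique_rows_ordered(arr):
    """Return the unique rows of a 2D array in order of first appearance."""
    _, first_indices = np.unique(arr, axis=0, return_index=True)
    return arr[np.sort(first_indices)]

arr = np.array([[5, 6], [1, 2], [3, 4], [1, 2], [5, 6]])
print(unique_rows_ordered(arr))
# [[5 6]
#  [1 2]
#  [3 4]]
```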

Using np.void View for Large Arrays

For performance-critical applications with large arrays, you can convert each row into a single byte block using np.void. This allows NumPy to compare rows as atomic units, which can be significantly faster.

import numpy as np

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3],
    [7, 8, 9]
])

# Ensure the array is contiguous in memory
arr_contiguous = np.ascontiguousarray(arr)

# View each row as a single void (byte) element
row_view = arr_contiguous.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))

# Find unique byte rows and their original indices
_, idx = np.unique(row_view, return_index=True)

# Retrieve unique rows in original order
unique_rows = arr[np.sort(idx)]
print(unique_rows)

Output:

[[1 2 3]
 [4 5 6]
 [7 8 9]]

How It Works

  1. np.ascontiguousarray() ensures the array's memory layout is contiguous (required for the view operation).
  2. .view(np.void) reinterprets each row's bytes as a single opaque element, making row comparisons extremely fast.
  3. np.unique() operates on these byte-level elements to find duplicates.
  4. Sorting the indices preserves the original row order.
Note: This method is less readable than np.unique(axis=0) but can offer better performance on very large arrays (millions of rows). For most use cases, the simpler np.unique(axis=0) is sufficient.
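If you want to verify the trade-off on your own data, a rough benchmark can compare the two approaches side by side (a sketch; the array size, value range, and timings here are illustrative, and results will vary by machine and NumPy version):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(100_000, 3))  # many duplicate rows

def unique_axis0():
    return np.unique(arr, axis=0)

def unique_void():
    contiguous = np.ascontiguousarray(arr)
    view = contiguous.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))
    _, idx = np.unique(view, return_index=True)
    return arr[np.sort(idx)]

# Both methods must find the same number of unique rows (order differs)
assert len(unique_axis0()) == len(unique_void())

print("axis=0:   ", timeit.timeit(unique_axis0, number=10))
print("void view:", timeit.timeit(unique_void, number=10))
```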

Using np.lexsort() with np.diff()

np.lexsort() sorts rows lexicographically, grouping duplicates together. Then np.diff() identifies where consecutive rows differ, effectively marking the boundaries between unique groups.

import numpy as np

arr = np.array([
    [7, 8],
    [7, 9],
    [6, 8],
    [7, 8]
])

# Sort rows lexicographically
sorted_indices = np.lexsort(arr.T[::-1])
sorted_arr = arr[sorted_indices]

# Build a mask: True for the first row and wherever a row differs from the previous
mask = np.ones(len(sorted_arr), dtype=bool)
mask[1:] = np.any(np.diff(sorted_arr, axis=0), axis=1)

unique_rows = sorted_arr[mask]
print(unique_rows)

Output:

[[6 8]
 [7 8]
 [7 9]]

Step-by-Step Breakdown

  1. np.lexsort(arr.T[::-1]) - sorts by the first column, then the second, etc. The [::-1] reversal is needed because lexsort sorts by the last key first.
  2. np.diff(sorted_arr, axis=0) - computes the difference between consecutive rows. Non-zero differences indicate a new unique row.
  3. np.any(..., axis=1) - collapses each row of differences into a single boolean: True if any element differs.

This approach gives you more control over the sorting and filtering logic and is useful when you need to customize the deduplication process.
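As one example of that extra control, the same boundary mask can also yield per-group counts without a second pass over the data (a small extension of the snippet above, sketched here):

```python
import numpy as np

arr = np.array([[7, 8], [7, 9], [6, 8], [7, 8]])

# Sort rows lexicographically, then mark where each new group starts
sorted_arr = arr[np.lexsort(arr.T[::-1])]
mask = np.ones(len(sorted_arr), dtype=bool)
mask[1:] = np.any(np.diff(sorted_arr, axis=0), axis=1)

# Group starts are where the mask is True; append a sentinel end boundary,
# then the gaps between boundaries are the group sizes
boundaries = np.flatnonzero(np.append(mask, True))
counts = np.diff(boundaries)

print(sorted_arr[mask])  # unique rows: [[6 8], [7 8], [7 9]]
print(counts)            # occurrences of each: [1 2 1]
```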

Using Python Sets (Simple but Limited)

For quick deduplication on small arrays, you can convert rows to tuples and use a Python set to remove duplicates:

import numpy as np

arr = np.array([
    [0, 0],
    [1, 1],
    [0, 0],
    [2, 2]
])

unique_rows = np.array(list({tuple(row) for row in arr}))
print(unique_rows)

Output (the exact row order may vary, since sets don't preserve order):

[[1 1]
 [2 2]
 [0 0]]
Limitations of the set approach

  • Order is not preserved: sets are unordered in Python.
  • Floating-point issues: sets use exact equality, so rows like [0.1 + 0.2, 0.3] and [0.3, 0.3] are treated as different, because 0.1 + 0.2 evaluates to 0.30000000000000004, not 0.3.
  • Performance: converting to Python objects loses NumPy's vectorized speed advantage.

This method is best for small arrays or quick prototyping, not production code with large datasets.
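If floating-point noise is the problem, one common workaround (not specific to the set approach, and sketched here as an assumption about what precision is acceptable) is to round to a chosen number of decimals before deduplicating:

```python
import numpy as np

arr = np.array([
    [0.1 + 0.2, 0.3],   # first column is 0.30000000000000004
    [0.3,       0.3],
    [1.0,       2.0]
])

# Exact comparison sees the first two rows as different...
print(len(np.unique(arr, axis=0)))  # 3

# ...but rounding to a chosen precision merges them
rounded = np.round(arr, decimals=8)
print(len(np.unique(rounded, axis=0)))  # 2
```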

Comparison of Approaches

| Method | Preserves Order | Performance | Readability | Best For |
| --- | --- | --- | --- | --- |
| np.unique(axis=0) | ❌ (sorted) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Most use cases (recommended) |
| np.unique() + np.sort(idx) | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | When original order matters |
| np.void view | ✅ (with sorted idx) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Very large arrays |
| np.lexsort() + np.diff() | ❌ (sorted) | ⭐⭐⭐⭐ | ⭐⭐⭐ | Custom sorting/filtering logic |
| Python set(tuple(...)) | ❌ | ⭐⭐ | ⭐⭐⭐ | Small arrays, quick prototyping |

Conclusion

For the vast majority of use cases, np.unique(arr, axis=0) is the best choice: it's clean, fast, and requires just one line of code.

  • If you need to preserve the original row order, combine it with return_index=True and np.sort().
  • For performance-critical applications with very large arrays, the np.void view technique offers the fastest comparisons.

Choose the method that best balances readability, performance, and ordering requirements for your specific situation.