NumPy: How to Find Unique Rows in a NumPy Array
When working with 2D data in NumPy, whether it's tabular datasets, coordinate lists, or matrix operations, you'll often need to remove duplicate rows and keep only the unique ones. This is analogous to the "Remove Duplicates" feature in spreadsheet applications.
For example:
- Input: `[[1, 2], [3, 4], [1, 2], [5, 6]]`
- Output: `[[1, 2], [3, 4], [5, 6]]`
This guide covers several approaches to finding unique rows, from the simplest built-in method to more advanced techniques that offer additional control over performance and ordering.
Using np.unique() with axis=0 (Recommended)
The simplest and most readable way to find unique rows is np.unique() with axis=0. This tells NumPy to treat each entire row as a single unit when checking for duplicates.
```python
import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows = np.unique(arr, axis=0)
print(unique_rows)
```

Output:

```
[[1 2]
 [3 4]
 [5 6]]
```
np.unique() compares rows element by element, removes duplicates, and returns the remaining rows in sorted order.
Note: np.unique() sorts the output. The result is always sorted lexicographically, which means the original order of rows is not preserved. If maintaining insertion order matters, see the section on preserving original row order below.
Getting Additional Information
np.unique() can return useful metadata alongside the unique rows:
```python
import numpy as np

arr = np.array([
    [3, 4],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows, indices, inverse, counts = np.unique(
    arr, axis=0, return_index=True, return_inverse=True, return_counts=True
)

print("Unique rows:\n", unique_rows)
print("First occurrence indices:", indices)
print("Inverse mapping:", inverse)
print("Counts:", counts)
```

Output:

```
Unique rows:
 [[1 2]
 [3 4]
 [5 6]]
First occurrence indices: [1 0 4]
Inverse mapping: [1 0 1 0 2]
Counts: [2 2 1]
```
| Parameter | Description |
|---|---|
| `return_index=True` | Indices of the first occurrence of each unique row |
| `return_inverse=True` | Indices to reconstruct the original array from the unique rows |
| `return_counts=True` | How many times each unique row appears |
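As a concrete illustration, the inverse mapping can rebuild the original array from the unique rows, and the counts reveal which rows are duplicated. A small sketch reusing the array above (the `reshape(-1)` is a defensive touch for NumPy versions where the inverse carries an extra axis):

```python
import numpy as np

arr = np.array([
    [3, 4],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

unique_rows, inverse, counts = np.unique(
    arr, axis=0, return_inverse=True, return_counts=True
)

# Indexing the unique rows by the inverse mapping rebuilds the original array
reconstructed = unique_rows[inverse.reshape(-1)]
print(np.array_equal(reconstructed, arr))  # True

# Rows with a count greater than 1 appear more than once in the input
duplicated = unique_rows[counts > 1]
print(duplicated)  # rows [1 2] and [3 4]
```

This round-trip property (`unique_rows[inverse] == arr`) is what makes `return_inverse` useful for label encoding and grouping tasks.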
Preserving Original Row Order
Since np.unique() sorts results, you need an extra step if you want to keep the rows in their original order of first appearance.
```python
import numpy as np

arr = np.array([
    [5, 6],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

# Get unique rows and the index of each first occurrence
_, first_indices = np.unique(arr, axis=0, return_index=True)

# Sort the indices to restore original order
unique_rows = arr[np.sort(first_indices)]
print(unique_rows)
```

Output:

```
[[5 6]
 [1 2]
 [3 4]]
```
By sorting the first_indices, we retrieve unique rows in the same order they first appeared in the original array.
This pattern - using return_index=True and then sorting the indices - is a reliable way to get order-preserving unique rows with NumPy.
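The same pattern extends to other metadata. For example, a sketch that reports each unique row together with its frequency, both in first-appearance order:

```python
import numpy as np

arr = np.array([
    [5, 6],
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

# Unique rows (sorted), their first-occurrence indices, and their counts
uniq, first_indices, counts = np.unique(
    arr, axis=0, return_index=True, return_counts=True
)

# argsort on the first-occurrence indices gives first-appearance order;
# apply the same permutation to the counts so they stay aligned
order = np.argsort(first_indices)
unique_rows = arr[first_indices[order]]
row_counts = counts[order]

print(unique_rows)  # [5 6], [1 2], [3 4] in first-appearance order
print(row_counts)   # [2 2 1]
```

The key point is that any array returned alongside the unique rows can be reordered with the same `order` permutation, so all the metadata stays consistent.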
Using np.void View for Large Arrays
For performance-critical applications with large arrays, you can convert each row into a single byte block using np.void. This allows NumPy to compare rows as atomic units, which can be significantly faster.
```python
import numpy as np

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3],
    [7, 8, 9]
])

# Ensure the array is contiguous in memory (required for the view)
arr_contiguous = np.ascontiguousarray(arr)

# View each row as a single void (byte) element
row_view = arr_contiguous.view(
    np.dtype((np.void, arr.dtype.itemsize * arr.shape[1]))
)

# Find unique byte rows and their original indices
_, idx = np.unique(row_view, return_index=True)

# Retrieve unique rows in original order
unique_rows = arr[np.sort(idx)]
print(unique_rows)
```

Output:

```
[[1 2 3]
 [4 5 6]
 [7 8 9]]
```
How It Works
- `np.ascontiguousarray()` ensures the array's memory layout is contiguous (required for the view operation).
- `.view(np.void)` reinterprets each row's bytes as a single opaque element, making row comparisons extremely fast.
- `np.unique()` operates on these byte-level elements to find duplicates.
- Sorting the indices preserves the original row order.
This method is less readable than np.unique(axis=0) but can offer better performance on very large arrays (millions of rows). For most use cases, the simpler np.unique(axis=0) is sufficient.
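Whether the void view actually pays off depends on your data, NumPy version, and hardware, so it is worth measuring on your own workload. A rough benchmark sketch (the array size and value range here are arbitrary choices; timings vary by machine):

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
# A small value range forces many duplicate rows
arr = rng.integers(0, 10, size=(200_000, 3))

def unique_axis():
    return np.unique(arr, axis=0)

def unique_void():
    a = np.ascontiguousarray(arr)
    view = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(view, return_index=True)
    return arr[np.sort(idx)]

# Both should find the same set of rows; compare wall-clock time
print("np.unique(axis=0):", timeit.timeit(unique_axis, number=5))
print("np.void view:     ", timeit.timeit(unique_void, number=5))
```

If the two timings are close on your data, prefer `np.unique(axis=0)` for its readability.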
Using np.lexsort() with np.diff()
np.lexsort() sorts rows lexicographically, grouping duplicates together. Then np.diff() identifies where consecutive rows differ, effectively marking the boundaries between unique groups.
```python
import numpy as np

arr = np.array([
    [7, 8],
    [7, 9],
    [6, 8],
    [7, 8]
])

# Sort rows lexicographically
sorted_indices = np.lexsort(arr.T[::-1])
sorted_arr = arr[sorted_indices]

# Build a mask: True for the first row and wherever a row differs from the previous
mask = np.ones(len(sorted_arr), dtype=bool)
mask[1:] = np.any(np.diff(sorted_arr, axis=0), axis=1)

unique_rows = sorted_arr[mask]
print(unique_rows)
```

Output:

```
[[6 8]
 [7 8]
 [7 9]]
```
Step-by-Step Breakdown
- `np.lexsort(arr.T[::-1])` sorts by the first column, then the second, etc. The `[::-1]` reversal is needed because `lexsort` sorts by the last key first.
- `np.diff(sorted_arr, axis=0)` computes the difference between consecutive rows. Non-zero differences indicate a new unique row.
- `np.any(..., axis=1)` collapses each row of differences into a single boolean: `True` if any element differs.
This approach gives you more control over the sorting and filtering logic and is useful when you need to customize the deduplication process.
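As one example of that extra control, the same sorted array and mask also yield per-row counts almost for free (a sketch reusing the array above):

```python
import numpy as np

arr = np.array([
    [7, 8],
    [7, 9],
    [6, 8],
    [7, 8]
])

sorted_arr = arr[np.lexsort(arr.T[::-1])]

mask = np.ones(len(sorted_arr), dtype=bool)
mask[1:] = np.any(np.diff(sorted_arr, axis=0), axis=1)

unique_rows = sorted_arr[mask]

# Each True in the mask starts a group of identical rows; the gap
# between consecutive True positions is that group's size
boundaries = np.flatnonzero(mask)
counts = np.diff(np.append(boundaries, len(sorted_arr)))

print(unique_rows)  # [6 8], [7 8], [7 9]
print(counts)       # [1 2 1]
```

Because you own the mask, you can just as easily invert it to extract the duplicated rows instead of the unique ones.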
Using Python Sets (Simple but Limited)
For quick deduplication on small arrays, you can convert rows to tuples and use a Python set to remove duplicates:
```python
import numpy as np

arr = np.array([
    [0, 0],
    [1, 1],
    [0, 0],
    [2, 2]
])

unique_rows = np.array(list({tuple(row) for row in arr}))
print(unique_rows)
```

Output (order may vary between runs, since sets are unordered):

```
[[1 1]
 [2 2]
 [0 0]]
```
This approach has several limitations:

- Order is not preserved: sets are unordered in Python.
- Floating-point issues: sets use exact equality, so rows like `[0.1 + 0.2, 0.3]` and `[0.3, 0.3]` would be treated as different, because `0.1 + 0.2` evaluates to `0.30000000000000004`.
- Performance: converting to Python objects loses NumPy's vectorized speed advantage.
This method is best for small arrays or quick prototyping, not production code with large datasets.
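The floating-point caveat applies to the NumPy-based methods as well, since they also compare values exactly. One common workaround is to round rows to a fixed number of decimals before deduplicating (a sketch; the number of decimals is an arbitrary tolerance choice for your data):

```python
import numpy as np

arr = np.array([
    [0.1 + 0.2, 0.3],   # 0.1 + 0.2 is 0.30000000000000004 in binary floating point
    [0.3,       0.3],
    [1.0,       2.0]
])

# Exact comparison sees the first two rows as different
print(len(np.unique(arr, axis=0)))  # 3

# Rounding to 8 decimals before comparing treats them as equal
rounded = np.round(arr, decimals=8)
print(len(np.unique(rounded, axis=0)))  # 2
```

Note that this returns the rounded values, not the original rows; if you need the originals, combine the rounded array with `return_index=True` and index back into `arr`.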
Comparison of Approaches
| Method | Preserves Order | Performance | Readability | Best For |
|---|---|---|---|---|
| `np.unique(axis=0)` | ❌ (sorted) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Most use cases (recommended) |
| `np.unique()` + `np.sort(idx)` | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | When original order matters |
| `np.void` view | ✅ (with sorted idx) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Very large arrays |
| `np.lexsort()` + `np.diff()` | ❌ (sorted) | ⭐⭐⭐⭐ | ⭐⭐⭐ | Custom sorting/filtering logic |
| Python `set(tuple(...))` | ❌ | ⭐⭐ | ⭐⭐⭐ | Small arrays, quick prototyping |
Conclusion
For the vast majority of use cases, np.unique(arr, axis=0) is the best choice: it's clean, fast, and requires just one line of code.
- If you need to preserve the original row order, combine it with `return_index=True` and `np.sort()`.
- For performance-critical applications with very large arrays, the `np.void` view technique offers the fastest comparisons.
Choose the method that best balances readability, performance, and ordering requirements for your specific situation.