Python Pandas: How to Count Distinct Values with GroupBy
Counting unique occurrences within groups is a fundamental operation in business analytics and data exploration. Questions like "how many unique customers purchased each day?", "how many distinct products were sold per region?", or "how many different error types occurred per server?" all require counting distinct values within grouped data. Pandas provides the nunique() method specifically for this purpose, and it integrates seamlessly with groupby() for grouped distinct counts.
In this guide, you will learn how to count distinct values within groups, combine distinct counts with other aggregations, handle missing values, and produce clean output suitable for reporting.
Basic Distinct Count with nunique()
The simplest case is counting unique values in one column after grouping by another. The nunique() method returns the number of distinct values within each group:
import pandas as pd
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie']
})
# Count unique users per date
unique_users = df.groupby('Date')['User'].nunique()
print(unique_users)
Output:
Date
2024-01-01    2
2024-01-02    1
Name: User, dtype: int64
On January 1st, Alice appears twice and Bob once, but there are only 2 unique users. On January 2nd, Charlie appears twice but counts as only 1 unique user. The nunique() method automatically handles the deduplication.
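An equivalent approach, and a useful sanity check, is to drop duplicate (Date, User) pairs first and then count rows per date. A sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie']
})

# Drop duplicate (Date, User) pairs, then count the remaining rows per date
unique_users = df.drop_duplicates(subset=['Date', 'User']).groupby('Date').size()
print(unique_users)
```

This produces the same counts as `nunique()`; the `drop_duplicates` form generalizes naturally when uniqueness is defined over several columns at once.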
Combining Distinct Counts with Other Aggregations
In practice, you often need distinct counts alongside other metrics like totals or averages. The .agg() method lets you apply different aggregation functions to different columns in a single call:
import pandas as pd
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'User': ['Alice', 'Bob', 'Alice', 'Alice'],
'Sales': [100, 150, 200, 75]
})
result = df.groupby('Date').agg({
'Sales': 'sum',
'User': 'nunique'
})
print(result)
Output:
            Sales  User
Date
2024-01-01    250     2
2024-01-02    275     1
This tells you that January 1st had $250 in total sales from 2 unique customers, while January 2nd had $275 from just 1 customer (Alice, who made two purchases).
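Sometimes you need the distinct count attached to every original row rather than collapsed into a summary, for example to flag dates that had more than one customer. `transform('nunique')` broadcasts each group's count back to the row level; a sketch on the same data (the `users_that_day` column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Sales': [100, 150, 200, 75]
})

# Broadcast each date's distinct-user count onto every row of that date
df['users_that_day'] = df.groupby('Date')['User'].transform('nunique')
print(df)
```

Every row now carries its group's distinct count, so you can filter with ordinary boolean indexing such as `df[df['users_that_day'] > 1]`.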
Named Aggregation for Clean Output
The dictionary-based .agg() approach reuses the original column names, which can be unclear in reports. Named aggregation produces descriptive column names directly:
import pandas as pd
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
'User': ['Alice', 'Bob', 'Alice'],
'Sales': [100, 150, 200]
})
result = df.groupby('Date').agg(
total_revenue=('Sales', 'sum'),
unique_customers=('User', 'nunique'),
transactions=('Sales', 'count')
).reset_index()
print(result)
         Date  total_revenue  unique_customers  transactions
0  2024-01-01            250                 2             2
1  2024-01-02            200                 1             1
Each column name clearly describes what it contains, making the output self-documenting and ready for reporting or dashboards.
Named aggregation uses the syntax output_name=('source_column', 'function'). This is the recommended approach for production code because it eliminates ambiguity and produces results that are immediately understandable without additional context.
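The keyword form is shorthand for `pd.NamedAgg`. Writing it out makes the column/function pairing explicit, which some teams prefer in shared code; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice'],
    'Sales': [100, 150, 200]
})

# Explicit pd.NamedAgg objects; equivalent to the tuple shorthand above
result = df.groupby('Date').agg(
    total_revenue=pd.NamedAgg(column='Sales', aggfunc='sum'),
    unique_customers=pd.NamedAgg(column='User', aggfunc='nunique')
).reset_index()
print(result)
```

Both spellings produce identical output; the tuple form is shorter, while `pd.NamedAgg` spells out which argument is the source column and which is the function.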
Grouping by Multiple Columns
You can group by two or more columns to get distinct counts at a finer granularity:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'East', 'West', 'West'],
'Product': ['A', 'A', 'A', 'B', 'B'],
'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})
result = df.groupby(['Region', 'Product'])['Customer'].nunique()
print(result)
Output:
Region  Product
East    A          2
West    B          1
Name: Customer, dtype: int64
Product A in the East region has 2 unique customers (C1 and C2), while Product B in the West region has only 1 (C1 appeared twice but is counted once).
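Because the grouped result carries a MultiIndex, you can pivot one level into columns with `unstack()` to get a Region-by-Product matrix of distinct customer counts. A sketch; `fill_value=0` fills combinations that never occur in the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})

# Pivot the Product level into columns; absent combinations become 0
matrix = df.groupby(['Region', 'Product'])['Customer'].nunique().unstack(fill_value=0)
print(matrix)
```

The matrix layout is often easier to scan than the stacked Series when both grouping columns have several values.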
Flattening the Result for Export
The MultiIndex output above is compact for display but awkward for CSV export or BI tools. Using .agg() with reset_index() produces a flat table instead:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'East', 'West', 'West'],
'Product': ['A', 'A', 'A', 'B', 'B'],
'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})
result_df = df.groupby(['Region', 'Product']).agg(
unique_customers=('Customer', 'nunique')
).reset_index()
print(result_df)
Output:
  Region Product  unique_customers
0   East       A                 2
1   West       B                 1
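A flat DataFrame like this can go straight to a file. A minimal sketch, assuming a local CSV is the target (the `region_product_counts.csv` filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})

result_df = df.groupby(['Region', 'Product']).agg(
    unique_customers=('Customer', 'nunique')
).reset_index()

# Write without the integer index; the filename here is illustrative
result_df.to_csv('region_product_counts.csv', index=False)
```

Passing `index=False` keeps the spurious integer index column out of the file, which is almost always what downstream tools expect.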
Handling NaN Values
By default, nunique() excludes NaN from the count. If you need to treat missing values as a distinct category, use dropna=False:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Group': ['A', 'A', 'A', 'B', 'B'],
'Value': ['X', 'Y', np.nan, 'X', np.nan]
})
# Default: NaN is excluded
print("Excluding NaN:")
print(df.groupby('Group')['Value'].nunique())
print()
# Include NaN as a distinct value
print("Including NaN:")
print(df.groupby('Group')['Value'].nunique(dropna=False))
Output:
Excluding NaN:
Group
A    2
B    1
Name: Value, dtype: int64

Including NaN:
Group
A    3
B    2
Name: Value, dtype: int64
With the default behavior, Group A has 2 unique values (X, Y) and Group B has 1 (X). With dropna=False, NaN is counted as an additional distinct value, giving Group A 3 (X, Y, NaN) and Group B 2 (X, NaN).
In most analytical contexts, the default behavior of excluding NaN is what you want. Use dropna=False only when the absence of a value is itself meaningful, such as when tracking how many responses were left blank in a survey.
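An alternative to `dropna=False` is to replace missing values with an explicit label before counting, which keeps the "missing" category visible in downstream reports. A sketch on the same data; the `'(missing)'` label is an arbitrary choice:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B'],
    'Value': ['X', 'Y', np.nan, 'X', np.nan]
})

# Replace NaN with an explicit label so it counts as a normal category
counts = (
    df.assign(Value=df['Value'].fillna('(missing)'))
      .groupby('Group')['Value']
      .nunique()
)
print(counts)
```

The counts match `nunique(dropna=False)`, but any later step that lists the distinct values (for example `unique()`) will now show the label instead of silently including NaN.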
Practical Example: Daily Sales Dashboard
Here is a more complete example that combines several techniques into a typical reporting workflow:
import pandas as pd
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'Customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice'],
'Product': ['Laptop', 'Phone', 'Phone', 'Laptop', 'Tablet'],
'Amount': [1200, 800, 400, 1100, 600]
})
dashboard = df.groupby('Date').agg(
total_revenue=('Amount', 'sum'),
avg_order=('Amount', 'mean'),
order_count=('Amount', 'count'),
unique_customers=('Customer', 'nunique'),
unique_products=('Product', 'nunique')
).reset_index()
print(dashboard)
Output:
         Date  total_revenue  ...  unique_customers  unique_products
0  2024-01-01           2400  ...                 2                2
1  2024-01-02           1700  ...                 2                2

[2 rows x 6 columns]
This single aggregation call produces a complete daily summary with revenue totals, average order size, transaction counts, and distinct counts for both customers and products.
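Once the dashboard frame exists, derived metrics are plain column arithmetic. For example, revenue per distinct customer; a sketch building on the same aggregation (the `revenue_per_customer` column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice'],
    'Product': ['Laptop', 'Phone', 'Phone', 'Laptop', 'Tablet'],
    'Amount': [1200, 800, 400, 1100, 600]
})

dashboard = df.groupby('Date').agg(
    total_revenue=('Amount', 'sum'),
    unique_customers=('Customer', 'nunique')
).reset_index()

# Derived metric: average revenue contributed by each distinct customer
dashboard['revenue_per_customer'] = (
    dashboard['total_revenue'] / dashboard['unique_customers']
)
print(dashboard)
```

Dividing one aggregated column by another like this is safe here because `nunique()` is always at least 1 for any group that exists.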
Quick Reference
| Goal | Syntax |
|---|---|
| Distinct count per group | df.groupby('A')['B'].nunique() |
| Multiple metrics | df.groupby('A').agg({'B': 'sum', 'C': 'nunique'}) |
| Named output columns | df.groupby('A').agg(name=('B', 'nunique')) |
| Multiple grouping columns | df.groupby(['A', 'B'])['C'].nunique() |
| Include NaN in count | .nunique(dropna=False) |
Key Takeaways
- Use `nunique()` to count distinct values within groups.
- For professional reporting, prefer named aggregation with `.agg()` to produce clear, self-documenting column names.
- Remember that `nunique()` excludes `NaN` by default.
- Use `dropna=False` only when missing values should be counted as a distinct category.