
Python Pandas: How to Count Distinct Values in Pandas with GroupBy

Counting unique occurrences within groups is a fundamental operation in business analytics and data exploration. Questions like "how many unique customers purchased each day?", "how many distinct products were sold per region?", or "how many different error types occurred per server?" all require counting distinct values within grouped data. Pandas provides the nunique() method specifically for this purpose, and it integrates seamlessly with groupby() for grouped distinct counts.

In this guide, you will learn how to count distinct values within groups, combine distinct counts with other aggregations, handle missing values, and produce clean output suitable for reporting.

Basic Distinct Count with nunique()

The simplest case is counting unique values in one column after grouping by another. The nunique() method returns the number of distinct values within each group:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie']
})

# Count unique users per date
unique_users = df.groupby('Date')['User'].nunique()

print(unique_users)

Output:

Date
2024-01-01    2
2024-01-02    1
Name: User, dtype: int64

On January 1st, Alice appears twice and Bob once, but there are only 2 unique users. On January 2nd, Charlie appears twice but counts as only 1 unique user. The nunique() method automatically handles the deduplication.
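For intuition, the same distinct count can be reproduced without nunique() by first dropping duplicate (Date, User) pairs and then sizing each group. This is just an equivalent sketch using standard pandas methods, not part of the recommended pattern:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie']
})

# Drop duplicate (Date, User) pairs, then count the remaining rows per date
unique_users = df.drop_duplicates(['Date', 'User']).groupby('Date').size()
print(unique_users)
```

In practice nunique() is shorter and clearer, but the drop_duplicates() version makes the deduplication step explicit.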

Combining Distinct Counts with Other Aggregations

In practice, you often need distinct counts alongside other metrics like totals or averages. The .agg() method lets you apply different aggregation functions to different columns in a single call:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Sales': [100, 150, 200, 75]
})

result = df.groupby('Date').agg({
    'Sales': 'sum',
    'User': 'nunique'
})

print(result)

Output:

            Sales  User
Date
2024-01-01    250     2
2024-01-02    275     1

This tells you that January 1st had $250 in total sales from 2 unique customers, while January 2nd had $275 from just 1 customer (Alice, who made two purchases).
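Once you have this aggregated frame, it is common to rank groups by one of the metrics. As a follow-up sketch (the sort_values() call is standard pandas but not part of the example above), you can surface the dates with the most unique customers:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Sales': [100, 150, 200, 75]
})

result = df.groupby('Date').agg({
    'Sales': 'sum',
    'User': 'nunique'
})

# Rank dates by distinct-user count, busiest first
top_days = result.sort_values('User', ascending=False)
print(top_days)
```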

Named Aggregation for Clean Output

The dictionary-based .agg() approach reuses the original column names, which can be unclear in reports. Named aggregation produces descriptive column names directly:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'User': ['Alice', 'Bob', 'Alice'],
    'Sales': [100, 150, 200]
})

result = df.groupby('Date').agg(
    total_revenue=('Sales', 'sum'),
    unique_customers=('User', 'nunique'),
    transactions=('Sales', 'count')
).reset_index()

print(result)

Output:

         Date  total_revenue  unique_customers  transactions
0  2024-01-01            250                 2             2
1  2024-01-02            200                 1             1

Each column name clearly describes what it contains, making the output self-documenting and ready for reporting or dashboards.

tip

Named aggregation uses the syntax output_name=('source_column', 'function'). This is the recommended approach for production code because it eliminates ambiguity and produces results that are immediately understandable without additional context.

Grouping by Multiple Columns

You can group by two or more columns to get distinct counts at a finer granularity:

import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})

result = df.groupby(['Region', 'Product'])['Customer'].nunique()

print(result)

Output:

Region  Product
East    A          2
West    B          1
Name: Customer, dtype: int64

Product A in the East region has 2 unique customers (C1 and C2), while Product B in the West region has only 1 (C1 appeared twice but is counted once).
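If you would rather see the multi-column result as a matrix than as a MultiIndex Series, unstack() (standard pandas, though not used above) pivots the inner grouping level into columns. A quick sketch with the same data, filling missing Region/Product combinations with 0:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})

# Pivot the Product level into columns; absent combinations become 0
matrix = df.groupby(['Region', 'Product'])['Customer'].nunique().unstack(fill_value=0)
print(matrix)
```

This layout is handy for heatmaps or cross-tab style reports where every Region/Product cell should be present.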

Flattening the Result for Export

The MultiIndex Series from the previous example is awkward to export to CSV or join with other tables. Using .agg() with reset_index() produces a flat DataFrame with one row per group:

import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Customer': ['C1', 'C2', 'C1', 'C1', 'C1']
})

result_df = df.groupby(['Region', 'Product']).agg(
    unique_customers=('Customer', 'nunique')
).reset_index()

print(result_df)

Output:

  Region Product  unique_customers
0   East       A                 2
1   West       B                 1

Handling NaN Values

By default, nunique() excludes NaN from the count. If you need to treat missing values as a distinct category, use dropna=False:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B'],
    'Value': ['X', 'Y', np.nan, 'X', np.nan]
})

# Default: NaN is excluded
print("Excluding NaN:")
print(df.groupby('Group')['Value'].nunique())
print()

# Include NaN as a distinct value
print("Including NaN:")
print(df.groupby('Group')['Value'].nunique(dropna=False))

Output:

Excluding NaN:
Group
A    2
B    1
Name: Value, dtype: int64

Including NaN:
Group
A    3
B    2
Name: Value, dtype: int64

With the default behavior, Group A has 2 unique values (X, Y) and Group B has 1 (X). With dropna=False, NaN is counted as an additional distinct value, giving Group A 3 (X, Y, NaN) and Group B 2 (X, NaN).

info

In most analytical contexts, the default behavior of excluding NaN is what you want. Use dropna=False only when the absence of a value is itself meaningful, such as when tracking how many responses were left blank in a survey.
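A related option, sketched below, is to label missing values explicitly with fillna() before counting (the '(blank)' label is an arbitrary choice for this example). This keeps the default nunique() behavior while making the missing category visible in downstream reports:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B'],
    'Value': ['X', 'Y', np.nan, 'X', np.nan]
})

# Replace NaN with an explicit '(blank)' label, then count distinct values
counts = df.assign(Value=df['Value'].fillna('(blank)')).groupby('Group')['Value'].nunique()
print(counts)
```

The counts match nunique(dropna=False), but the '(blank)' label also survives into any value-level breakdowns you build later.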

Practical Example: Daily Sales Dashboard

Here is a more complete example that combines several techniques into a typical reporting workflow:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice'],
    'Product': ['Laptop', 'Phone', 'Phone', 'Laptop', 'Tablet'],
    'Amount': [1200, 800, 400, 1100, 600]
})

dashboard = df.groupby('Date').agg(
    total_revenue=('Amount', 'sum'),
    avg_order=('Amount', 'mean'),
    order_count=('Amount', 'count'),
    unique_customers=('Customer', 'nunique'),
    unique_products=('Product', 'nunique')
).reset_index()

print(dashboard)

Output:

         Date  total_revenue  ...  unique_customers  unique_products
0  2024-01-01           2400  ...                 2                2
1  2024-01-02           1700  ...                 2                2

[2 rows x 6 columns]

This single aggregation call produces a complete daily summary with revenue totals, average order size, transaction counts, and distinct counts for both customers and products.
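Distinct counts also combine naturally with other columns to form derived metrics. As a sketch (the revenue_per_customer column is an illustrative name, not part of the example above), you can divide total revenue by the distinct customer count to get average revenue per customer:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice'],
    'Amount': [1200, 800, 400, 1100, 600]
})

dashboard = df.groupby('Date').agg(
    total_revenue=('Amount', 'sum'),
    unique_customers=('Customer', 'nunique')
).reset_index()

# Derived metric: average revenue per distinct customer
dashboard['revenue_per_customer'] = (
    dashboard['total_revenue'] / dashboard['unique_customers']
)
print(dashboard)
```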

Quick Reference

Goal                        Syntax
Distinct count per group    df.groupby('A')['B'].nunique()
Multiple metrics            df.groupby('A').agg({'B': 'sum', 'C': 'nunique'})
Named output columns        df.groupby('A').agg(name=('B', 'nunique'))
Multiple grouping columns   df.groupby(['A', 'B'])['C'].nunique()
Include NaN in count        .nunique(dropna=False)

  • Use nunique() to count distinct values within groups.
  • For professional reporting, prefer named aggregation with .agg() to produce clear, self-documenting column names.
  • Remember that nunique() excludes NaN by default.
  • Use dropna=False only when missing values should be counted as a distinct category.