Python Pandas: How to Count Duplicates in a Pandas DataFrame
Identifying and counting duplicate entries is a critical step in data cleaning and analysis. Duplicates can skew your results, inflate counts, and lead to incorrect conclusions. Whether you're auditing data quality, profiling a dataset, or preparing data for modeling, knowing how to count duplicates efficiently is essential.
In this guide, you'll learn multiple methods to count duplicates in a pandas DataFrame - across a single column, several columns, or entire rows - with clear examples and practical use cases.
Setting Up the Example DataFrame
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
print(df)
Output:
Name Course City
0 Alice Python London
1 Bob Java Paris
2 Charlie Python London
3 Alice Python London
4 Bob Java Paris
5 Alice SQL Berlin
Counting Duplicates in a Single Column
Using value_counts()
The most straightforward way to count how many times each value appears in a column:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
name_counts = df['Name'].value_counts()
print(name_counts)
Output:
Name
Alice 3
Bob 2
Charlie 1
Name: count, dtype: int64
To see only values that appear more than once (actual duplicates):
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
name_counts = df['Name'].value_counts()
duplicates_only = name_counts[name_counts > 1]
print("Duplicated values:")
print(duplicates_only)
Output:
Duplicated values:
Name
Alice 3
Bob 2
Name: count, dtype: int64
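The same value_counts() result can also answer two common follow-up questions: how many distinct values are duplicated, and how many redundant rows those duplicates add. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

name_counts = df['Name'].value_counts()

# Number of distinct values that occur more than once
num_duplicated_values = (name_counts > 1).sum()

# Number of "extra" rows beyond the first occurrence of each value
extra_rows = (name_counts[name_counts > 1] - 1).sum()

print(f"Distinct duplicated values: {num_duplicated_values}")
print(f"Extra (redundant) rows: {extra_rows}")
```

Note that extra_rows matches what df.duplicated(subset=['Name']).sum() returns, since both count every occurrence after the first.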
Using pivot_table()
The pivot_table() function with aggfunc='size' counts occurrences for each unique value:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
course_counts = df.pivot_table(index='Course', aggfunc='size')
print(course_counts)
Output:
Course
Java 2
Python 3
SQL 1
dtype: int64
Using groupby() with size()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
city_counts = df.groupby('City').size()
print(city_counts)
Output:
City
Berlin 1
London 3
Paris 2
dtype: int64
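A related trick is transform('size'), which broadcasts each group's count back onto every row. This is handy when you want to keep the original rows and filter or sort by how often each value occurs:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Attach each city's occurrence count to every row as a new column
df['CityCount'] = df.groupby('City')['City'].transform('size')
print(df)
```

Unlike groupby().size(), which returns one row per group, transform keeps the DataFrame's original shape.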
Counting Duplicates Across Multiple Columns
To count duplicates based on combinations of column values, pass multiple columns:
Using pivot_table()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
combo_counts = df.pivot_table(index=['Name', 'Course'], aggfunc='size')
print(combo_counts)
Output:
Name Course
Alice Python 2
SQL 1
Bob Java 2
Charlie Python 1
dtype: int64
The combination ('Alice', 'Python') appears twice and ('Bob', 'Java') appears twice.
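Since pandas 1.1, value_counts() also works directly on a DataFrame, so the same combination counts can be read off a column subset in one call:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Count each (Name, Course) combination; result is sorted by count, descending
combo_counts = df[['Name', 'Course']].value_counts()
print(combo_counts)
```

The result is a Series indexed by the (Name, Course) pairs, sorted with the most frequent combinations first.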
Using groupby() with size()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
combo_counts = df.groupby(['Name', 'City']).size().reset_index(name='Count')
print(combo_counts)
Output:
Name City Count
0 Alice Berlin 1
1 Alice London 2
2 Bob Paris 2
3 Charlie London 1
This clearly shows that Alice in London and Bob in Paris each appear twice.
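To restrict that table to true duplicates, filter on the Count column:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

combo_counts = df.groupby(['Name', 'City']).size().reset_index(name='Count')

# Keep only combinations that occur more than once
only_dups = combo_counts[combo_counts['Count'] > 1]
print(only_dups)
```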
Counting Total Duplicate Rows
Using duplicated()
The duplicated() method marks rows as True if they are duplicates of a previous row. Summing gives the total count:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
# Count fully duplicate rows (all columns match)
total_duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {total_duplicates}")
# Show which rows are duplicates
print("\nDuplicate flags:")
print(df.duplicated())
Output:
Total duplicate rows: 2
Duplicate flags:
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
Rows 3 and 4 are duplicates of rows 0 and 1 respectively (when considering all columns).
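As a sanity check, the same total falls out of drop_duplicates(): the number of rows removed by deduplication equals the number of rows flagged by duplicated():

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Rows removed by deduplication; equals df.duplicated().sum()
n_dups = len(df) - len(df.drop_duplicates())
print(f"Duplicate rows: {n_dups}")
```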
Viewing the Actual Duplicate Rows
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
duplicate_rows = df[df.duplicated(keep=False)]
print("All rows involved in duplication:")
print(duplicate_rows)
Output:
All rows involved in duplication:
Name Course City
0 Alice Python London
1 Bob Java Paris
3 Alice Python London
4 Bob Java Paris
keep Parameter Options
- keep='first' (default): Marks all duplicates except the first occurrence.
- keep='last': Marks all duplicates except the last occurrence.
- keep=False: Marks all duplicate rows, including the first occurrence.
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
print(f"Duplicates (keep first): {df.duplicated(keep='first').sum()}")
print(f"Duplicates (keep last): {df.duplicated(keep='last').sum()}")
print(f"All duplicated rows: {df.duplicated(keep=False).sum()}")
Output:
Duplicates (keep first): 2
Duplicates (keep last): 2
All duplicated rows: 4
Counting Duplicates Based on Specific Columns
Check for duplicates considering only certain columns with the subset parameter:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
# Duplicates based on 'Name' column only
name_dups = df.duplicated(subset=['Name']).sum()
print(f"Duplicate names: {name_dups}")
# Duplicates based on 'Name' and 'Course' combination
combo_dups = df.duplicated(subset=['Name', 'Course']).sum()
print(f"Duplicate name-course combos: {combo_dups}")
Output:
Duplicate names: 3
Duplicate name-course combos: 2
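To inspect the underlying rows rather than just count them, combine subset with keep=False:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# All rows whose Name appears more than once, including first occurrences
name_dup_rows = df[df.duplicated(subset=['Name'], keep=False)]
print(name_dup_rows)
```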
Building a Comprehensive Duplicate Report
Here's a practical function that generates a complete duplicate analysis:
import pandas as pd
def duplicate_report(df, columns=None):
    """Generate a comprehensive duplicate report for a DataFrame."""
    if columns:
        subset = columns
        label = f"columns {columns}"
    else:
        subset = None
        label = "all columns"
    total_rows = len(df)
    dup_count = df.duplicated(subset=subset).sum()
    unique_count = total_rows - dup_count
    print(f"Duplicate Report ({label})")
    print("=" * 40)
    print(f"Total rows: {total_rows}")
    print(f"Unique rows: {unique_count}")
    print(f"Duplicate rows: {dup_count}")
    print(f"Duplicate %: {dup_count / total_rows * 100:.1f}%")
    if dup_count > 0 and columns:
        print("\nValue counts:")
        counts = df.groupby(columns).size().reset_index(name='Count')
        counts = counts[counts['Count'] > 1].sort_values('Count', ascending=False)
        print(counts.to_string(index=False))
# Example usage
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
duplicate_report(df, columns=['Name'])
print()
duplicate_report(df, columns=['Name', 'Course'])
Output:
Duplicate Report (columns ['Name'])
========================================
Total rows: 6
Unique rows: 3
Duplicate rows: 3
Duplicate %: 50.0%
Value counts:
Name Count
Alice 3
Bob 2
Duplicate Report (columns ['Name', 'Course'])
========================================
Total rows: 6
Unique rows: 4
Duplicate rows: 2
Duplicate %: 33.3%
Value counts:
Name Course Count
Alice Python 2
Bob Java 2
Quick Comparison of Methods
| Method | Single Column | Multiple Columns | Shows Counts | Returns |
|---|---|---|---|---|
| value_counts() | ✅ | ✅ (on a DataFrame, pandas 1.1+) | ✅ | Series with counts |
| pivot_table(aggfunc='size') | ✅ | ✅ | ✅ | Series/DataFrame with counts |
| groupby().size() | ✅ | ✅ | ✅ | Series with counts |
| duplicated().sum() | ✅ | ✅ | ❌ (total only) | Integer |
| df[df.duplicated()] | ✅ | ✅ | ❌ | DataFrame of duplicate rows |
Conclusion
Pandas provides multiple methods to count duplicates, each suited to different needs:
- Use value_counts() for a quick count of how often each value appears in a single column.
- Use groupby().size() or pivot_table() to count duplicates across one or more columns, showing the frequency of each combination.
- Use duplicated().sum() to get a total count of duplicate rows in the DataFrame.
- Use df[df.duplicated(keep=False)] to see all rows involved in duplication.
- Use the subset parameter with duplicated() to check for duplicates based on specific columns only.
For most data cleaning workflows, start with duplicated().sum() to get a quick overview, then use value_counts() or groupby() to drill down into which values are duplicated and how many times.