Python Pandas: How to Count Duplicates in a Pandas DataFrame

Identifying and counting duplicate entries is a critical step in data cleaning and analysis. Duplicates can skew your results, inflate counts, and lead to incorrect conclusions. Whether you're auditing data quality, profiling a dataset, or preparing data for modeling, knowing how to count duplicates efficiently is essential.

In this guide, you'll learn multiple methods to count duplicates in a pandas DataFrame - across single columns, multiple columns, and the entire DataFrame - with clear examples and practical use cases.

Setting Up the Example DataFrame

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

print(df)

Output:

      Name  Course    City
0    Alice  Python  London
1      Bob    Java   Paris
2  Charlie  Python  London
3    Alice  Python  London
4      Bob    Java   Paris
5    Alice     SQL  Berlin

Counting Duplicates in a Single Column

Using value_counts()

The most straightforward way to count how many times each value appears in a column:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

name_counts = df['Name'].value_counts()
print(name_counts)

Output:

Name
Alice      3
Bob        2
Charlie    1
Name: count, dtype: int64

To see only values that appear more than once (actual duplicates):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

name_counts = df['Name'].value_counts()
duplicates_only = name_counts[name_counts > 1]
print("Duplicated values:")
print(duplicates_only)

Output:

Duplicated values:
Name
Alice    3
Bob      2
Name: count, dtype: int64
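
The filtered counts can also recover the total number of duplicate rows: each value contributes one row less than its count, since the first occurrence is not a duplicate. A minimal sketch of that cross-check:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

name_counts = df['Name'].value_counts()

# Each value contributes (count - 1) rows beyond its first occurrence
extra_rows = (name_counts - 1).sum()
print(extra_rows)  # 3

# Matches duplicated() restricted to the same column
print(df.duplicated(subset=['Name']).sum())  # 3
```

Both approaches agree: Alice adds two extra rows and Bob adds one.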

Using pivot_table()

The pivot_table() method with aggfunc='size' counts occurrences of each unique value in the index column:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

course_counts = df.pivot_table(index='Course', aggfunc='size')
print(course_counts)

Output:

Course
Java      2
Python    3
SQL       1
dtype: int64

Using groupby() with size()

Grouping by a column and calling size() returns the same per-value counts as value_counts(), but sorted by group key rather than by frequency:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

city_counts = df.groupby('City').size()
print(city_counts)

Output:

City
Berlin    1
London    3
Paris     2
dtype: int64
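
As with value_counts(), the result can be sorted and filtered to surface only the actual duplicates. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Sort by frequency, then keep only cities that appear more than once
city_counts = df.groupby('City').size().sort_values(ascending=False)
repeated_cities = city_counts[city_counts > 1]
print(repeated_cities)
```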

Counting Duplicates Across Multiple Columns

To count duplicates based on combinations of column values, pass multiple columns:

Using pivot_table()

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

combo_counts = df.pivot_table(index=['Name', 'Course'], aggfunc='size')
print(combo_counts)

Output:

Name     Course
Alice    Python    2
         SQL       1
Bob      Java      2
Charlie  Python    1
dtype: int64

The combinations ('Alice', 'Python') and ('Bob', 'Java') each appear twice.

Using groupby() with size()

The same pattern extends to multiple columns; chaining reset_index() turns the result into a tidy DataFrame with a count column:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

combo_counts = df.groupby(['Name', 'City']).size().reset_index(name='Count')
print(combo_counts)

Output:

      Name    City  Count
0    Alice  Berlin      1
1    Alice  London      2
2      Bob   Paris      2
3  Charlie  London      1

This clearly shows that Alice in London and Bob in Paris each appear twice.
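In recent pandas versions (1.1+), value_counts() also works directly on a DataFrame, counting unique rows across the selected columns — a compact alternative to groupby for this task:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Counts each unique (Name, City) row combination, most frequent first
combo_counts = df[['Name', 'City']].value_counts()
print(combo_counts)
```

Unlike groupby().size(), the result is sorted by frequency, so duplicated combinations appear at the top.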

Counting Total Duplicate Rows

Using duplicated()

The duplicated() method returns a boolean Series that is True for each row that repeats an earlier row. Summing it gives the total number of duplicates:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Count fully duplicate rows (all columns match)
total_duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {total_duplicates}")

# Show which rows are duplicates
print("\nDuplicate flags:")
print(df.duplicated())

Output:

Total duplicate rows: 2

Duplicate flags:
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

Rows 3 and 4 are duplicates of rows 0 and 1 respectively (when considering all columns).
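
Because the flags are booleans, mean() gives the duplicate fraction directly — a handy one-liner for a quick data-quality check:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# True/False values average to the fraction of duplicate rows
dup_fraction = df.duplicated().mean()
print(f"{dup_fraction:.1%} of rows are duplicates")
```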

Viewing the Actual Duplicate Rows

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

duplicate_rows = df[df.duplicated(keep=False)]
print("All rows involved in duplication:")
print(duplicate_rows)

Output:

All rows involved in duplication:
    Name  Course    City
0  Alice  Python  London
1    Bob    Java   Paris
3  Alice  Python  London
4    Bob    Java   Paris

keep Parameter Options

  • keep='first' (default): Marks all duplicates except the first occurrence.
  • keep='last': Marks all duplicates except the last occurrence.
  • keep=False: Marks all duplicate rows, including the first occurrence.

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

print(f"Duplicates (keep first): {df.duplicated(keep='first').sum()}")
print(f"Duplicates (keep last): {df.duplicated(keep='last').sum()}")
print(f"All duplicated rows: {df.duplicated(keep=False).sum()}")

Output:

Duplicates (keep first): 2
Duplicates (keep last): 2
All duplicated rows: 4
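
These counts are related: keep=False flags every row in a duplicated group, while keep='first' skips one row per group. The difference therefore equals the number of distinct duplicate groups — a sketch of that check:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

all_flagged = df.duplicated(keep=False).sum()    # every row in a duplicated group
extras_only = df.duplicated(keep='first').sum()  # all but one row per group

# One first occurrence per group accounts for the difference
n_groups = all_flagged - extras_only
print(f"Distinct duplicate groups: {n_groups}")
```

Here that gives 2 groups: (Alice, Python, London) and (Bob, Java, Paris).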

Counting Duplicates Based on Specific Columns

To check for duplicates based only on certain columns, pass them via the subset parameter:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Duplicates based on 'Name' column only
name_dups = df.duplicated(subset=['Name']).sum()
print(f"Duplicate names: {name_dups}")

# Duplicates based on 'Name' and 'Course' combination
combo_dups = df.duplicated(subset=['Name', 'Course']).sum()
print(f"Duplicate name-course combos: {combo_dups}")

Output:

Duplicate names: 3
Duplicate name-course combos: 2
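
To attach the count to every row instead of summarizing, groupby().transform('size') broadcasts each group's size back onto the original index — useful when you want to flag or sort rows by how often they repeat:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Each row gets the total number of rows sharing its Name
df['NameCount'] = df.groupby('Name')['Name'].transform('size')
print(df)
```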

Building a Comprehensive Duplicate Report

Here's a practical function that generates a complete duplicate analysis:

import pandas as pd

def duplicate_report(df, columns=None):
    """Generate a comprehensive duplicate report for a DataFrame."""
    if columns:
        subset = columns
        label = f"columns {columns}"
    else:
        subset = None
        label = "all columns"

    total_rows = len(df)
    dup_count = df.duplicated(subset=subset).sum()
    unique_count = total_rows - dup_count

    print(f"Duplicate Report ({label})")
    print("=" * 40)
    print(f"Total rows: {total_rows}")
    print(f"Unique rows: {unique_count}")
    print(f"Duplicate rows: {dup_count}")
    print(f"Duplicate %: {dup_count / total_rows * 100:.1f}%")

    if dup_count > 0 and columns:
        print("\nValue counts:")
        counts = df.groupby(columns).size().reset_index(name='Count')
        counts = counts[counts['Count'] > 1].sort_values('Count', ascending=False)
        print(counts.to_string(index=False))


# Example usage
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

duplicate_report(df, columns=['Name'])
print()
duplicate_report(df, columns=['Name', 'Course'])

Output:

Duplicate Report (columns ['Name'])
========================================
Total rows: 6
Unique rows: 3
Duplicate rows: 3
Duplicate %: 50.0%

Value counts:
 Name  Count
Alice      3
  Bob      2

Duplicate Report (columns ['Name', 'Course'])
========================================
Total rows: 6
Unique rows: 4
Duplicate rows: 2
Duplicate %: 33.3%

Value counts:
 Name Course  Count
Alice Python      2
  Bob   Java      2

Quick Comparison of Methods

Method                      | Single Column | Multiple Columns | Shows Counts    | Returns
value_counts()              | ✅            | ❌               | ✅              | Series with counts
pivot_table(aggfunc='size') | ✅            | ✅               | ✅              | Series/DataFrame with counts
groupby().size()            | ✅            | ✅               | ✅              | Series with counts
duplicated().sum()          | ✅            | ✅               | ❌ (total only) | Integer
df[df.duplicated()]         | ✅            | ✅               | ❌              | DataFrame of duplicate rows

Conclusion

Pandas provides multiple methods to count duplicates, each suited to different needs:

  • Use value_counts() for a quick count of how often each value appears in a single column.
  • Use groupby().size() or pivot_table() to count duplicates across one or more columns, showing the frequency of each combination.
  • Use duplicated().sum() to get a total count of duplicate rows in the DataFrame.
  • Use df[df.duplicated(keep=False)] to see all rows involved in duplication.
  • Use the subset parameter with duplicated() to check for duplicates based on specific columns only.

For most data cleaning workflows, start with duplicated().sum() to get a quick overview, then use value_counts() or groupby() to drill down into which values are duplicated and how many times.
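
Putting that workflow together, a minimal sketch of the overview-then-drill-down pattern:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Step 1: quick overview — how many fully duplicated rows?
print(f"Duplicate rows: {df.duplicated().sum()}")

# Step 2: drill down — which values drive the duplication?
name_counts = df['Name'].value_counts()
print(name_counts[name_counts > 1])
```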