Python Pandas: How to Count Duplicates in a Pandas DataFrame
Identifying and counting duplicate entries is a critical step in data cleaning and analysis. Duplicates can skew your results, inflate counts, and lead to incorrect conclusions. Whether you're auditing data quality, profiling a dataset, or preparing data for modeling, knowing how to count duplicates efficiently is essential.
In this guide, you'll learn multiple methods to count duplicates in a pandas DataFrame - across a single column, several columns, or entire rows - with clear examples and practical use cases.
Setting Up the Example DataFrame
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
print(df)
Output:
Name Course City
0 Alice Python London
1 Bob Java Paris
2 Charlie Python London
3 Alice Python London
4 Bob Java Paris
5 Alice SQL Berlin
Counting Duplicates in a Single Column
Using value_counts()
The most straightforward way to count how many times each value appears in a column:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
name_counts = df['Name'].value_counts()
print(name_counts)
Output:
Name
Alice 3
Bob 2
Charlie 1
Name: count, dtype: int64
To see only values that appear more than once (actual duplicates):
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
name_counts = df['Name'].value_counts()
duplicates_only = name_counts[name_counts > 1]
print("Duplicated values:")
print(duplicates_only)
Output:
Duplicated values:
Name
Alice 3
Bob 2
Name: count, dtype: int64
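The same value_counts() result can also answer two common follow-up questions: how many distinct values are duplicated, and how many redundant rows those duplicates add. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

name_counts = df['Name'].value_counts()

# Number of distinct values that occur more than once
num_duplicated_values = (name_counts > 1).sum()

# Number of "extra" rows beyond the first occurrence of each value
extra_rows = (name_counts[name_counts > 1] - 1).sum()

print(f"Distinct duplicated values: {num_duplicated_values}")
print(f"Extra (redundant) rows: {extra_rows}")
```

Note that extra_rows matches what df.duplicated(subset=['Name']).sum() returns, since both count every occurrence after the first.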
Using pivot_table()
The pivot_table() function with aggfunc='size' counts occurrences for each unique value:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
course_counts = df.pivot_table(index='Course', aggfunc='size')
print(course_counts)
Output:
Course
Java 2
Python 3
SQL 1
dtype: int64
Using groupby() with size()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
city_counts = df.groupby('City').size()
print(city_counts)
Output:
City
Berlin 1
London 3
Paris 2
dtype: int64
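A related trick is transform('size'), which broadcasts each group's count back onto every row. This is handy when you want to keep the original rows and filter or sort by how often each value occurs:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Attach each city's occurrence count to every row as a new column
df['CityCount'] = df.groupby('City')['City'].transform('size')
print(df)
```

Unlike groupby().size(), which returns one row per group, transform keeps the DataFrame's original shape.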
Counting Duplicates Across Multiple Columns
To count duplicates based on combinations of column values, pass multiple columns:
Using pivot_table()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
combo_counts = df.pivot_table(index=['Name', 'Course'], aggfunc='size')
print(combo_counts)
Output:
Name Course
Alice Python 2
SQL 1
Bob Java 2
Charlie Python 1
dtype: int64
The combination ('Alice', 'Python') appears twice and ('Bob', 'Java') appears twice.
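Since pandas 1.1, value_counts() also works directly on a DataFrame, so the same combination counts can be read off a column subset in one call:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Count each (Name, Course) combination; result is sorted by count, descending
combo_counts = df[['Name', 'Course']].value_counts()
print(combo_counts)
```

The result is a Series indexed by the (Name, Course) pairs, sorted with the most frequent combinations first.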
Using groupby() with size()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
combo_counts = df.groupby(['Name', 'City']).size().reset_index(name='Count')
print(combo_counts)
Output:
Name City Count
0 Alice Berlin 1
1 Alice London 2
2 Bob Paris 2
3 Charlie London 1
This clearly shows that Alice in London and Bob in Paris each appear twice.
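To restrict that table to true duplicates, filter on the Count column:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

combo_counts = df.groupby(['Name', 'City']).size().reset_index(name='Count')

# Keep only combinations that occur more than once
only_dups = combo_counts[combo_counts['Count'] > 1]
print(only_dups)
```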
Counting Total Duplicate Rows
Using duplicated()
The duplicated() method marks rows as True if they are duplicates of a previous row. Summing gives the total count:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
# Count fully duplicate rows (all columns match)
total_duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {total_duplicates}")
# Show which rows are duplicates
print("\nDuplicate flags:")
print(df.duplicated())
Output:
Total duplicate rows: 2
Duplicate flags:
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
Rows 3 and 4 are duplicates of rows 0 and 1 respectively (when considering all columns).
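As a sanity check, the same total falls out of drop_duplicates(): the number of rows removed by deduplication equals the number of rows flagged by duplicated():

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# Rows removed by deduplication; equals df.duplicated().sum()
n_dups = len(df) - len(df.drop_duplicates())
print(f"Duplicate rows: {n_dups}")
```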
Viewing the Actual Duplicate Rows
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
duplicate_rows = df[df.duplicated(keep=False)]
print("All rows involved in duplication:")
print(duplicate_rows)
Output:
All rows involved in duplication:
Name Course City
0 Alice Python London
1 Bob Java Paris
3 Alice Python London
4 Bob Java Paris
keep Parameter Options
- keep='first' (default): Marks all duplicates except the first occurrence.
- keep='last': Marks all duplicates except the last occurrence.
- keep=False: Marks all duplicate rows, including the first occurrence.
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
print(f"Duplicates (keep first): {df.duplicated(keep='first').sum()}")
print(f"Duplicates (keep last): {df.duplicated(keep='last').sum()}")
print(f"All duplicated rows: {df.duplicated(keep=False).sum()}")
Output:
Duplicates (keep first): 2
Duplicates (keep last): 2
All duplicated rows: 4
Counting Duplicates Based on Specific Columns
Check for duplicates considering only certain columns with the subset parameter:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
# Duplicates based on 'Name' column only
name_dups = df.duplicated(subset=['Name']).sum()
print(f"Duplicate names: {name_dups}")
# Duplicates based on 'Name' and 'Course' combination
combo_dups = df.duplicated(subset=['Name', 'Course']).sum()
print(f"Duplicate name-course combos: {combo_dups}")
Output:
Duplicate names: 3
Duplicate name-course combos: 2
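To inspect the underlying rows rather than just count them, combine subset with keep=False:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
    'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
    'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})

# All rows whose Name appears more than once, including first occurrences
name_dup_rows = df[df.duplicated(subset=['Name'], keep=False)]
print(name_dup_rows)
```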
Building a Comprehensive Duplicate Report
Here's a practical function that generates a complete duplicate analysis:
import pandas as pd
def duplicate_report(df, columns=None):
    """Generate a comprehensive duplicate report for a DataFrame."""
    if columns:
        subset = columns
        label = f"columns {columns}"
    else:
        subset = None
        label = "all columns"
    total_rows = len(df)
    dup_count = df.duplicated(subset=subset).sum()
    unique_count = total_rows - dup_count
    print(f"Duplicate Report ({label})")
    print("=" * 40)
    print(f"Total rows: {total_rows}")
    print(f"Unique rows: {unique_count}")
    print(f"Duplicate rows: {dup_count}")
    print(f"Duplicate %: {dup_count / total_rows * 100:.1f}%")
    if dup_count > 0 and columns:
        print("\nValue counts:")
        counts = df.groupby(columns).size().reset_index(name='Count')
        counts = counts[counts['Count'] > 1].sort_values('Count', ascending=False)
        print(counts.to_string(index=False))
# Example usage
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Alice'],
'Course': ['Python', 'Java', 'Python', 'Python', 'Java', 'SQL'],
'City': ['London', 'Paris', 'London', 'London', 'Paris', 'Berlin']
})
duplicate_report(df, columns=['Name'])
print()
duplicate_report(df, columns=['Name', 'Course'])
Output:
Duplicate Report (columns ['Name'])
========================================
Total rows: 6
Unique rows: 3
Duplicate rows: 3
Duplicate %: 50.0%
Value counts:
Name Count
Alice 3
Bob 2
Duplicate Report (columns ['Name', 'Course'])
========================================
Total rows: 6
Unique rows: 4
Duplicate rows: 2
Duplicate %: 33.3%
Value counts:
Name Course Count
Alice Python 2
Bob Java 2
Quick Comparison of Methods
| Method | Single Column | Multiple Columns | Shows Counts | Returns |
|---|---|---|---|---|
| value_counts() | ✅ | ✅ (on a DataFrame, pandas 1.1+) | ✅ | Series with counts |
| pivot_table(aggfunc='size') | ✅ | ✅ | ✅ | Series/DataFrame with counts |
| groupby().size() | ✅ | ✅ | ✅ | Series with counts |
| duplicated().sum() | ✅ | ✅ | ❌ (total only) | Integer |
| df[df.duplicated()] | ✅ | ✅ | ❌ | DataFrame of duplicate rows |
Conclusion
Pandas provides multiple methods to count duplicates, each suited to different needs:
- Use value_counts() for a quick count of how often each value appears in a single column.
- Use groupby().size() or pivot_table() to count duplicates across one or more columns, showing the frequency of each combination.
- Use duplicated().sum() to get a total count of duplicate rows in the DataFrame.
- Use df[df.duplicated(keep=False)] to see all rows involved in duplication.
- Use the subset parameter with duplicated() to check for duplicates based on specific columns only.
For most data cleaning workflows, start with duplicated().sum() to get a quick overview, then use value_counts() or groupby() to drill down into which values are duplicated and how many times.