Python Pandas: How to Find and Drop Duplicate Columns in a Pandas DataFrame

When working with real-world datasets, you may encounter DataFrames that contain duplicate columns - columns with identical data but potentially different names, or columns with the same name created during merges and concatenations. Identifying and removing these duplicates is essential for clean data analysis and avoiding redundant computations.

This guide covers how to detect duplicate columns in a Pandas DataFrame and demonstrates multiple methods to drop them, with clear examples and outputs for each approach.

Creating a Sample DataFrame

Let's start with a DataFrame that intentionally contains a duplicate column: the Marks column has the exact same values as the Age column.

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])
print(df)

Output:

      Name  Age    Domicile  Marks
0  Michael   34  California     34
1    Emily   30       Texas     30
2    Jacob   16     Florida     16
3    Emily   30    New York     30
4   Olivia   40    Illinois     40
5   Daniel   30  Washington     30

Notice that Age and Marks contain identical values in every row - Marks is a duplicate of Age.
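You can confirm this programmatically with Series.equals, which compares two columns value by value and, unlike elementwise ==, treats NaN positions consistently (a minimal check using the same data as above):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [34, 30, 16, 30, 40, 30],
    "Marks": [34, 30, 16, 30, 40, 30],
})

# Series.equals returns a single bool: True only if values,
# dtypes, and NaN positions all match.
print(df["Age"].equals(df["Marks"]))  # True
```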

Finding Duplicate Columns

Before dropping duplicates, you'll want to identify which columns are duplicated. There are two types of duplication to detect:

  1. Duplicate column names - columns with the same header label.
  2. Duplicate column content - columns with different names but identical data.

Detecting Columns with Identical Content

This custom function compares every column against every other column and identifies those with matching content:

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

def get_duplicate_columns(df):
    """Return a list of column names that have duplicate content."""
    duplicate_columns = set()

    for i in range(df.shape[1]):
        col_i = df.iloc[:, i]
        for j in range(i + 1, df.shape[1]):
            col_j = df.iloc[:, j]
            if col_i.equals(col_j):
                duplicate_columns.add(df.columns[j])

    return list(duplicate_columns)

# Find duplicates
duplicates = get_duplicate_columns(df)
print(f"Duplicate columns: {duplicates}")

Output:

Duplicate columns: ['Marks']

The function correctly identifies that Marks is a content-duplicate of Age.

Detecting Duplicate Column Names

If your DataFrame has columns with the same name (common after merges), use df.columns.duplicated():

import pandas as pd

# Create a DataFrame with duplicate column names
df_dup_names = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["A", "B", "A"],
)

print("Duplicate column names:")
print(df_dup_names.columns[df_dup_names.columns.duplicated()].tolist())

Output:

Duplicate column names:
['A']
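Duplicate names typically appear when DataFrames are combined column-wise. A small sketch (the left/right frames here are made up for illustration) showing how pd.concat produces one:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "score": [90, 85]})
right = pd.DataFrame({"id": [1, 2], "grade": ["A", "B"]})

# Concatenating along axis=1 keeps both "id" columns,
# yielding a DataFrame with a duplicated column name.
combined = pd.concat([left, right], axis=1)
print(combined.columns.tolist())      # ['id', 'score', 'id', 'grade']
print(combined.columns.duplicated())  # [False False  True False]
```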

Method 1: Drop Duplicate Content Using Transpose and drop_duplicates()

A concise approach is to transpose the DataFrame (swap rows and columns), apply drop_duplicates() to remove duplicate rows (which were originally duplicate columns), then transpose back:

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

df_clean = df.T.drop_duplicates().T
print(df_clean)

Output:

      Name Age    Domicile
0  Michael  34  California
1    Emily  30       Texas
2    Jacob  16     Florida
3    Emily  30    New York
4   Olivia  40    Illinois
5   Daniel  30  Washington

The Marks column has been removed because its content is identical to Age.

caution

This method has two potential issues:

  1. Data type changes - transposing converts all values to a common dtype (usually object), which may alter numeric columns. Verify dtypes after the operation:
    print(df_clean.dtypes)
  2. Performance - transposing large DataFrames is memory-intensive and slow. For large datasets, use one of the other methods below.
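If you like the transpose one-liner but want to keep the original dtypes, one workaround is to use the transpose only to decide which column labels survive deduplication, then select those columns from the untouched DataFrame. A sketch, assuming the students DataFrame from above:

```python
import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]
df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

# Transpose only to find which column labels remain after deduplication...
keep = df.T.drop_duplicates().index

# ...then select those columns from the original DataFrame,
# so the numeric dtypes are preserved.
df_clean = df[keep]
print(df_clean.dtypes)
```

This still pays the transpose cost, but the result keeps Age as an integer column instead of object.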

Method 2: Drop Duplicate Column Names Using df.loc[]

If your duplicate columns share the same name, use df.columns.duplicated() with loc[] to keep only the first occurrence:

import pandas as pd

# Create a DataFrame with duplicate column names
df_dup = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "A_dup": [1, 2, 3]},
)
df_dup.columns = ["A", "B", "A"]  # Force duplicate names

print("Before:")
print(df_dup)

# Keep only the first occurrence of each column name
df_clean = df_dup.loc[:, ~df_dup.columns.duplicated()]

print("\nAfter:")
print(df_clean)

Output:

Before:
   A  B  A
0  1  4  1
1  2  5  2
2  3  6  3

After:
   A  B
0  1  4
1  2  5
2  3  6

  • df.columns.duplicated() returns a boolean array - True for column names that have appeared before.
  • The ~ operator negates it, selecting only the first occurrence of each name.
  • df.loc[:, mask] selects all rows and only the non-duplicate columns.
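To make the mask concrete, here is what duplicated() actually returns for the DataFrame above:

```python
import pandas as pd

df_dup = pd.DataFrame([[1, 4, 1], [2, 5, 2], [3, 6, 3]], columns=["A", "B", "A"])

mask = df_dup.columns.duplicated()
print(mask)   # [False False  True] -- the second "A" is flagged
print(~mask)  # [ True  True False] -- keeps the first "A" and "B"
```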
tip

To keep the last occurrence instead of the first, pass keep='last':

df_clean = df_dup.loc[:, ~df_dup.columns.duplicated(keep="last")]

Method 3: Drop Using df.drop() with a Custom Detection Function

Combine the duplicate detection function from earlier with DataFrame.drop() for a clear, explicit approach:

import pandas as pd

def get_duplicate_columns(df):
    """Return column names whose content duplicates an earlier column."""
    duplicate_columns = set()
    for i in range(df.shape[1]):
        for j in range(i + 1, df.shape[1]):
            if df.iloc[:, i].equals(df.iloc[:, j]):
                duplicate_columns.add(df.columns[j])
    return list(duplicate_columns)

# Original DataFrame
students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]
df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

# Find and drop duplicate columns
duplicate_cols = get_duplicate_columns(df)
print(f"Dropping columns: {duplicate_cols}")

df_clean = df.drop(columns=duplicate_cols)
print(df_clean)

Output:

Dropping columns: ['Marks']
      Name  Age    Domicile
0  Michael   34  California
1    Emily   30       Texas
2    Jacob   16     Florida
3    Emily   30    New York
4   Olivia   40    Illinois
5   Daniel   30  Washington

This approach preserves the original data types and gives you full control over which columns are removed.
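For very wide DataFrames, the pairwise comparison becomes O(n²) in the number of columns. One way to speed it up is to first group columns by a hash of their values, then confirm only within each group. A sketch (the helper name get_duplicate_columns_fast is ours, not a pandas API):

```python
import pandas as pd

def get_duplicate_columns_fast(df):
    """Group columns by a hash of their values, then confirm real
    duplicates with .equals() so a hash collision can never drop data."""
    groups = {}
    for i in range(df.shape[1]):
        # pd.util.hash_pandas_object returns a uint64 hash per element.
        key = tuple(pd.util.hash_pandas_object(df.iloc[:, i], index=False))
        groups.setdefault(key, []).append(i)

    duplicates = []
    for positions in groups.values():
        first = df.iloc[:, positions[0]]
        for pos in positions[1:]:
            if first.equals(df.iloc[:, pos]):
                duplicates.append(df.columns[pos])
    return duplicates

df = pd.DataFrame(
    {"Name": ["Michael", "Emily"], "Age": [34, 30], "Marks": [34, 30]}
)
print(get_duplicate_columns_fast(df))  # ['Marks']
```

Only columns that land in the same hash group are ever compared, so unrelated columns are never checked against each other.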

Method 4: Using Column Correlation for Near-Duplicates

Sometimes columns aren't exactly identical but are highly correlated (e.g., due to rounding differences). You can use correlation to detect near-duplicates:

import numpy as np
import pandas as pd

df_numeric = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [10, 20, 30, 40, 50],
    "C": [1.001, 2.001, 3.001, 4.001, 5.001],  # Near-duplicate of A
})

# Compute correlation matrix
corr_matrix = df_numeric.corr().abs()

# Select upper triangle (excluding diagonal)
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Find columns with correlation > 0.99
threshold = 0.99
to_drop = [col for col in upper_triangle.columns if any(upper_triangle[col] > threshold)]

print(f"Near-duplicate columns to drop: {to_drop}")
df_clean = df_numeric.drop(columns=to_drop)
print(df_clean)

Output:

Near-duplicate columns to drop: ['B', 'C']
     A
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

note

This method only works with numeric columns and detects linear relationships, not equal values: notice that B is dropped even though its values differ from A, because B is exactly 10 * A and therefore perfectly correlated with it. Adjust the correlation threshold based on your tolerance for similarity.
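Because recent pandas versions (2.x) make DataFrame.corr() reject non-numeric columns by default, it is safest to filter with select_dtypes before computing correlations. A small sketch with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],   # non-numeric: must be excluded
    "x": [1.0, 2.0, 3.0],
    "y": [1.001, 2.001, 3.001],
})

# Keep only numeric columns so .corr() works across pandas versions.
numeric = df.select_dtypes(include=np.number)
print(numeric.corr().round(3))
```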

Common Mistake: Using inplace=True with Chained Operations

Dropping columns with inplace=True can behave unexpectedly when the DataFrame you modify is itself a slice of another DataFrame:

# āŒ Potentially problematic: mutates df in place
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

While this works on a standalone DataFrame, calling inplace=True on a DataFrame produced by indexing another one can raise SettingWithCopyWarning and fail to update the data you intended to change. The safer approach is to assign the result to a new variable:

# āœ… Safer: creates a clean copy
df_clean = df.drop(columns=duplicate_cols)

Comparison of Methods

Method                          Detects Content Duplicates   Detects Name Duplicates   Preserves Dtypes   Performance
Transpose + drop_duplicates()   āœ…                           āœ…                        āŒ (may change)    āš ļø Slow for large data
df.loc[:, ~duplicated()]        āŒ                           āœ…                        āœ…                 āœ… Fast
Custom function + df.drop()     āœ…                           āœ…                        āœ…                 āš ļø O(n²) columns
Correlation-based detection     āœ… (near-duplicates)         āŒ                        āœ…                 āœ… Fast (numeric only)

Conclusion

Duplicate columns waste memory, cause confusion, and can lead to errors in analysis. Pandas provides several ways to find and remove them:

  • For duplicate column names, use df.loc[:, ~df.columns.duplicated()] - it's fast, simple, and preserves data types.
  • For duplicate column content (different names, same data), use a custom comparison function with df.drop() for the most reliable results.
  • For a quick one-liner, transpose with df.T.drop_duplicates().T works but may alter data types.
  • For near-duplicates in numeric data, correlation analysis with a threshold can catch columns that are almost identical.

Choose the method that matches your specific scenario - whether you're dealing with duplicate names from a merge operation or redundant data columns in a large dataset.