Python Pandas: How to Find and Drop Duplicate Columns in a Pandas DataFrame

When working with real-world datasets, you may encounter DataFrames that contain duplicate columns - columns with identical data but potentially different names, or columns with the same name created during merges and concatenations. Identifying and removing these duplicates is essential for clean data analysis and avoiding redundant computations.

This guide covers how to detect duplicate columns in a Pandas DataFrame and demonstrates multiple methods to drop them, with clear examples and outputs for each approach.

Creating a Sample DataFrame

Let's start with a DataFrame that intentionally contains a duplicate column: the Marks column has the exact same values as the Age column.

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])
print(df)

Output:

      Name  Age    Domicile  Marks
0  Michael   34  California     34
1    Emily   30       Texas     30
2    Jacob   16     Florida     16
3    Emily   30    New York     30
4   Olivia   40    Illinois     40
5   Daniel   30  Washington     30

Notice that Age and Marks contain identical values in every row - Marks is a duplicate of Age.
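You can confirm this programmatically with Series.equals, which compares two columns value by value and, unlike elementwise ==, treats NaN positions consistently (a minimal check using the same data as above):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [34, 30, 16, 30, 40, 30],
    "Marks": [34, 30, 16, 30, 40, 30],
})

# Series.equals returns a single bool: True only if values,
# dtypes, and NaN positions all match.
print(df["Age"].equals(df["Marks"]))  # True
```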

Finding Duplicate Columns

Before dropping duplicates, you'll want to identify which columns are duplicated. There are two types of duplication to detect:

  1. Duplicate column names - columns with the same header label.
  2. Duplicate column content - columns with different names but identical data.

Detecting Columns with Identical Content

This custom function compares every column against every other column and identifies those with matching content:

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

def get_duplicate_columns(df):
    """Return a list of column names that have duplicate content."""
    duplicate_columns = set()

    for i in range(df.shape[1]):
        col_i = df.iloc[:, i]
        for j in range(i + 1, df.shape[1]):
            col_j = df.iloc[:, j]
            if col_i.equals(col_j):
                duplicate_columns.add(df.columns[j])

    return list(duplicate_columns)

# Find duplicates
duplicates = get_duplicate_columns(df)
print(f"Duplicate columns: {duplicates}")

Output:

Duplicate columns: ['Marks']

The function correctly identifies that Marks is a content-duplicate of Age.

Detecting Duplicate Column Names

If your DataFrame has columns with the same name (common after merges), use df.columns.duplicated():

import pandas as pd

# Create a DataFrame with duplicate column names
df_dup_names = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["A", "B", "A"],
)

print("Duplicate column names:")
print(df_dup_names.columns[df_dup_names.columns.duplicated()].tolist())

Output:

Duplicate column names:
['A']
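Duplicate names typically appear when DataFrames are combined column-wise. A small sketch (the left/right frames here are made up for illustration) showing how pd.concat produces one:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "score": [90, 85]})
right = pd.DataFrame({"id": [1, 2], "grade": ["A", "B"]})

# Concatenating along axis=1 keeps both "id" columns,
# yielding a DataFrame with a duplicated column name.
combined = pd.concat([left, right], axis=1)
print(combined.columns.tolist())      # ['id', 'score', 'id', 'grade']
print(combined.columns.duplicated())  # [False False  True False]
```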

Method 1: Drop Duplicate Content Using Transpose and drop_duplicates()

A concise approach is to transpose the DataFrame (swap rows and columns), apply drop_duplicates() to remove duplicate rows (which were originally duplicate columns), then transpose back:

import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]

df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

df_clean = df.T.drop_duplicates().T
print(df_clean)

Output:

      Name Age    Domicile
0  Michael  34  California
1    Emily  30       Texas
2    Jacob  16     Florida
3    Emily  30    New York
4   Olivia  40    Illinois
5   Daniel  30  Washington

The Marks column has been removed because its content is identical to Age.

caution

This method has two potential issues:

  1. Data type changes - transposing converts all values to a common dtype (usually object), which may alter numeric columns. Verify dtypes after the operation:
    print(df_clean.dtypes)
  2. Performance - transposing large DataFrames is memory-intensive and slow. For large datasets, use one of the other methods below.
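If you like the transpose one-liner but want to keep the original dtypes, one workaround is to use the transpose only to decide which column labels survive deduplication, then select those columns from the untouched DataFrame. A sketch, assuming the students DataFrame from above:

```python
import pandas as pd

students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]
df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

# Transpose only to find which column labels remain after deduplication...
keep = df.T.drop_duplicates().index

# ...then select those columns from the original DataFrame,
# so the numeric dtypes are preserved.
df_clean = df[keep]
print(df_clean.dtypes)
```

This still pays the transpose cost, but the result keeps Age as an integer column instead of object.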

Method 2: Drop Duplicate Column Names Using df.loc[]

If your duplicate columns share the same name, use df.columns.duplicated() with loc[] to keep only the first occurrence:

import pandas as pd

# Create a DataFrame with duplicate column names
df_dup = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "A_dup": [1, 2, 3]},
)
df_dup.columns = ["A", "B", "A"]  # Force duplicate names

print("Before:")
print(df_dup)

# Keep only the first occurrence of each column name
df_clean = df_dup.loc[:, ~df_dup.columns.duplicated()]

print("\nAfter:")
print(df_clean)

Output:

Before:
   A  B  A
0  1  4  1
1  2  5  2
2  3  6  3

After:
   A  B
0  1  4
1  2  5
2  3  6

  • df.columns.duplicated() returns a boolean array - True for column names that have appeared before.
  • The ~ operator negates it, selecting only the first occurrence of each name.
  • df.loc[:, mask] selects all rows and only the non-duplicate columns.
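To make the mask concrete, here is what duplicated() actually returns for the DataFrame above:

```python
import pandas as pd

df_dup = pd.DataFrame([[1, 4, 1], [2, 5, 2], [3, 6, 3]], columns=["A", "B", "A"])

mask = df_dup.columns.duplicated()
print(mask)   # [False False  True] -- the second "A" is flagged
print(~mask)  # [ True  True False] -- keeps the first "A" and "B"
```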
tip

To keep the last occurrence instead of the first, pass keep='last':

df_clean = df_dup.loc[:, ~df_dup.columns.duplicated(keep="last")]

Method 3: Drop Using df.drop() with a Custom Detection Function

Combine the duplicate detection function from earlier with DataFrame.drop() for a clear, explicit approach:

import pandas as pd

def get_duplicate_columns(df):
    """Return column names whose content duplicates an earlier column."""
    duplicate_columns = set()
    for i in range(df.shape[1]):
        for j in range(i + 1, df.shape[1]):
            if df.iloc[:, i].equals(df.iloc[:, j]):
                duplicate_columns.add(df.columns[j])
    return list(duplicate_columns)

# Original DataFrame
students = [
    ("Michael", 34, "California", 34),
    ("Emily", 30, "Texas", 30),
    ("Jacob", 16, "Florida", 16),
    ("Emily", 30, "New York", 30),
    ("Olivia", 40, "Illinois", 40),
    ("Daniel", 30, "Washington", 30),
]
df = pd.DataFrame(students, columns=["Name", "Age", "Domicile", "Marks"])

# Find and drop duplicate columns
duplicate_cols = get_duplicate_columns(df)
print(f"Dropping columns: {duplicate_cols}")

df_clean = df.drop(columns=duplicate_cols)
print(df_clean)

Output:

Dropping columns: ['Marks']
      Name  Age    Domicile
0  Michael   34  California
1    Emily   30       Texas
2    Jacob   16     Florida
3    Emily   30    New York
4   Olivia   40    Illinois
5   Daniel   30  Washington

This approach preserves the original data types and gives you full control over which columns are removed.
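For very wide DataFrames, the pairwise comparison becomes O(n²) in the number of columns. One way to speed it up is to first group columns by a hash of their values, then confirm only within each group. A sketch (the helper name get_duplicate_columns_fast is ours, not a pandas API):

```python
import pandas as pd

def get_duplicate_columns_fast(df):
    """Group columns by a hash of their values, then confirm real
    duplicates with .equals() so a hash collision can never drop data."""
    groups = {}
    for i in range(df.shape[1]):
        # pd.util.hash_pandas_object returns a uint64 hash per element.
        key = tuple(pd.util.hash_pandas_object(df.iloc[:, i], index=False))
        groups.setdefault(key, []).append(i)

    duplicates = []
    for positions in groups.values():
        first = df.iloc[:, positions[0]]
        for pos in positions[1:]:
            if first.equals(df.iloc[:, pos]):
                duplicates.append(df.columns[pos])
    return duplicates

df = pd.DataFrame(
    {"Name": ["Michael", "Emily"], "Age": [34, 30], "Marks": [34, 30]}
)
print(get_duplicate_columns_fast(df))  # ['Marks']
```

Only columns that land in the same hash group are ever compared, so unrelated columns are never checked against each other.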

Method 4: Using Column Correlation for Near-Duplicates

Sometimes columns aren't exactly identical but are highly correlated (e.g., due to rounding differences). You can use correlation to detect near-duplicates:

import numpy as np
import pandas as pd

df_numeric = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [10, 20, 30, 40, 50],
    "C": [1.001, 2.001, 3.001, 4.001, 5.001],  # Near-duplicate of A
})

# Compute correlation matrix
corr_matrix = df_numeric.corr().abs()

# Select upper triangle (excluding diagonal)
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Find columns with correlation > 0.99
threshold = 0.99
to_drop = [col for col in upper_triangle.columns if any(upper_triangle[col] > threshold)]

print(f"Near-duplicate columns to drop: {to_drop}")
df_clean = df_numeric.drop(columns=to_drop)
print(df_clean)

Output:

Near-duplicate columns to drop: ['B', 'C']
     A
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

note

This method only works with numeric columns and detects linear relationships, not equal values: notice that B is dropped even though its values differ from A, because B is exactly 10 * A and therefore perfectly correlated with it. Adjust the correlation threshold based on your tolerance for similarity.
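Because recent pandas versions (2.x) make DataFrame.corr() reject non-numeric columns by default, it is safest to filter with select_dtypes before computing correlations. A small sketch with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],   # non-numeric: must be excluded
    "x": [1.0, 2.0, 3.0],
    "y": [1.001, 2.001, 3.001],
})

# Keep only numeric columns so .corr() works across pandas versions.
numeric = df.select_dtypes(include=np.number)
print(numeric.corr().round(3))
```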

Common Mistake: Using inplace=True with Chained Operations

Dropping columns with inplace=True can behave unexpectedly when the DataFrame you modify is itself a slice of another DataFrame:

# āŒ Potentially problematic: mutates df in place
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

While this works on a standalone DataFrame, calling inplace=True on a DataFrame produced by indexing another one can raise SettingWithCopyWarning and fail to update the data you intended to change. The safer approach is to assign the result to a new variable:

# āœ… Safer: creates a clean copy
df_clean = df.drop(columns=duplicate_cols)

Comparison of Methods

Method                          Detects Content Duplicates   Detects Name Duplicates   Preserves Dtypes   Performance
Transpose + drop_duplicates()   āœ…                           āœ…                        āŒ (may change)    āš ļø Slow for large data
df.loc[:, ~duplicated()]        āŒ                           āœ…                        āœ…                 āœ… Fast
Custom function + df.drop()     āœ…                           āœ…                        āœ…                 āš ļø O(n²) columns
Correlation-based detection     āœ… (near-duplicates)         āŒ                        āœ…                 āœ… Fast (numeric only)

Conclusion

Duplicate columns waste memory, cause confusion, and can lead to errors in analysis. Pandas provides several ways to find and remove them:

  • For duplicate column names, use df.loc[:, ~df.columns.duplicated()] - it's fast, simple, and preserves data types.
  • For duplicate column content (different names, same data), use a custom comparison function with df.drop() for the most reliable results.
  • For a quick one-liner, transpose with df.T.drop_duplicates().T works but may alter data types.
  • For near-duplicates in numeric data, correlation analysis with a threshold can catch columns that are almost identical.

Choose the method that matches your specific scenario - whether you're dealing with duplicate names from a merge operation or redundant data columns in a large dataset.