Python Pandas: How to Concatenate Pandas DataFrames Without Duplicates

Combining multiple datasets while eliminating redundant rows is a fundamental operation in data pipelines. Whether you are merging daily exports, consolidating reports from different sources, or applying incremental updates to a master dataset, you need to ensure the final result contains only unique records. Pandas provides a straightforward two-step approach: concatenate the DataFrames first, then remove the duplicates.

In this guide, you will learn how to stack DataFrames and deduplicate them based on exact row matches or specific key columns, understand the different deduplication strategies, and avoid common performance pitfalls.

Basic Concatenation with Deduplication

The standard pattern combines pd.concat() to stack the DataFrames and .drop_duplicates() to remove redundant rows:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

# Stack and remove exact duplicate rows
combined = pd.concat([df1, df2], ignore_index=True).drop_duplicates()

print(combined)

Output:

   ID     Name
0   1    Alice
1   2      Bob
3   3  Charlie

Bob appears in both DataFrames, but the duplicate row is removed; only unique rows remain in the result. Note the gap in the index: .drop_duplicates() keeps the post-concatenation labels of the surviving rows, so label 2 (the dropped duplicate) is missing. Chain .reset_index(drop=True) if you need a contiguous index afterward.

Why ignore_index=True Matters

Without ignore_index=True, the concatenated DataFrame retains the original index values from each source, which can lead to duplicate index labels:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index: duplicate index labels
print("Without ignore_index:")
print(pd.concat([df1, df2]))
print()

# With ignore_index: clean sequential index
print("With ignore_index:")
print(pd.concat([df1, df2], ignore_index=True))

Output:

Without ignore_index:
   A
0  1
1  2
0  3
1  4

With ignore_index:
   A
0  1
1  2
2  3
3  4

Duplicate indices can cause confusion and unexpected behavior in subsequent operations like .loc[] lookups. Using ignore_index=True produces a clean, sequential index.
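To see the .loc[] pitfall concretely, here is a small sketch: with duplicate labels, a single-label lookup silently returns multiple rows instead of one.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index, the label 0 appears twice in the result
messy = pd.concat([df1, df2])
print(messy.loc[0])   # returns two rows, not one

# With ignore_index, every label is unique
clean = pd.concat([df1, df2], ignore_index=True)
print(clean.loc[0])   # returns a single row as a Series
```

Any code that assumes .loc[label] yields exactly one row will misbehave on the first result, which is why resetting the index immediately after concatenation is the safer default.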

Key-Based Deduplication with subset

Often, two rows are considered duplicates not because every column matches, but because they share the same key value, such as an ID. The subset parameter lets you specify which columns define uniqueness:

import pandas as pd

df_old = pd.DataFrame({
    'ID': [1, 2, 3],
    'Status': ['Active', 'Active', 'Active']
})

df_new = pd.DataFrame({
    'ID': [2, 4],
    'Status': ['Inactive', 'Active']  # ID 2 has been updated
})

# Keep the latest version of each ID
final = pd.concat([df_old, df_new]).drop_duplicates(
    subset='ID',
    keep='last',
    ignore_index=True
)

print(final)

Output:

   ID    Status
0   1    Active
1   3    Active
2   2  Inactive
3   4    Active

ID 2 has been updated from "Active" to "Inactive" because keep='last' preserves the row from df_new (which comes after df_old in the concatenation). ID 4 has been inserted as a new record.

Warning:

When using keep='last' for upsert operations, the order of concatenation is critical. The DataFrame containing the updates must come after the original data in the pd.concat() call. Reversing the order would preserve the old values instead.
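A minimal sketch of what goes wrong when the order is reversed (the data here is illustrative):

```python
import pandas as pd

df_old = pd.DataFrame({'ID': [1, 2], 'Status': ['Active', 'Active']})
df_new = pd.DataFrame({'ID': [2], 'Status': ['Inactive']})

# Updates last: keep='last' retains the new value for ID 2
correct = pd.concat([df_old, df_new]).drop_duplicates(subset='ID', keep='last')

# Updates first: the stale value from df_old survives instead
reversed_order = pd.concat([df_new, df_old]).drop_duplicates(subset='ID', keep='last')

print(correct[correct['ID'] == 2]['Status'].iloc[0])                 # Inactive
print(reversed_order[reversed_order['ID'] == 2]['Status'].iloc[0])   # Active
```

Both calls run without error and produce the same number of rows, which makes this an easy bug to miss; the only difference is which Status value survives for the conflicting ID.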

Understanding the keep Parameter

The keep parameter controls which duplicate row is retained:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Value': ['A', 'B', 'C', 'D']
})

print("keep='first' (preserve original):")
print(df.drop_duplicates(subset='ID', keep='first'))
print()

print("keep='last' (apply updates):")
print(df.drop_duplicates(subset='ID', keep='last'))
print()

print("keep=False (remove all conflicts):")
print(df.drop_duplicates(subset='ID', keep=False))

Output:

keep='first' (preserve original):
   ID Value
0   1     A
1   2     B
3   3     D

keep='last' (apply updates):
   ID Value
0   1     A
2   2     C
3   3     D

keep=False (remove all conflicts):
   ID Value
0   1     A
3   3     D

Value     Behavior                     Typical Use Case
'first'   Keeps the first occurrence   Preserve original data
'last'    Keeps the last occurrence    Apply updates (upsert)
False     Drops all duplicate rows     Remove any conflicting records
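Before committing to a keep strategy, it can help to inspect which rows are in conflict. The companion method .duplicated() returns a boolean mask using the same subset and keep semantics, so you can preview exactly what .drop_duplicates() would remove:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Value': ['A', 'B', 'C', 'D']
})

# Rows that duplicate an earlier row (what keep='first' would drop)
to_drop = df[df.duplicated(subset='ID', keep='first')]
print(to_drop)

# keep=False marks every member of a conflicting group for review
conflicts = df[df.duplicated(subset='ID', keep=False)]
print(conflicts)
```

This is useful in pipelines where silently dropping rows is risky: log or review the conflicts DataFrame first, then deduplicate.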

Composite Key Deduplication

When a single column is not sufficient to define uniqueness, pass a list of columns to subset:

import pandas as pd

df1 = pd.DataFrame({
    'Year': [2023, 2023],
    'Month': [1, 2],
    'Sales': [100, 150]
})

df2 = pd.DataFrame({
    'Year': [2023, 2023],
    'Month': [2, 3],
    'Sales': [175, 200]  # Updated Feb, new Mar
})

final = pd.concat([df1, df2]).drop_duplicates(
    subset=['Year', 'Month'],
    keep='last',
    ignore_index=True
)

print(final)

Output:

   Year  Month  Sales
0  2023      1    100
1  2023      2    175
2  2023      3    200

February 2023 has been updated from 150 to 175, and March 2023 has been inserted as a new record. A row is only considered a duplicate if both Year and Month match.

Combining Multiple Files

A common real-world scenario is loading and deduplicating multiple CSV files:

import pandas as pd
import glob

# Find all CSV files in the data directory
files = glob.glob('data/*.csv')

# Load all files into a list
all_dfs = [pd.read_csv(f) for f in files]

# Concatenate once and deduplicate
combined = pd.concat(all_dfs, ignore_index=True)
final = combined.drop_duplicates(subset='ID', keep='last')

print(f"Combined {len(files)} files: {len(combined)} rows -> {len(final)} unique")

Performance: Batch vs. Loop Concatenation

A critical performance consideration is where you call pd.concat(). Concatenating inside a loop creates a new DataFrame on every iteration, resulting in quadratic time complexity:

import pandas as pd

file_list = ['file1.csv', 'file2.csv', 'file3.csv']

# Wrong: concatenating inside the loop (very slow)
result = pd.DataFrame()
for file in file_list:
    df = pd.read_csv(file)
    result = pd.concat([result, df])

# Correct: collect first, concatenate once (fast)
dfs = [pd.read_csv(f) for f in file_list]
result = pd.concat(dfs, ignore_index=True).drop_duplicates()

Tip:

Always collect DataFrames in a list first and call pd.concat() once at the end. This approach is typically 50 to 100 times faster than incremental concatenation inside a loop, and the performance gap widens as the number of files grows.
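You can verify the difference yourself with a rough in-memory benchmark; absolute timings depend on your machine and pandas version, but the gap grows quickly with the number of frames:

```python
import time
import pandas as pd

# Build 200 small frames in memory to stand in for files
frames = [pd.DataFrame({'A': range(i, i + 50)}) for i in range(200)]

# Incremental concat: copies the entire growing result on every iteration
start = time.perf_counter()
slow = pd.DataFrame()
for df in frames:
    slow = pd.concat([slow, df], ignore_index=True)
loop_seconds = time.perf_counter() - start

# Single batch concat: one pass over all frames
start = time.perf_counter()
fast = pd.concat(frames, ignore_index=True)
batch_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, batch: {batch_seconds:.4f}s")
```

Both approaches produce identical data; only the cost differs, because the loop version re-copies all previously accumulated rows on every iteration.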

Quick Reference

Task                       Code
Simple stack               pd.concat([df1, df2], ignore_index=True)
Remove exact duplicates    .drop_duplicates()
Key-based deduplication    .drop_duplicates(subset='ID')
Keep updates (upsert)      .drop_duplicates(subset='ID', keep='last')
Multi-column key           .drop_duplicates(subset=['col1', 'col2'])
Batch file loading         pd.concat([pd.read_csv(f) for f in files])

  • Combine pd.concat() with .drop_duplicates() for clean data merging.
  • Use ignore_index=True to prevent duplicate indices.
  • Specify subset for key-based deduplication and keep='last' when newer data should override older records.
  • Always collect DataFrames in a list first and concatenate once for optimal performance.