# Python Pandas: How to Concatenate Pandas DataFrames Without Duplicates
Combining multiple datasets while eliminating redundant rows is a fundamental operation in data pipelines. Whether you are merging daily exports, consolidating reports from different sources, or applying incremental updates to a master dataset, you need to ensure the final result contains only unique records. Pandas provides a straightforward two-step approach: concatenate the DataFrames first, then remove the duplicates.
In this guide, you will learn how to stack DataFrames and deduplicate them based on exact row matches or specific key columns, understand the different deduplication strategies, and avoid common performance pitfalls.
## Basic Concatenation with Deduplication
The standard pattern combines `pd.concat()` to stack the DataFrames and `.drop_duplicates()` to remove redundant rows:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

# Stack and remove exact duplicate rows
combined = pd.concat([df1, df2], ignore_index=True).drop_duplicates()
print(combined)
```

Output:

```
   ID     Name
0   1    Alice
1   2      Bob
3   3  Charlie
```
Bob appears in both DataFrames, but the duplicate row is removed. Only unique rows remain in the result.
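Before dropping anything, you can preview which rows would be removed with `.duplicated()`, which returns a boolean mask marking each repeat of an earlier row. A small sketch using the same `df1` and `df2` as above:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

combined = pd.concat([df1, df2], ignore_index=True)

# True for every row that repeats an earlier row exactly
mask = combined.duplicated()

# These are exactly the rows drop_duplicates() would remove
print(combined[mask])
```

This is useful for sanity-checking a deduplication step before committing to it, especially on unfamiliar data.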
## Why `ignore_index=True` Matters
Without `ignore_index=True`, the concatenated DataFrame retains the original index values from each source, which can lead to duplicate index labels:
```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index: duplicate index labels
print("Without ignore_index:")
print(pd.concat([df1, df2]))
print()

# With ignore_index: clean sequential index
print("With ignore_index:")
print(pd.concat([df1, df2], ignore_index=True))
```

Output:

```
Without ignore_index:
   A
0  1
1  2
0  3
1  4

With ignore_index:
   A
0  1
1  2
2  3
3  4
```
Duplicate indices can cause confusion and unexpected behavior in subsequent operations like `.loc[]` lookups. Using `ignore_index=True` produces a clean, sequential index.
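To make the pitfall concrete: with duplicate labels, a `.loc[0]` lookup silently matches two rows and returns a DataFrame instead of a single Series:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

messy = pd.concat([df1, df2])              # index: 0, 1, 0, 1
clean = pd.concat([df1, df2], ignore_index=True)  # index: 0, 1, 2, 3

# Duplicate labels: .loc[0] matches TWO rows and returns a DataFrame
print(messy.loc[0])

# Unique labels: .loc[0] returns a single row as a Series
print(clean.loc[0])
```

Code written to expect a single row from `.loc` can break or, worse, silently process the wrong data when the index is duplicated.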
## Key-Based Deduplication with `subset`
Often, two rows are considered duplicates not because every column matches, but because they share the same key value, such as an ID. The `subset` parameter lets you specify which columns define uniqueness:
```python
import pandas as pd

df_old = pd.DataFrame({
    'ID': [1, 2, 3],
    'Status': ['Active', 'Active', 'Active']
})
df_new = pd.DataFrame({
    'ID': [2, 4],
    'Status': ['Inactive', 'Active']  # ID 2 has been updated
})

# Keep the latest version of each ID
final = pd.concat([df_old, df_new]).drop_duplicates(
    subset='ID',
    keep='last',
    ignore_index=True
)
print(final)
```

Output:

```
   ID    Status
0   1    Active
1   3    Active
2   2  Inactive
3   4    Active
```
ID 2 has been updated from "Active" to "Inactive" because `keep='last'` preserves the row from `df_new` (which comes after `df_old` in the concatenation). ID 4 has been inserted as a new record.

When using `keep='last'` for upsert operations, the order of concatenation is critical. The DataFrame containing the updates must come after the original data in the `pd.concat()` call. Reversing the order would preserve the old values instead.
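A quick side-by-side sketch of what reversing the order does, using the same `df_old` and `df_new` as above:

```python
import pandas as pd

df_old = pd.DataFrame({
    'ID': [1, 2, 3],
    'Status': ['Active', 'Active', 'Active']
})
df_new = pd.DataFrame({
    'ID': [2, 4],
    'Status': ['Inactive', 'Active']
})

# Updates last: keep='last' retains the df_new row for ID 2
upsert = pd.concat([df_old, df_new]).drop_duplicates(subset='ID', keep='last')

# Updates first: keep='last' now retains the STALE df_old row for ID 2
stale = pd.concat([df_new, df_old]).drop_duplicates(subset='ID', keep='last')

print(upsert.loc[upsert['ID'] == 2, 'Status'].item())  # Inactive
print(stale.loc[stale['ID'] == 2, 'Status'].item())    # Active
```

Both calls run without error, which is exactly why this mistake is easy to miss: the bug shows up only in the data, not as an exception.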
## Understanding the `keep` Parameter
The `keep` parameter controls which duplicate row is retained:
```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Value': ['A', 'B', 'C', 'D']
})

print("keep='first' (preserve original):")
print(df.drop_duplicates(subset='ID', keep='first'))
print()

print("keep='last' (apply updates):")
print(df.drop_duplicates(subset='ID', keep='last'))
print()

print("keep=False (remove all conflicts):")
print(df.drop_duplicates(subset='ID', keep=False))
```

Output:

```
keep='first' (preserve original):
   ID Value
0   1     A
1   2     B
3   3     D

keep='last' (apply updates):
   ID Value
0   1     A
2   2     C
3   3     D

keep=False (remove all conflicts):
   ID Value
0   1     A
3   3     D
```
| Value | Behavior | Typical Use Case |
|---|---|---|
| `'first'` | Keeps the first occurrence | Preserve original data |
| `'last'` | Keeps the last occurrence | Apply updates (upsert) |
| `False` | Drops all duplicate rows | Remove any conflicting records |
## Composite Key Deduplication
When a single column is not sufficient to define uniqueness, pass a list of columns to `subset`:
```python
import pandas as pd

df1 = pd.DataFrame({
    'Year': [2023, 2023],
    'Month': [1, 2],
    'Sales': [100, 150]
})
df2 = pd.DataFrame({
    'Year': [2023, 2023],
    'Month': [2, 3],
    'Sales': [175, 200]  # Updated Feb, new Mar
})

final = pd.concat([df1, df2]).drop_duplicates(
    subset=['Year', 'Month'],
    keep='last',
    ignore_index=True
)
print(final)
```

Output:

```
   Year  Month  Sales
0  2023      1    100
1  2023      2    175
2  2023      3    200
```
February 2023 has been updated from 150 to 175, and March 2023 has been inserted as a new record. A row is only considered a duplicate if both `Year` and `Month` match.
## Combining Multiple Files
A common real-world scenario is loading and deduplicating multiple CSV files:
```python
import pandas as pd
import glob

# Find all CSV files in the data directory
files = glob.glob('data/*.csv')

# Load all files into a list
all_dfs = [pd.read_csv(f) for f in files]

# Concatenate once and deduplicate
combined = pd.concat(all_dfs, ignore_index=True)
final = combined.drop_duplicates(subset='ID', keep='last')

print(f"Combined {len(files)} files: {len(combined)} rows -> {len(final)} unique")
```
## Performance: Batch vs. Loop Concatenation
A critical performance consideration is where you call `pd.concat()`. Concatenating inside a loop creates a new DataFrame on every iteration, resulting in quadratic time complexity:
```python
import pandas as pd

file_list = ['file1.csv', 'file2.csv', 'file3.csv']

# Wrong: concatenating inside the loop (very slow)
result = pd.DataFrame()
for file in file_list:
    df = pd.read_csv(file)
    result = pd.concat([result, df])

# Correct: collect first, concatenate once (fast)
dfs = [pd.read_csv(f) for f in file_list]
result = pd.concat(dfs, ignore_index=True).drop_duplicates()
```
Always collect DataFrames in a list first and call `pd.concat()` once at the end. Incremental concatenation re-copies all accumulated rows on every iteration, so the batch approach can easily be 50 to 100 times faster, and the gap widens as the number of files grows.
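You can verify the difference yourself with a small self-contained benchmark. This sketch substitutes synthetic DataFrames for the CSV files (the frame count and sizes are arbitrary assumptions, and exact timings vary by machine), but both approaches produce identical rows:

```python
import time
import pandas as pd

# Synthetic stand-ins for the CSV files: 200 small frames (assumption)
frames = [pd.DataFrame({'ID': range(i, i + 100)}) for i in range(200)]

# Loop approach: re-copies every accumulated row on each iteration
start = time.perf_counter()
looped = pd.DataFrame()
for df in frames:
    looped = pd.concat([looped, df])
loop_time = time.perf_counter() - start

# Batch approach: a single concatenation pass
start = time.perf_counter()
batched = pd.concat(frames, ignore_index=True)
batch_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, batch: {batch_time:.4f}s")
```

With more or larger frames, the loop's quadratic copying cost dominates quickly, while the batch version scales roughly linearly with the total row count.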
## Quick Reference
| Task | Code |
|---|---|
| Simple stack | `pd.concat([df1, df2], ignore_index=True)` |
| Remove exact duplicates | `.drop_duplicates()` |
| Key-based deduplication | `.drop_duplicates(subset='ID')` |
| Keep updates (upsert) | `.drop_duplicates(subset='ID', keep='last')` |
| Multi-column key | `.drop_duplicates(subset=['col1', 'col2'])` |
| Batch file loading | `pd.concat([pd.read_csv(f) for f in files])` |
- Combine `pd.concat()` with `.drop_duplicates()` for clean data merging.
- Use `ignore_index=True` to prevent duplicate indices.
- Specify `subset` for key-based deduplication and `keep='last'` when newer data should override older records.
- Always collect DataFrames in a list first and concatenate once for optimal performance.