Python Pandas: How to Concatenate and Merge CSV Files with an Upsert
Combining CSV files while updating existing records and inserting new ones is a common data integration pattern known as an "upsert" (update + insert). This approach is essential when you receive periodic data updates and need to merge them into a master dataset without creating duplicates, while ensuring that changed records reflect their latest values.
In this guide, you will learn how to implement the upsert pattern in Pandas, batch-concatenate multiple CSV files, handle composite keys and type mismatches, and track which records were modified during the process.
The Upsert Pattern: Update Existing, Insert New
The core idea is simple: concatenate the master data with the updates (placing updates last), then drop duplicates based on a key column while keeping the last occurrence. Since the updates come after the original data in the concatenated result, keep='last' ensures the newer version of each record is preserved.
import pandas as pd
master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'status': ['Active', 'Active', 'Active']
})
updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob', 'Diana'],
    'status': ['Inactive', 'Active']  # Bob's status changed
})
# Step 1: Concatenate with updates LAST (order matters!)
combined = pd.concat([master, updates])
print("After concat:")
print(combined)
print()
# Step 2: Drop duplicates, keeping the last occurrence (the update)
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("After upsert:")
print(final)
Output:
After concat:
id name status
0 1 Alice Active
1 2 Bob Active
2 3 Charlie Active
0 2 Bob Inactive
1 4 Diana Active
After upsert:
id name status
0 1 Alice Active
1 3 Charlie Active
2 2 Bob Inactive
3 4 Diana Active
Bob's record has been updated from "Active" to "Inactive", and Diana has been inserted as a new record. The original records for Alice and Charlie remain unchanged.
The order of concatenation is critical. The updates DataFrame must come after the master DataFrame in the pd.concat() call. If you reverse the order, keep='last' will preserve the old master data instead of the updates.
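To see exactly what goes wrong, here is a minimal sketch of the reversed order, using a stripped-down two-column version of the same data:

```python
import pandas as pd

master = pd.DataFrame({'id': [1, 2], 'status': ['Active', 'Active']})
updates = pd.DataFrame({'id': [2], 'status': ['Inactive']})

# WRONG: updates come first, so keep='last' retains the stale master row
wrong = pd.concat([updates, master]).drop_duplicates(
    subset='id', keep='last', ignore_index=True
)
print(wrong.loc[wrong['id'] == 2, 'status'].iloc[0])  # 'Active' -- update lost
```

The update for id 2 is silently discarded, which is exactly the failure mode to watch for.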
Applying the Pattern to CSV Files
In practice, you will be reading from actual CSV files:
import pandas as pd
# Load files
master = pd.read_csv('master.csv')
updates = pd.read_csv('updates.csv')
# Upsert
combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)
# Save the result back
final.to_csv('master.csv', index=False)
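Note that this overwrites master.csv in place, so a failure mid-write could corrupt your only copy. One defensive sketch (the save_atomically helper below is a hypothetical name, not part of pandas) writes to a temporary file first and then swaps it into place:

```python
import os
import tempfile

import pandas as pd

def save_atomically(df, path):
    # Write to a temp file in the same directory, then atomically replace
    # the target, so a crash mid-write cannot leave a truncated master.csv.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.', suffix='.csv')
    os.close(fd)
    try:
        df.to_csv(tmp, index=False)
        os.replace(tmp, path)
    except BaseException:
        os.remove(tmp)
        raise
```

os.replace is atomic on the same filesystem, which is why the temp file is created in the destination's own directory.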
Batch Concatenation of Multiple CSV Files
When you need to combine many CSV files that share the same structure, such as monthly sales reports or daily log files, use glob to find the files and concatenate them in a single operation:
import pandas as pd
import glob
# Find all matching files
files = glob.glob('data/sales_*.csv')
# Concatenate all files at once
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(f"Combined {len(files)} files into {len(df)} total rows")
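A useful variation is to tag each row with the file it came from via .assign(). The sketch below creates two sample files so it runs end to end; the source_file column name is just an illustration:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two sample files so the sketch is self-contained
data_dir = tempfile.mkdtemp()
pd.DataFrame({'id': [1], 'amount': [100]}).to_csv(
    os.path.join(data_dir, 'sales_jan.csv'), index=False)
pd.DataFrame({'id': [2], 'amount': [200]}).to_csv(
    os.path.join(data_dir, 'sales_feb.csv'), index=False)

files = sorted(glob.glob(os.path.join(data_dir, 'sales_*.csv')))

# .assign() tags each row with its source file before concatenating
df = pd.concat(
    [pd.read_csv(f).assign(source_file=os.path.basename(f)) for f in files],
    ignore_index=True,
)
print(df)
```

Sorting the file list also makes the row order deterministic, which matters if you later apply the upsert pattern to the combined result.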
With Progress Tracking
For large numbers of files, tracking progress helps you monitor the operation:
import pandas as pd
import glob
files = glob.glob('data/*.csv')
dfs = []
for f in files:
    print(f"Processing {f}...")
    dfs.append(pd.read_csv(f))
combined = pd.concat(dfs, ignore_index=True)
print(f"Done. Combined {len(files)} files into {len(combined)} rows.")
Always collect DataFrames in a list first and call pd.concat() once at the end. Concatenating inside the loop creates a new DataFrame on every iteration, which is dramatically slower due to repeated memory allocation and copying.
Handling Type Mismatches
A common source of bugs in CSV merging is when the key column has different data types across files. One file might store IDs as integers while another stores them as strings, causing the duplicate detection to fail silently:
import pandas as pd
master = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
updates = pd.DataFrame({'id': ['2', '4'], 'value': [25, 40]})
# Without type alignment: duplicate detection fails
bad_result = pd.concat([master, updates])
bad_result = bad_result.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("Without type alignment (id=2 appears twice):")
print(bad_result)
print()
# With type alignment: works correctly
master['id'] = master['id'].astype(str)
updates['id'] = updates['id'].astype(str)
good_result = pd.concat([master, updates])
good_result = good_result.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("With type alignment:")
print(good_result)
Output:
Without type alignment (id=2 appears twice):
id value
0 1 10
1 2 20
2 3 30
3 2 25
4 4 40
With type alignment:
id value
0 1 10
1 3 30
2 2 25
3 4 40
Always ensure key columns have matching types before concatenating. Converting both to str is the safest approach when you are unsure.
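If you perform this operation often, it may be worth wrapping the pattern in a small helper that normalizes the key dtype first. The upsert function below is a sketch for a single key column, not a pandas built-in:

```python
import pandas as pd

def upsert(master, updates, key, key_type=str):
    # Coerce the key column to a common dtype on copies, then apply
    # the concat + drop_duplicates pattern with updates last.
    master = master.assign(**{key: master[key].astype(key_type)})
    updates = updates.assign(**{key: updates[key].astype(key_type)})
    combined = pd.concat([master, updates])
    return combined.drop_duplicates(subset=key, keep='last', ignore_index=True)

master = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
updates = pd.DataFrame({'id': ['2', '4'], 'value': [25, 40]})
result = upsert(master, updates, 'id')
print(result)  # four rows: id 2 now has value 25
```

Using .assign() keeps the caller's DataFrames untouched, since it returns modified copies rather than mutating in place.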
Working with Composite Primary Keys
When a single column is not sufficient to uniquely identify records, pass a list of columns to the subset parameter:
import pandas as pd
master = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 1],
    'sales': [100, 150, 200]
})
updates = pd.DataFrame({
    'year': [2023, 2024],
    'month': [2, 2],
    'sales': [175, 250]  # Updated Feb 2023, new Feb 2024
})
combined = pd.concat([master, updates])
final = combined.drop_duplicates(
    subset=['year', 'month'],
    keep='last',
    ignore_index=True
)
print(final)
Output:
year month sales
0 2023 1 100
1 2024 1 200
2 2023 2 175
3 2024 2 250
February 2023 sales have been updated from 150 to 175, and February 2024 has been inserted as a new record.
Preserving Sort Order After Upsert
The upsert operation can change the row order because drop_duplicates removes earlier occurrences while keeping later ones. To maintain a consistent ordering, sort by the key column after the operation:
import pandas as pd
master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
})
updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob Updated', 'Diana'],
})
combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)
# Sort by primary key for consistent ordering
final = final.sort_values('id').reset_index(drop=True)
print(final)
Output:
id name
0 1 Alice
1 2 Bob Updated
2 3 Charlie
3 4 Diana
Tracking Which Records Changed
To identify which records were updated and which were inserted during the upsert, add a temporary source column before merging:
import pandas as pd
master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'status': ['Active', 'Active', 'Active']
})
updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob', 'Diana'],
    'status': ['Inactive', 'Active']
})
# Mark the source of each record
master['_source'] = 'original'
updates['_source'] = 'update'
combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)
# Report what changed
updated_or_new = final[final['_source'] == 'update']
print(f"Updated/inserted {len(updated_or_new)} records:")
print(updated_or_new[['id', 'name', 'status']])
print()
# Remove the tracking column before saving
final = final.drop(columns=['_source'])
final = final.sort_values('id').reset_index(drop=True)
print("Final result:")
print(final)
Output:
Updated/inserted 2 records:
id name status
2 2 Bob Inactive
3 4 Diana Active
Final result:
id name status
0 1 Alice Active
1 2 Bob Inactive
2 3 Charlie Active
3 4 Diana Active
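The source column tells you which rows came from the updates file, but not whether each one replaced an existing record or was brand new. A small follow-up sketch splits the two cases by checking the update keys against the original master:

```python
import pandas as pd

master = pd.DataFrame({'id': [1, 2, 3], 'status': ['Active'] * 3})
updates = pd.DataFrame({'id': [2, 4], 'status': ['Inactive', 'Active']})

# Ids already present in master were updated; the rest were inserted
existing = updates['id'].isin(master['id'])
updated_ids = updates.loc[existing, 'id'].tolist()
inserted_ids = updates.loc[~existing, 'id'].tolist()
print(f"Updated: {updated_ids}, inserted: {inserted_ids}")  # Updated: [2], inserted: [4]
```

This check must run against the original master, before the upsert, since afterwards every id exists in the result.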
Quick Reference
| Task | Method | Key Detail |
|---|---|---|
| Simple stack | pd.concat([df1, df2]) | ignore_index=True |
| Upsert | concat then drop_duplicates | keep='last', updates last |
| Composite key | drop_duplicates(subset=[col1, col2]) | List of key columns |
| Type alignment | .astype(str) on key columns | Match types before concat |
| Batch files | pd.concat([pd.read_csv(f) for f in files]) | Collect in list, concat once |
| Track changes | Add _source column before concat | Remove before saving |
Use pd.concat([master, updates]) followed by drop_duplicates(subset='key', keep='last') for upsert operations.
The order of concatenation is critical: always place the updates DataFrame last so that keep='last' preserves the newer data.
For batch operations, use glob.glob() to discover files and list comprehension with pd.concat() for efficient merging.
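Putting the pieces together, the whole workflow can be sketched as one function. The name upsert_csvs is hypothetical; the paths and key column are parameters you would adapt to your own data:

```python
import glob

import pandas as pd

def upsert_csvs(master_path, update_glob, key):
    """Apply every matching update file, in sorted order, to a master CSV."""
    frames = [pd.read_csv(master_path)]
    frames += [pd.read_csv(f) for f in sorted(glob.glob(update_glob))]
    combined = pd.concat(frames, ignore_index=True)
    final = combined.drop_duplicates(subset=key, keep='last', ignore_index=True)
    final = final.sort_values(key).reset_index(drop=True)
    final.to_csv(master_path, index=False)
    return final
```

Because the master is read first and the update files are appended in sorted order, keep='last' lets later update files win over earlier ones as well as over the master.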