
Python Pandas: How to Concatenate and Merge CSV Files with Upsert

Combining CSV files while updating existing records and inserting new ones is a common data integration pattern known as an "upsert" (update + insert). This approach is essential when you receive periodic data updates and need to merge them into a master dataset without creating duplicates, while ensuring that changed records reflect their latest values.

In this guide, you will learn how to implement the upsert pattern in Pandas, batch-concatenate multiple CSV files, handle composite keys and type mismatches, and track which records were modified during the process.

The Upsert Pattern: Update Existing, Insert New

The core idea is simple: concatenate the master data with the updates (placing updates last), then drop duplicates based on a key column while keeping the last occurrence. Since the updates come after the original data in the concatenated result, keep='last' ensures the newer version of each record is preserved.

import pandas as pd

master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'status': ['Active', 'Active', 'Active']
})

updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob', 'Diana'],
    'status': ['Inactive', 'Active']  # Bob's status changed
})

# Step 1: Concatenate with updates LAST (order matters!)
combined = pd.concat([master, updates])
print("After concat:")
print(combined)
print()

# Step 2: Drop duplicates, keeping the last occurrence (the update)
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("After upsert:")
print(final)

Output:

After concat:
   id     name    status
0   1    Alice    Active
1   2      Bob    Active
2   3  Charlie    Active
0   2      Bob  Inactive
1   4    Diana    Active

After upsert:
   id     name    status
0   1    Alice    Active
1   3  Charlie    Active
2   2      Bob  Inactive
3   4    Diana    Active

Bob's record has been updated from "Active" to "Inactive", and Diana has been inserted as a new record. The original records for Alice and Charlie remain unchanged.

warning

The order of concatenation is critical. The updates DataFrame must come after the master DataFrame in the pd.concat() call. If you reverse the order, keep='last' will preserve the old master data instead of the updates.
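To see the pitfall concretely, here is a minimal sketch with throwaway data showing what happens when the order is reversed:

```python
import pandas as pd

master = pd.DataFrame({'id': [1, 2], 'status': ['Active', 'Active']})
updates = pd.DataFrame({'id': [2], 'status': ['Inactive']})

# WRONG: master comes last, so keep='last' favors the stale master rows
wrong = pd.concat([updates, master]).drop_duplicates(
    subset='id', keep='last', ignore_index=True
)
print(wrong)  # id=2 still shows 'Active' -- the update was discarded
```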

Applying the Pattern to CSV Files

In practice, you will be reading from actual CSV files:

import pandas as pd

# Load files
master = pd.read_csv('master.csv')
updates = pd.read_csv('updates.csv')

# Upsert
combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)

# Save the result back
final.to_csv('master.csv', index=False)
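If you run this upsert on a schedule, it can be convenient to wrap the steps in a small helper. The function name `upsert_csv` and its `key` default are illustrative choices, not part of Pandas; the body is just the same concat-then-dedupe pattern:

```python
import pandas as pd

def upsert_csv(master_path, updates_path, key='id'):
    """Merge updates_path into master_path, keeping the newest row per key."""
    master = pd.read_csv(master_path)
    updates = pd.read_csv(updates_path)
    combined = pd.concat([master, updates])  # updates last, so they win
    final = combined.drop_duplicates(subset=key, keep='last', ignore_index=True)
    final.to_csv(master_path, index=False)
    return final
```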

Batch Concatenation of Multiple CSV Files

When you need to combine many CSV files that share the same structure, such as monthly sales reports or daily log files, use glob to find the files and concatenate them in a single operation:

import pandas as pd
import glob

# Find all matching files
files = glob.glob('data/sales_*.csv')

# Concatenate all files at once
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(f"Combined {len(files)} files into {len(df)} total rows")

With Progress Tracking

For large numbers of files, tracking progress helps you monitor the operation:

import pandas as pd
import glob

files = glob.glob('data/*.csv')
dfs = []

for f in files:
    print(f"Processing {f}...")
    dfs.append(pd.read_csv(f))

combined = pd.concat(dfs, ignore_index=True)
print(f"Done. Combined {len(files)} files into {len(combined)} rows.")

tip

Always collect DataFrames in a list first and call pd.concat() once at the end. Concatenating inside the loop creates a new DataFrame on every iteration, which is dramatically slower due to repeated memory allocation and copying.
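The difference is easy to demonstrate with throwaway frames. Both forms produce identical results, but the loop version re-copies every accumulated row on each iteration:

```python
import pandas as pd

frames = [pd.DataFrame({'x': [i]}) for i in range(100)]

# Anti-pattern: concat inside the loop copies the growing result each time
slow = frames[0]
for f in frames[1:]:
    slow = pd.concat([slow, f], ignore_index=True)

# Preferred: collect first, concatenate once
fast = pd.concat(frames, ignore_index=True)

print(slow.equals(fast))  # same data, very different cost
```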

Handling Type Mismatches

A common source of bugs in CSV merging is when the key column has different data types across files. One file might store IDs as integers while another stores them as strings, causing the duplicate detection to fail silently:

import pandas as pd

master = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
updates = pd.DataFrame({'id': ['2', '4'], 'value': [25, 40]})

# Without type alignment: duplicate detection fails
bad_result = pd.concat([master, updates])
bad_result = bad_result.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("Without type alignment (id=2 appears twice):")
print(bad_result)
print()

# With type alignment: works correctly
master['id'] = master['id'].astype(str)
updates['id'] = updates['id'].astype(str)

good_result = pd.concat([master, updates])
good_result = good_result.drop_duplicates(subset='id', keep='last', ignore_index=True)
print("With type alignment:")
print(good_result)

Output:

Without type alignment (id=2 appears twice):
  id  value
0  1     10
1  2     20
2  3     30
3  2     25
4  4     40

With type alignment:
  id  value
0  1     10
1  3     30
2  2     25
3  4     40

Always ensure key columns have matching types before concatenating. Converting both to str is the safest approach when you are unsure.
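One way to enforce this is a small helper that casts the key column in every frame before concatenating. The name `align_key_types` is hypothetical, used here only for illustration:

```python
import pandas as pd

def align_key_types(frames, key='id'):
    """Return copies of the frames with the key column cast to str."""
    return [f.assign(**{key: f[key].astype(str)}) for f in frames]

master = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
updates = pd.DataFrame({'id': ['2', '3'], 'value': [25, 30]})

master, updates = align_key_types([master, updates])
final = pd.concat([master, updates]).drop_duplicates(
    subset='id', keep='last', ignore_index=True
)
print(final)  # id '2' now matches across frames and takes the updated value
```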

Working with Composite Primary Keys

When a single column is not sufficient to uniquely identify records, pass a list of columns to the subset parameter:

import pandas as pd

master = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 1],
    'sales': [100, 150, 200]
})

updates = pd.DataFrame({
    'year': [2023, 2024],
    'month': [2, 2],
    'sales': [175, 250]  # Updated Feb 2023, new Feb 2024
})

combined = pd.concat([master, updates])
final = combined.drop_duplicates(
    subset=['year', 'month'],
    keep='last',
    ignore_index=True
)

print(final)

Output:

   year  month  sales
0  2023      1    100
1  2024      1    200
2  2023      2    175
3  2024      2    250

February 2023 sales have been updated from 150 to 175, and February 2024 has been inserted as a new record.

Preserving Sort Order After Upsert

The upsert operation can change the row order because drop_duplicates removes earlier occurrences while keeping later ones. To maintain a consistent ordering, sort by the key column after the operation:

import pandas as pd

master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
})

updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob Updated', 'Diana'],
})

combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)

# Sort by primary key for consistent ordering
final = final.sort_values('id').reset_index(drop=True)

print(final)

Output:

   id         name
0   1        Alice
1   2  Bob Updated
2   3      Charlie
3   4        Diana

Tracking Which Records Changed

To identify which records were updated and which were inserted during the upsert, add a temporary source column before merging:

import pandas as pd

master = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'status': ['Active', 'Active', 'Active']
})

updates = pd.DataFrame({
    'id': [2, 4],
    'name': ['Bob', 'Diana'],
    'status': ['Inactive', 'Active']
})

# Mark the source of each record
master['_source'] = 'original'
updates['_source'] = 'update'

combined = pd.concat([master, updates])
final = combined.drop_duplicates(subset='id', keep='last', ignore_index=True)

# Report what changed
updated_or_new = final[final['_source'] == 'update']
print(f"Updated/inserted {len(updated_or_new)} records:")
print(updated_or_new[['id', 'name', 'status']])
print()

# Remove the tracking column before saving
final = final.drop(columns=['_source'])
final = final.sort_values('id').reset_index(drop=True)

print("Final result:")
print(final)

Output:

Updated/inserted 2 records:
   id   name    status
2   2    Bob  Inactive
3   4  Diana    Active

Final result:
   id     name    status
0   1    Alice    Active
1   2      Bob  Inactive
2   3  Charlie    Active
3   4    Diana    Active

Quick Reference

Task            | Method                                      | Key Detail
Simple stack    | pd.concat([df1, df2])                       | ignore_index=True
Upsert          | concat then drop_duplicates                 | keep='last', updates last
Composite key   | drop_duplicates(subset=[col1, col2])        | List of key columns
Type alignment  | .astype(str) on key columns                 | Match types before concat
Batch files     | pd.concat([pd.read_csv(f) for f in files])  | Collect in list, concat once
Track changes   | Add _source column before concat            | Remove before saving

Use pd.concat([master, updates]) followed by drop_duplicates(subset='key', keep='last') for upsert operations.

The order of concatenation is critical: always place the updates DataFrame last so that keep='last' preserves the newer data.

For batch operations, use glob.glob() to discover files and list comprehension with pd.concat() for efficient merging.
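Putting the pieces together, the whole workflow can be sketched as a single function. The name `batch_upsert`, the file layout, and the `key` default are illustrative assumptions for this sketch, not fixed APIs:

```python
import glob
import pandas as pd

def batch_upsert(master_path, updates_glob, key='id'):
    """Apply every updates file (in sorted filename order) on top of the master."""
    master = pd.read_csv(master_path)
    update_frames = [pd.read_csv(f) for f in sorted(glob.glob(updates_glob))]

    # Master first, updates in chronological order, so keep='last' wins correctly
    combined = pd.concat([master, *update_frames])
    final = (combined
             .drop_duplicates(subset=key, keep='last', ignore_index=True)
             .sort_values(key)
             .reset_index(drop=True))
    final.to_csv(master_path, index=False)
    return final
```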