How to Merge Multiple CSV Files into a Single Pandas DataFrame in Python
When working with data analysis projects, it's common to receive data split across multiple CSV files - whether by date, region, department, or any other logical partition. Before you can analyze this data, you need to combine all these files into a single Pandas DataFrame. Manually copying and pasting data is error-prone and impractical for large datasets, so Python and Pandas offer efficient, automated solutions.
In this guide, you'll learn how to merge multiple CSV files using pd.concat(), automate file discovery with glob, and handle common issues like mismatched columns and duplicate headers.
Sample CSV files for the following examples:

sales_jan.csv

```
OrderID,Product,Quantity,Price
1001,Widget,5,29.99
1002,Gadget,3,49.99
1003,Widget,7,29.99
1004,Gizmo,2,19.99
```

sales_feb.csv

```
OrderID,Product,Quantity,Price
1005,Widget,1,29.99
1006,Gadget,4,49.99
1007,Gizmo,3,19.99
1008,Widget,6,29.99
```

sales_mar.csv

```
OrderID,Product,Quantity,Price
1009,Widget,2,29.99
1010,Gadget,5,49.99
1011,Gizmo,1,19.99
1012,Widget,3,29.99
```

names.csv

```
ID,Name
1,Alice
2,Bob
3,Charlie
4,David
```

scores.csv

```
ID,Score
1,85
2,92
3,78
4,88
```
Method 1: Merging Specific CSV Files by Name
The simplest approach is to explicitly list the CSV files you want to merge and use pd.concat() combined with map() to read and concatenate them in one step.
How It Works
- `pd.read_csv()` reads each CSV file into a DataFrame.
- `map()` applies `pd.read_csv()` to every file path in the list.
- `pd.concat()` stitches the resulting DataFrames together along the row axis.
```python
import pandas as pd

# List the CSV files to merge
csv_files = ['sales_jan.csv', 'sales_feb.csv']

# Read and concatenate all files
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
print(df)
```
Output (example):

```
   OrderID Product  Quantity  Price
0     1001  Widget         5  29.99
1     1002  Gadget         3  49.99
2     1003  Widget         7  29.99
3     1004   Gizmo         2  19.99
4     1005  Widget         1  29.99
5     1006  Gadget         4  49.99
6     1007   Gizmo         3  19.99
7     1008  Widget         6  29.99
```
Setting ignore_index=True resets the index to a continuous sequence (0, 1, 2, ...). Without it, each file's original index is preserved, which can lead to duplicate index values.
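A minimal sketch of the difference, using small inline DataFrames in place of the CSV files:

```python
import pandas as pd

# Two small frames standing in for two CSV files
df1 = pd.DataFrame({'OrderID': [1001, 1002], 'Quantity': [5, 3]})
df2 = pd.DataFrame({'OrderID': [1005, 1006], 'Quantity': [1, 4]})

# Without ignore_index, each frame keeps its own 0-based index
stacked = pd.concat([df1, df2])
print(list(stacked.index))  # [0, 1, 0, 1] -- duplicate labels

# With ignore_index=True, the result is renumbered 0..n-1
clean = pd.concat([df1, df2], ignore_index=True)
print(list(clean.index))  # [0, 1, 2, 3]
```

Duplicate index labels are not an error in Pandas, but they make `.loc[]` lookups ambiguous, which is why resetting the index during the merge is the safer default.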
Step-by-Step Alternative
If you prefer more explicit, readable code, you can read the files individually and then concatenate:
```python
import pandas as pd

df1 = pd.read_csv('sales_jan.csv')
df2 = pd.read_csv('sales_feb.csv')
df3 = pd.read_csv('sales_mar.csv')

df = pd.concat([df1, df2, df3], ignore_index=True)
print(f"Total rows: {len(df)}")
print(df.head())
```
Output:

```
Total rows: 12
   OrderID Product  Quantity  Price
0     1001  Widget         5  29.99
1     1002  Gadget         3  49.99
2     1003  Widget         7  29.99
3     1004   Gizmo         2  19.99
4     1005  Widget         1  29.99
```
This approach is fine for a small number of files, but becomes cumbersome when you have dozens or hundreds of CSVs.
Method 2: Automatically Discovering and Merging All CSV Files
When you have many CSV files in a directory (or don't know the exact filenames), use the glob and os modules to automatically find matching files and merge them.
How It Works
- `os.path.join()` builds a file path pattern by combining the directory and a wildcard filename.
- `glob.glob()` returns a list of all files matching the pattern.
- `pd.concat()` with `map()` reads and merges them all.
```python
import pandas as pd
import glob
import os

# Define the directory and file pattern
file_pattern = os.path.join("data", "sales_*.csv")

# Find all matching CSV files
csv_files = sorted(glob.glob(file_pattern))
print(f"Found {len(csv_files)} files: {csv_files}")

# Read and concatenate all files
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
print(df)
```
Output (example):

```
Found 3 files: ['data/sales_feb.csv', 'data/sales_jan.csv', 'data/sales_mar.csv']
    OrderID Product  Quantity  Price
0      1005  Widget         1  29.99
1      1006  Gadget         4  49.99
2      1007   Gizmo         3  19.99
3      1008  Widget         6  29.99
4      1001  Widget         5  29.99
5      1002  Gadget         3  49.99
6      1003  Widget         7  29.99
7      1004   Gizmo         2  19.99
8      1009  Widget         2  29.99
9      1010  Gadget         5  49.99
10     1011   Gizmo         1  19.99
11     1012  Widget         3  29.99
```
Use sorted() on the glob.glob() result to ensure files are processed in a consistent, predictable order. Without sorting, the order depends on the operating system and filesystem, which can vary.
This entire operation can also be condensed into a single line:

```python
df = pd.concat(map(pd.read_csv, sorted(glob.glob(os.path.join("data", "sales_*.csv")))), ignore_index=True)
```
Method 3: Adding a Source Column to Track File Origin
When merging many files, it's often useful to know which file each row came from. You can add a source column during the merge:
```python
import pandas as pd
import glob
import os

csv_files = sorted(glob.glob(os.path.join("data", "sales_*.csv")))

dataframes = []
for filepath in csv_files:
    df_temp = pd.read_csv(filepath)
    df_temp['source_file'] = os.path.basename(filepath)
    dataframes.append(df_temp)

df = pd.concat(dataframes, ignore_index=True)
print(df)
```
Output (example):

```
    OrderID Product  Quantity  Price    source_file
0      1005  Widget         1  29.99  sales_feb.csv
1      1006  Gadget         4  49.99  sales_feb.csv
2      1007   Gizmo         3  19.99  sales_feb.csv
3      1008  Widget         6  29.99  sales_feb.csv
4      1001  Widget         5  29.99  sales_jan.csv
5      1002  Gadget         3  49.99  sales_jan.csv
6      1003  Widget         7  29.99  sales_jan.csv
7      1004   Gizmo         2  19.99  sales_jan.csv
8      1009  Widget         2  29.99  sales_mar.csv
9      1010  Gadget         5  49.99  sales_mar.csv
10     1011   Gizmo         1  19.99  sales_mar.csv
11     1012  Widget         3  29.99  sales_mar.csv
```
This makes it easy to filter or group by the original file later in your analysis.
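As a quick sketch (with inline data standing in for the merged result), the `source_file` column works like any other grouping key:

```python
import pandas as pd

# Stand-in for a merged frame that already has a source_file column
df = pd.DataFrame({
    'Quantity': [5, 3, 1, 4],
    'source_file': ['sales_jan.csv', 'sales_jan.csv',
                    'sales_feb.csv', 'sales_feb.csv'],
})

# Total quantity per original file
per_file = df.groupby('source_file')['Quantity'].sum()
print(per_file)
```

The same column also makes debugging easier: if a bad value turns up after the merge, filtering on `source_file` tells you which input file to inspect.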
Handling Common Issues
Mismatched Columns Across Files
If your CSV files have different columns, pd.concat() will still merge them but fill missing columns with NaN:
```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [25]})
df2 = pd.DataFrame({'Name': ['Bob'], 'Score': [88]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)
```
Output:

```
    Name   Age  Score
0  Alice  25.0    NaN
1    Bob   NaN   88.0
```
This silent behavior can mask data quality issues. If all your CSV files should have the same columns, validate them before merging:
```python
expected_columns = {'OrderID', 'Product', 'Quantity', 'Price'}

for filepath in csv_files:
    df_temp = pd.read_csv(filepath)
    if set(df_temp.columns) != expected_columns:
        print(f"Warning: {filepath} has unexpected columns: {set(df_temp.columns)}")
```
Duplicate Headers Inside CSV Files
Sometimes CSV files exported from tools contain header rows repeated within the data. After merging, these show up as data rows with column names as values:
```python
import pandas as pd

# ❌ Header rows mixed into data after concat
df = pd.concat(map(pd.read_csv, ['file1.csv', 'file2.csv']), ignore_index=True)
print(df)
```
Output (problematic):

```
   OrderID  Product Quantity  Price
0     1001   Widget        5  29.99
1  OrderID  Product Quantity  Price   ← duplicate header row
2     1002   Gadget        3  49.99
```
Fix: Filter out rows that match the header values:

```python
# ✅ Remove rows where values match column names
df = df[df['OrderID'] != 'OrderID']

# Restore numeric data types after filtering
df['OrderID'] = df['OrderID'].astype(int)
df['Price'] = df['Price'].astype(float)
print(df)
```
Merging Large Files Efficiently
For very large CSV files, reading all of them into memory at once can be problematic. Use chunked reading or specify data types upfront to reduce memory usage:
```python
import pandas as pd
import glob

csv_files = sorted(glob.glob("data/sales_*.csv"))

# Specify data types to reduce memory consumption
dtypes = {'OrderID': 'int32', 'Quantity': 'int16',
          'Price': 'float32', 'Product': 'category'}

dataframes = [pd.read_csv(f, dtype=dtypes) for f in csv_files]
df = pd.concat(dataframes, ignore_index=True)

print(f"Total rows: {len(df)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")
```
Output (example, for a larger dataset):

```
Total rows: 12000
Memory usage: 5.00 MB
```
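The chunked-reading approach mentioned above uses the `chunksize` parameter of `pd.read_csv()`, which returns an iterator of DataFrames instead of one large frame. A sketch using an in-memory CSV so the example is self-contained:

```python
import pandas as pd
from io import StringIO

# Small in-memory CSV standing in for a large file on disk
csv_text = "OrderID,Quantity\n" + "\n".join(
    f"{1000 + i},{i % 7}" for i in range(10))

# chunksize=4 yields DataFrames of up to 4 rows each
chunks = pd.read_csv(StringIO(csv_text), chunksize=4)

# Filter each chunk before combining, keeping peak memory low
filtered = [chunk[chunk['Quantity'] > 0] for chunk in chunks]
df = pd.concat(filtered, ignore_index=True)
print(f"Rows kept: {len(df)}")
```

This pattern only helps if you can reduce each chunk (filter, aggregate, downcast) before concatenating; concatenating unmodified chunks ends up using the same memory as a plain `read_csv()`.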
Row-wise vs Column-wise Concatenation
By default, pd.concat() stacks DataFrames vertically (row-wise, axis=0). If your CSV files contain different columns for the same records and you want to merge them horizontally, use axis=1:
```python
import pandas as pd

df_names = pd.read_csv('names.csv')    # Contains: ID, Name
df_scores = pd.read_csv('scores.csv')  # Contains: ID, Score

# Horizontal concatenation (use merge for key-based joining)
result = pd.concat([df_names, df_scores], axis=1)
print(result)
```
Output:

```
   ID     Name  ID  Score
0   1    Alice   1     85
1   2      Bob   2     92
2   3  Charlie   3     78
3   4    David   4     88
```
For horizontal merging based on a shared key column, use pd.merge() instead of pd.concat(axis=1). The concat function aligns by index position, not by logical keys, which can produce incorrect pairings if rows aren't in the same order.
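A sketch of the key-based alternative, using inline frames with the same columns as names.csv and scores.csv (the score rows are deliberately shuffled to show that merge aligns on the key, not on position):

```python
import pandas as pd

df_names = pd.DataFrame({'ID': [1, 2, 3, 4],
                         'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df_scores = pd.DataFrame({'ID': [4, 3, 2, 1],  # out of order on purpose
                          'Score': [88, 78, 92, 85]})

# merge pairs rows by the shared ID key, regardless of row order
result = pd.merge(df_names, df_scores, on='ID')
print(result)
```

Unlike `pd.concat(axis=1)`, the result has a single `ID` column, and Alice still gets her score of 85 even though the score rows arrived in reverse order.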
Summary
| Method | Best For | Key Function |
|---|---|---|
| Explicit file list with `map()` | Small number of known files | `pd.concat(map(pd.read_csv, files))` |
| `glob` pattern matching | Many files or dynamic discovery | `glob.glob(os.path.join(dir, "*.csv"))` |
| Loop with source tracking | Tracing data origin | Manual loop with `source_file` column |
To merge multiple CSV files into a single Pandas DataFrame:

- use `pd.concat()` to stack them vertically
- use `glob.glob()` to automatically discover files
- use `ignore_index=True` to maintain a clean index
Always validate that your files share the same column structure, handle duplicate headers, and consider memory-efficient data types for large datasets.