Python Pandas: How to Merge Multiple CSV Files into a Single Pandas DataFrame in Python

When working with data analysis projects, it's common to receive data split across multiple CSV files - whether by date, region, department, or any other logical partition. Before you can analyze this data, you need to combine all these files into a single Pandas DataFrame. Manually copying and pasting data is error-prone and impractical for large datasets, so Python and Pandas offer efficient, automated solutions.

In this guide, you'll learn how to merge multiple CSV files using pd.concat(), automate file discovery with glob, and handle common issues like mismatched columns and duplicate headers.

Sample CSV files for the following examples

sales_jan.csv
OrderID,Product,Quantity,Price
1001,Widget,5,29.99
1002,Gadget,3,49.99
1003,Widget,7,29.99
1004,Gizmo,2,19.99
sales_feb.csv
OrderID,Product,Quantity,Price
1005,Widget,1,29.99
1006,Gadget,4,49.99
1007,Gizmo,3,19.99
1008,Widget,6,29.99
sales_mar.csv
OrderID,Product,Quantity,Price
1009,Widget,2,29.99
1010,Gadget,5,49.99
1011,Gizmo,1,19.99
1012,Widget,3,29.99
names.csv
ID,Name
1,Alice
2,Bob
3,Charlie
4,David
scores.csv
ID,Score
1,85
2,92
3,78
4,88

Method 1: Merging Specific CSV Files by Name

The simplest approach is to explicitly list the CSV files you want to merge and use pd.concat() combined with map() to read and concatenate them in one step.

How It Works

  • pd.read_csv() reads each CSV file into a DataFrame.
  • map() applies pd.read_csv() to every file path in the list.
  • pd.concat() stitches the resulting DataFrames together along the row axis.
import pandas as pd

# List the CSV files to merge
csv_files = ['sales_jan.csv', 'sales_feb.csv']

# Read and concatenate all files
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
print(df)

Output (example):

   OrderID Product  Quantity  Price
0     1001  Widget         5  29.99
1     1002  Gadget         3  49.99
2     1003  Widget         7  29.99
3     1004   Gizmo         2  19.99
4     1005  Widget         1  29.99
5     1006  Gadget         4  49.99
6     1007   Gizmo         3  19.99
7     1008  Widget         6  29.99
Info: Setting ignore_index=True resets the index to a continuous sequence (0, 1, 2, ...). Without it, each file's original index is preserved, which can lead to duplicate index values.
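To see the difference, here's a minimal sketch using small inline DataFrames in place of files:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Without ignore_index, each frame keeps its own index: 0, 1, 0, 1
print(pd.concat([df1, df2]).index.tolist())

# With ignore_index=True, the index is rebuilt: 0, 1, 2, 3
print(pd.concat([df1, df2], ignore_index=True).index.tolist())
```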

Step-by-Step Alternative

If you prefer more explicit, readable code, you can read the files individually and then concatenate:

import pandas as pd

df1 = pd.read_csv('sales_jan.csv')
df2 = pd.read_csv('sales_feb.csv')
df3 = pd.read_csv('sales_mar.csv')

df = pd.concat([df1, df2, df3], ignore_index=True)
print(f"Total rows: {len(df)}")
print(df.head())

Output:

Total rows: 12
   OrderID Product  Quantity  Price
0     1001  Widget         5  29.99
1     1002  Gadget         3  49.99
2     1003  Widget         7  29.99
3     1004   Gizmo         2  19.99
4     1005  Widget         1  29.99

This approach is fine for a small number of files, but becomes cumbersome when you have dozens or hundreds of CSVs.

Method 2: Automatically Discovering and Merging All CSV Files

When you have many CSV files in a directory (or don't know the exact filenames), use the glob and os modules to automatically find matching files and merge them.

How It Works

  • os.path.join() builds a file path pattern by combining the directory and a wildcard filename.
  • glob.glob() returns a list of all files matching the pattern.
  • pd.concat() with map() reads and merges them all.
import pandas as pd
import glob
import os

# Define the directory and file pattern
file_pattern = os.path.join("data", "sales_*.csv")

# Find all matching CSV files
csv_files = sorted(glob.glob(file_pattern))
print(f"Found {len(csv_files)} files: {csv_files}")

# Read and concatenate all files
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
print(df)

Output (example):

Found 3 files: ['data/sales_feb.csv', 'data/sales_jan.csv', 'data/sales_mar.csv']
    OrderID Product  Quantity  Price
0      1005  Widget         1  29.99
1      1006  Gadget         4  49.99
2      1007   Gizmo         3  19.99
3      1008  Widget         6  29.99
4      1001  Widget         5  29.99
5      1002  Gadget         3  49.99
6      1003  Widget         7  29.99
7      1004   Gizmo         2  19.99
8      1009  Widget         2  29.99
9      1010  Gadget         5  49.99
10     1011   Gizmo         1  19.99
11     1012  Widget         3  29.99
Tip: Use sorted() on the glob.glob() result to ensure files are processed in a consistent, predictable order. Without sorting, the order depends on the operating system and filesystem, which can vary.

This entire operation can also be condensed into a single line:

df = pd.concat(map(pd.read_csv, sorted(glob.glob(os.path.join("data", "sales_*.csv")))), ignore_index=True)
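As an aside, the same discovery can be done with the standard library's pathlib instead of glob + os.path. A sketch (writing two tiny sample files to a temporary directory so the example is self-contained):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two tiny sample files in a temporary directory for the demo
tmp = Path(tempfile.mkdtemp())
(tmp / "sales_jan.csv").write_text("OrderID,Price\n1001,29.99\n")
(tmp / "sales_feb.csv").write_text("OrderID,Price\n1005,49.99\n")

# Path.glob replaces glob.glob + os.path.join; sorted() keeps a stable order
csv_files = sorted(tmp.glob("sales_*.csv"))
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
print(df)
```

pd.read_csv accepts Path objects directly, so no string conversion is needed.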

Method 3: Adding a Source Column to Track File Origin

When merging many files, it's often useful to know which file each row came from. You can add a source column during the merge:

import pandas as pd
import glob
import os

csv_files = sorted(glob.glob(os.path.join("data", "sales_*.csv")))

dataframes = []
for filepath in csv_files:
    df_temp = pd.read_csv(filepath)
    df_temp['source_file'] = os.path.basename(filepath)
    dataframes.append(df_temp)

df = pd.concat(dataframes, ignore_index=True)
print(df)

Output (example):

    OrderID Product  Quantity  Price    source_file
0      1005  Widget         1  29.99  sales_feb.csv
1      1006  Gadget         4  49.99  sales_feb.csv
2      1007   Gizmo         3  19.99  sales_feb.csv
3      1008  Widget         6  29.99  sales_feb.csv
4      1001  Widget         5  29.99  sales_jan.csv
5      1002  Gadget         3  49.99  sales_jan.csv
6      1003  Widget         7  29.99  sales_jan.csv
7      1004   Gizmo         2  19.99  sales_jan.csv
8      1009  Widget         2  29.99  sales_mar.csv
9      1010  Gadget         5  49.99  sales_mar.csv
10     1011   Gizmo         1  19.99  sales_mar.csv
11     1012  Widget         3  29.99  sales_mar.csv

This makes it easy to filter or group by the original file later in your analysis.
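A related option is pd.concat's keys parameter, which labels each block of rows in a MultiIndex level rather than a column. A sketch, using inline frames in place of the per-file DataFrames:

```python
import pandas as pd

# Inline frames standing in for the per-file DataFrames
frames = {
    'sales_jan.csv': pd.DataFrame({'OrderID': [1001, 1002]}),
    'sales_feb.csv': pd.DataFrame({'OrderID': [1005, 1006]}),
}

# Passing a dict uses its keys as the outer index level
df = pd.concat(frames, names=['source_file', 'row'])
print(df)

# Convert the index level into a regular column if preferred
df = df.reset_index(level='source_file').reset_index(drop=True)
```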

Handling Common Issues

Mismatched Columns Across Files

If your CSV files have different columns, pd.concat() will still merge them but fill missing columns with NaN:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [25]})
df2 = pd.DataFrame({'Name': ['Bob'], 'Score': [88]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

    Name   Age  Score
0  Alice  25.0    NaN
1    Bob   NaN   88.0
Caution: This silent behavior can mask data quality issues. If all your CSV files should have the same columns, validate them before merging:

expected_columns = {'OrderID', 'Product', 'Quantity', 'Price'}
for filepath in csv_files:
    df_temp = pd.read_csv(filepath)
    if set(df_temp.columns) != expected_columns:
        print(f"Warning: {filepath} has unexpected columns: {set(df_temp.columns)}")

Duplicate Headers Inside CSV Files

Sometimes CSV files exported from tools contain header rows repeated within the data. After merging, these show up as data rows with column names as values:

import pandas as pd

# ❌ Header rows mixed into data after concat
df = pd.concat(map(pd.read_csv, ['file1.csv', 'file2.csv']), ignore_index=True)
print(df)

Output (problematic):

   OrderID  Product Quantity  Price
0     1001   Widget        5  29.99
1  OrderID  Product Quantity  Price   ← duplicate header row
2     1002   Gadget        3  49.99

Fix: Filter out rows that match the header values:

# ✅ Remove rows where values match column names
df = df[df['OrderID'] != 'OrderID']

# Reset data types after filtering
df['OrderID'] = df['OrderID'].astype(int)
df['Price'] = df['Price'].astype(float)
print(df)
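That fix checks a single column. A sketch of a more general variant drops any row whose values repeat all of the column names (the inline frame below simulates a merged result containing a stray header row):

```python
import pandas as pd

# Inline frame simulating a merged result with a stray header row
df = pd.DataFrame({'OrderID': ['1001', 'OrderID', '1002'],
                   'Product': ['Widget', 'Product', 'Gadget']})

# A row is a duplicate header if every value equals its column's name
is_header = (df == list(df.columns)).all(axis=1)
df = df[~is_header].reset_index(drop=True)

# Restore numeric types after filtering
df['OrderID'] = df['OrderID'].astype(int)
print(df)
```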

Merging Large Files Efficiently

For very large CSV files, reading all of them into memory at once can be problematic. Use chunked reading or specify data types upfront to reduce memory usage:

import pandas as pd
import glob

csv_files = sorted(glob.glob("data/sales_*.csv"))

# Specify data types to reduce memory consumption
dtypes = {'OrderID': 'int32', 'Quantity': 'int16', 'Price': 'float32', 'Product': 'category'}

dataframes = [pd.read_csv(f, dtype=dtypes) for f in csv_files]
df = pd.concat(dataframes, ignore_index=True)

print(f"Total rows: {len(df)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")

Output (example, for a larger dataset than the sample files above):

Total rows: 12000
Memory usage: 5.00 MB
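The chunked reading mentioned above can be sketched as follows; an in-memory CSV stands in for a genuinely large file on disk:

```python
import io

import pandas as pd

# An in-memory CSV standing in for a large file on disk
big_csv = io.StringIO("OrderID,Price\n" + "\n".join(f"{i},9.99" for i in range(10)))

# chunksize makes read_csv return an iterator of smaller DataFrames,
# so you can filter or aggregate each chunk before combining
chunks = pd.read_csv(big_csv, chunksize=4)
df = pd.concat(chunks, ignore_index=True)
print(len(df))
```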

Row-wise vs Column-wise Concatenation

By default, pd.concat() stacks DataFrames vertically (row-wise, axis=0). If your CSV files contain different columns for the same records and you want to merge them horizontally, use axis=1:

import pandas as pd

df_names = pd.read_csv('names.csv') # Contains: ID, Name
df_scores = pd.read_csv('scores.csv') # Contains: ID, Score

# Horizontal concatenation (use merge for key-based joining)
result = pd.concat([df_names, df_scores], axis=1)
print(result)

Output:

   ID     Name  ID  Score
0   1    Alice   1     85
1   2      Bob   2     92
2   3  Charlie   3     78
3   4    David   4     88
Tip: For horizontal merging based on a shared key column, use pd.merge() instead of pd.concat(axis=1). The concat function aligns by index position, not by logical keys, which can produce incorrect pairings if rows aren't in the same order.
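For instance, even with the score rows shuffled, pd.merge still pairs them correctly by ID (a sketch with inline frames in place of the CSV files):

```python
import pandas as pd

df_names = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
# Rows deliberately out of order relative to df_names
df_scores = pd.DataFrame({'ID': [3, 1, 2], 'Score': [78, 85, 92]})

# merge aligns on the ID key, so physical row order doesn't matter
result = pd.merge(df_names, df_scores, on='ID')
print(result)
```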

Summary

Method                          Best For                           Key Function
Explicit file list with map()   Small number of known files        pd.concat(map(pd.read_csv, files))
glob pattern matching           Many files or dynamic discovery    glob.glob(os.path.join(dir, "*.csv"))
Loop with source tracking       Tracing data origin                Manual loop with source_file column

To merge multiple CSV files into a single Pandas DataFrame:

  • use pd.concat() to stack them vertically
  • use glob.glob() to automatically discover files
  • use ignore_index=True to maintain a clean index

Always validate that your files share the same column structure, handle duplicate headers, and consider memory-efficient data types for large datasets.