Python Pandas: How to Combine Two DataFrames in Pandas
When working with real-world data, information is often spread across multiple sources - one DataFrame might contain customer profiles while another holds their transaction history. To analyze this data together, you need to combine these DataFrames into a single, unified structure.
Pandas provides two primary methods for combining DataFrames: concat() for stacking data vertically or horizontally, and merge() for joining data based on shared columns (similar to SQL joins).
In this guide, you'll learn how to use both methods effectively, understand their differences, and know when to choose one over the other.
Combining DataFrames with concat()
The concat() function combines DataFrames by stacking them - either adding rows on top of each other (vertically) or placing columns side by side (horizontally).
Stacking DataFrames Vertically (Adding Rows)
Vertical concatenation is the most common use of concat(). It appends one DataFrame's rows below another's:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
combined = pd.concat([df1, df2])
print(combined)
Output:
Name Age
0 Alice 25
1 Bob 30
0 Charlie 35
1 David 40
Notice that the index values are duplicated (both DataFrames have indices 0 and 1). To reset the index and get a clean sequential order, use ignore_index=True:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 40
Always use ignore_index=True when stacking DataFrames vertically unless you specifically need to preserve the original indices. Duplicate indices can cause confusion and unexpected behavior in later operations.
Stacking DataFrames Horizontally (Adding Columns)
To place DataFrames side by side, set axis=1:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'City': ['New York', 'Los Angeles'], 'Salary': [70000, 80000]})
combined = pd.concat([df1, df2], axis=1)
print(combined)
Output:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 80000
When concatenating horizontally, pandas aligns rows by their index values, not by position. If the indices don't match, you'll get NaN values:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice', 'Bob']}, index=[0, 1])
df2 = pd.DataFrame({'City': ['New York', 'Chicago']}, index=[1, 2])
combined = pd.concat([df1, df2], axis=1)
print(combined)
Output:
Name City
0 Alice NaN
1 Bob New York
2 NaN Chicago
Make sure your indices align properly, or reset them with reset_index() before concatenating.
Handling Mismatched Columns
When vertically stacking DataFrames with different columns, concat() fills missing values with NaN:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie'], 'Salary': [50000]})
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
Output:
Name Age Salary
0 Alice 25.0 NaN
1 Bob 30.0 NaN
2 Charlie NaN 50000.0
Combining DataFrames with merge()
The merge() function combines DataFrames based on shared column values, similar to SQL JOIN operations. This is the right tool when you need to match rows from two DataFrames based on a common key.
Inner Join (Default)
An inner join keeps only the rows where the key column has matching values in both DataFrames:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'David'],
'Salary': [50000, 60000, 70000]
})
merged = pd.merge(df1, df2, on='Name')
print(merged)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
Only Alice and Bob appear because they exist in both DataFrames. Charlie (only in df1) and David (only in df2) are excluded.
Outer Join
An outer join keeps all rows from both DataFrames, filling NaN where there's no match:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'David'],
'Salary': [50000, 60000, 70000]
})
merged = pd.merge(df1, df2, on='Name', how='outer')
print(merged)
Output:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 30.0 60000.0
2 Charlie 35.0 NaN
3 David NaN 70000.0
Left Join
A left join keeps all rows from the left DataFrame and adds matching data from the right:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'David'],
'Salary': [50000, 60000, 70000]
})
merged = pd.merge(df1, df2, on='Name', how='left')
print(merged)
Output:
Name Age Salary
0 Alice 25 50000.0
1 Bob 30 60000.0
2 Charlie 35 NaN
All three names from df1 are preserved. Charlie has NaN for Salary because there's no match in df2.
Right Join
A right join keeps all rows from the right DataFrame and adds matching data from the left:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'David'],
'Salary': [50000, 60000, 70000]
})
merged = pd.merge(df1, df2, on='Name', how='right')
print(merged)
Output:
Name Age Salary
0 Alice 25.0 50000
1 Bob 30.0 60000
2 David NaN 70000
Merging on Different Column Names
When the key columns have different names in each DataFrame, use left_on and right_on:
import pandas as pd
employees = pd.DataFrame({
'emp_name': ['Alice', 'Bob', 'Charlie'],
'department': ['Engineering', 'Marketing', 'Sales']
})
salaries = pd.DataFrame({
'worker': ['Alice', 'Bob', 'David'],
'salary': [90000, 75000, 65000]
})
merged = pd.merge(employees, salaries, left_on='emp_name', right_on='worker')
print(merged)
Output:
emp_name department worker salary
0 Alice Engineering Alice 90000
1 Bob Marketing Bob 75000
Merging on Multiple Columns
You can merge on multiple columns simultaneously by passing a list:
import pandas as pd
df1 = pd.DataFrame({
'First': ['Alice', 'Bob', 'Alice'],
'Last': ['Smith', 'Jones', 'Brown'],
'Age': [25, 30, 28]
})
df2 = pd.DataFrame({
'First': ['Alice', 'Bob', 'Alice'],
'Last': ['Smith', 'Jones', 'White'],
'Salary': [50000, 60000, 55000]
})
merged = pd.merge(df1, df2, on=['First', 'Last'])
print(merged)
Output:
First Last Age Salary
0 Alice Smith 25 50000
1 Bob Jones 30 60000
Only rows where both First and Last match are included.
When to Use concat() vs. merge()
| Scenario | Use |
|---|---|
| Stack DataFrames with the same columns (add more rows) | concat() |
| Place DataFrames side by side (add more columns) | concat(axis=1) |
| Combine multiple DataFrames at once | concat() |
| Join DataFrames on shared key columns | merge() |
| Replicate SQL-style joins (inner, outer, left, right) | merge() |
| Match rows based on values in one or more columns | merge() |
Comprehensive Comparison
| Feature | concat() | merge() |
|---|---|---|
| Purpose | Stack or concatenate along an axis | Combine based on shared columns |
| Direction | Vertical (axis=0) or horizontal (axis=1) | Joins based on column values |
| Join types | N/A | Inner, outer, left, right |
| Multiple DataFrames | ✅ Accepts a list of many DataFrames | ❌ Two DataFrames at a time |
| Key column required | ❌ No | ✅ Yes |
| Best for | Appending similar datasets | Relating datasets with shared keys |
Conclusion
Pandas gives you two powerful tools for combining DataFrames:
-
concat()is your go-to for stacking DataFrames - appending rows from multiple sources with the same structure or placing columns side by side. It's simple, fast, and can handle any number of DataFrames at once. -
merge()is your go-to for joining DataFrames based on shared column values - just like SQL joins. It gives you fine-grained control with inner, outer, left, and right join types. -
Choose
concat()when your DataFrames share the same structure and you want to stack them. -
Choose
merge()when your DataFrames share a key column and you need to combine related information.
Together, these two methods cover virtually every data combination scenario you'll encounter in pandas.