Skip to main content

Python Pandas: How to Merge, Join, and Concatenate DataFrames

Combining data from multiple sources is a core operation in data analysis. Pandas provides three primary methods for this - concat(), merge(), and join() - each designed for different scenarios. Understanding when and how to use each method is essential for efficient data manipulation. This guide explains all three approaches with practical examples, covers the different join types, and highlights common mistakes to avoid.

Overview: When to Use Each Method

MethodCombines Based OnBest For
concat()Position (rows or columns)Stacking DataFrames vertically or side by side
merge()Shared column valuesSQL-style joins on key columns
join()Index valuesCombining DataFrames using their index

Concatenating DataFrames with concat()

The concat() function stacks DataFrames along a specified axis. By default, it stacks them vertically (adding rows), which is useful when combining datasets that share the same column structure.

Vertical Concatenation (Stacking Rows)

import pandas as pd

df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

result = pd.concat([df1, df2, df3])
print(result)

Output:

      A    B    C    D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

All three DataFrames are stacked vertically, preserving their original indices.

Resetting the Index After Concatenation

Use ignore_index=True to create a fresh 0-based index:

import pandas as pd

df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

result = pd.concat([df1, df2, df3], ignore_index=True)
print(result)

Output:

      A    B    C    D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

Horizontal Concatenation (Adding Columns Side by Side)

Set axis=1 to place DataFrames next to each other:

import pandas as pd

df_a = pd.DataFrame({'A': [1, 2, 3]})
df_b = pd.DataFrame({'B': [4, 5, 6]})

result = pd.concat([df_a, df_b], axis=1)
print(result)

Output:

   A  B
0 1 4
1 2 5
2 3 6

Handling Mismatched Columns

When DataFrames have different columns, concat() fills missing values with NaN by default (outer join). Use join='inner' to keep only shared columns:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# Outer join (default): keeps all columns
outer = pd.concat([df1, df2], ignore_index=True)
print("Outer join:")
print(outer)

# Inner join: keeps only shared columns
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print("\nInner join:")
print(inner)

Output:

Outer join:
A B C
0 1 3.0 NaN
1 2 4.0 NaN
2 5 NaN 7.0
3 6 NaN 8.0

Inner join:
A
0 1
1 2
2 5
3 6

Merging DataFrames with merge()

The merge() function combines DataFrames based on shared column values, similar to SQL joins. It is the most flexible method for combining datasets that share key columns.

Inner Merge (Default)

An inner merge keeps only rows where the key exists in both DataFrames:

import pandas as pd

left = pd.DataFrame({
'Key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
'Key': ['K0', 'K1', 'K4', 'K5'],
'C': ['C0', 'C1', 'C4', 'C5'],
'D': ['D0', 'D1', 'D4', 'D5']
})

inner = pd.merge(left, right, how='inner', on='Key')
print(inner)

Output:

  Key   A   B   C   D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1

Only K0 and K1 exist in both DataFrames, so only those rows appear.

Left Merge

A left merge keeps all rows from the left DataFrame, filling unmatched right columns with NaN:

import pandas as pd

left = pd.DataFrame({
'Key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
'Key': ['K0', 'K1', 'K4', 'K5'],
'C': ['C0', 'C1', 'C4', 'C5'],
'D': ['D0', 'D1', 'D4', 'D5']
})

left_merged = pd.merge(left, right, how='left', on='Key')
print(left_merged)

Output:

  Key   A   B    C    D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN

Right Merge

A right merge keeps all rows from the right DataFrame:

import pandas as pd

left = pd.DataFrame({
'Key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
'Key': ['K0', 'K1', 'K4', 'K5'],
'C': ['C0', 'C1', 'C4', 'C5'],
'D': ['D0', 'D1', 'D4', 'D5']
})

right_merged = pd.merge(left, right, how='right', on='Key')
print(right_merged)

Output:

  Key    A    B   C   D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K4 NaN NaN C4 D4
3 K5 NaN NaN C5 D5

Outer Merge

An outer merge keeps all rows from both DataFrames:

import pandas as pd

left = pd.DataFrame({
'Key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
'Key': ['K0', 'K1', 'K4', 'K5'],
'C': ['C0', 'C1', 'C4', 'C5'],
'D': ['D0', 'D1', 'D4', 'D5']
})

outer_merged = pd.merge(left, right, how='outer', on='Key')
print(outer_merged)

Output:

  Key    A    B    C    D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN
4 K4 NaN NaN C4 D4
5 K5 NaN NaN C5 D5
tip

Think of merge join types like this:

  • Inner: only matching rows (intersection)
  • Left: all left rows + matching right rows
  • Right: all right rows + matching left rows
  • Outer: all rows from both sides (union)

Merging on Multiple Keys

You can merge on more than one column by passing a list to on:

import pandas as pd

left = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice'],
'Year': [2023, 2023, 2024],
'Score': [88, 92, 95]
})

right = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice'],
'Year': [2023, 2023, 2024],
'Grade': ['A', 'A+', 'A+']
})

result = pd.merge(left, right, on=['Name', 'Year'])
print(result)

Output:

    Name  Year  Score Grade
0 Alice 2023 88 A
1 Bob 2023 92 A+
2 Alice 2024 95 A+

Joining DataFrames with join()

The join() method combines DataFrames based on their index values. It is a convenient shortcut for index-based merging:

import pandas as pd

left = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
}, index=['K0', 'K1', 'K2', 'K3'])

right = pd.DataFrame({
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}, index=['K0', 'K1', 'K2', 'K3'])

result = left.join(right)
print(result)

Output:

     A   B   C   D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3

Joining with Mismatched Indices

When indices don't fully overlap, join() performs a left join by default:

import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'B': [4, 5, 6]}, index=['b', 'c', 'd'])

print("Left join (default):")
print(left.join(right))

print("\nOuter join:")
print(left.join(right, how='outer'))

Output:

Left join (default):
A B
a 1 NaN
b 2 4.0
c 3 5.0

Outer join:
A B
a 1.0 NaN
b 2.0 4.0
c 3.0 5.0
d NaN 6.0

Joining Multiple DataFrames at Once

A key advantage of join() is that it accepts a list of DataFrames:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])
df3 = pd.DataFrame({'C': [5, 6]}, index=['x', 'y'])

result = df1.join([df2, df3])
print(result)

Output:

   A  B  C
x 1 3 5
y 2 4 6

Common Mistake: Unexpected Duplicates After Merging

A frequent issue is getting more rows than expected after a merge due to duplicate keys in one or both DataFrames. Merge produces a row for every combination of matching keys:

import pandas as pd

left = pd.DataFrame({'Key': ['A', 'A', 'B'], 'Value': [1, 2, 3]})
right = pd.DataFrame({'Key': ['A', 'A', 'B'], 'Score': [10, 20, 30]})

result = pd.merge(left, right, on='Key')
print(result)
print(f"\nLeft rows: {len(left)}, Right rows: {len(right)}, Merged rows: {len(result)}")

Output:

  Key  Value  Score
0 A 1 10
1 A 1 20
2 A 2 10
3 A 2 20
4 B 3 30

Left rows: 3, Right rows: 3, Merged rows: 5

Key 'A' appears twice in both DataFrames, creating a 2×2 = 4 row cartesian product for that key alone.

To prevent this, use the validate parameter to enforce key uniqueness:

try:
result = pd.merge(left, right, on='Key', validate='one_to_one')
except pd.errors.MergeError as e:
print(f"MergeError: {e}")

Output:

MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
danger

Always check for duplicate keys before merging. Unexpected duplicates can silently multiply your rows, leading to inflated counts, incorrect aggregations, and misleading analysis results. Use validate='one_to_one', 'one_to_many', or 'many_to_one' to catch these issues early.

Quick Reference

OperationFunctionCombines OnSupports Multiple Join TypesAccepts Multiple DataFrames
Concatenatepd.concat()PositionInner, OuterYes (list)
Mergepd.merge()Column valuesInner, Left, Right, OuterNo (two at a time)
Joindf.join()IndexInner, Left, Right, OuterYes (list)

Use concat() when stacking DataFrames with the same structure, merge() when combining on shared column values, and join() when aligning DataFrames by their index.

For merging more than two DataFrames on a key column, chain merge() calls or use functools.reduce() for cleaner code.