Python Pandas: How to Merge, Join, and Concatenate DataFrames

Combining data from multiple sources is a core operation in data analysis. Pandas provides three primary methods for this - concat(), merge(), and join() - each designed for different scenarios. Understanding when and how to use each method is essential for efficient data manipulation. This guide explains all three approaches with practical examples, covers the different join types, and highlights common mistakes to avoid.

Overview: When to Use Each Method

Method	Combines Based On	Best For
`concat()`	Position (rows or columns)	Stacking DataFrames vertically or side by side
`merge()`	Shared column values	SQL-style joins on key columns
`join()`	Index values	Combining DataFrames using their index

Concatenating DataFrames with `concat()`

The concat() function stacks DataFrames along a specified axis. By default, it stacks them vertically (adding rows), which is useful when combining datasets that share the same column structure.

Vertical Concatenation (Stacking Rows)

import pandas as pd

df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
    'A': ['A8', 'A9', 'A10', 'A11'],
    'B': ['B8', 'B9', 'B10', 'B11'],
    'C': ['C8', 'C9', 'C10', 'C11'],
    'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

result = pd.concat([df1, df2, df3])
print(result)

Output:

      A    B    C    D
  A0   B0   C0   D0
  A1   B1   C1   D1
  A2   B2   C2   D2
  A3   B3   C3   D3
  A4   B4   C4   D4
  A5   B5   C5   D5
  A6   B6   C6   D6
  A7   B7   C7   D7
  A8   B8   C8   D8
  A9   B9   C9   D9
A10  B10  C10  D10
A11  B11  C11  D11

All three DataFrames are stacked vertically, preserving their original indices.

Resetting the Index After Concatenation

Use ignore_index=True to create a fresh 0-based index:

import pandas as pd

df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
    'A': ['A8', 'A9', 'A10', 'A11'],
    'B': ['B8', 'B9', 'B10', 'B11'],
    'C': ['C8', 'C9', 'C10', 'C11'],
    'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

result = pd.concat([df1, df2, df3], ignore_index=True)
print(result)

Output:

      A    B    C    D
  A0   B0   C0   D0
  A1   B1   C1   D1
  A2   B2   C2   D2
  A3   B3   C3   D3
  A4   B4   C4   D4
  A5   B5   C5   D5
  A6   B6   C6   D6
  A7   B7   C7   D7
  A8   B8   C8   D8
  A9   B9   C9   D9
A10  B10  C10  D10
A11  B11  C11  D11

Horizontal Concatenation (Adding Columns Side by Side)

Set axis=1 to place DataFrames next to each other:

import pandas as pd

df_a = pd.DataFrame({'A': [1, 2, 3]})
df_b = pd.DataFrame({'B': [4, 5, 6]})

result = pd.concat([df_a, df_b], axis=1)
print(result)

Output:

Handling Mismatched Columns

When DataFrames have different columns, concat() fills missing values with NaN by default (outer join). Use join='inner' to keep only shared columns:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# Outer join (default): keeps all columns
outer = pd.concat([df1, df2], ignore_index=True)
print("Outer join:")
print(outer)

# Inner join: keeps only shared columns
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print("\nInner join:")
print(inner)

Output:

Outer join:
   A    B    C
0  1  3.0  NaN
1  2  4.0  NaN
2  5  NaN  7.0
3  6  NaN  8.0

Inner join:
   A
0  1
1  2
2  5
3  6

Merging DataFrames with `merge()`

The merge() function combines DataFrames based on shared column values, similar to SQL joins. It is the most flexible method for combining datasets that share key columns.

Inner Merge (Default)

An inner merge keeps only rows where the key exists in both DataFrames:

import pandas as pd

left = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
    'Key': ['K0', 'K1', 'K4', 'K5'],
    'C': ['C0', 'C1', 'C4', 'C5'],
    'D': ['D0', 'D1', 'D4', 'D5']
})

inner = pd.merge(left, right, how='inner', on='Key')
print(inner)

Output:

  Key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1

Only K0 and K1 exist in both DataFrames, so only those rows appear.

Left Merge

A left merge keeps all rows from the left DataFrame, filling unmatched right columns with NaN:

import pandas as pd

left = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
    'Key': ['K0', 'K1', 'K4', 'K5'],
    'C': ['C0', 'C1', 'C4', 'C5'],
    'D': ['D0', 'D1', 'D4', 'D5']
})

left_merged = pd.merge(left, right, how='left', on='Key')
print(left_merged)

Output:

  Key   A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1   C1   D1
K2  A2  B2  NaN  NaN
K3  A3  B3  NaN  NaN

Right Merge

A right merge keeps all rows from the right DataFrame:

import pandas as pd

left = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
    'Key': ['K0', 'K1', 'K4', 'K5'],
    'C': ['C0', 'C1', 'C4', 'C5'],
    'D': ['D0', 'D1', 'D4', 'D5']
})

right_merged = pd.merge(left, right, how='right', on='Key')
print(right_merged)

Output:

  Key    A    B   C   D
K0   A0   B0  C0  D0
K1   A1   B1  C1  D1
K4  NaN  NaN  C4  D4
K5  NaN  NaN  C5  D5

Outer Merge

An outer merge keeps all rows from both DataFrames:

import pandas as pd

left = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

right = pd.DataFrame({
    'Key': ['K0', 'K1', 'K4', 'K5'],
    'C': ['C0', 'C1', 'C4', 'C5'],
    'D': ['D0', 'D1', 'D4', 'D5']
})

outer_merged = pd.merge(left, right, how='outer', on='Key')
print(outer_merged)

Output:

  Key    A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1   C1   D1
K2   A2   B2  NaN  NaN
K3   A3   B3  NaN  NaN
K4  NaN  NaN   C4   D4
K5  NaN  NaN   C5   D5

tip

Think of merge join types like this:

Inner: only matching rows (intersection)
Left: all left rows + matching right rows
Right: all right rows + matching left rows
Outer: all rows from both sides (union)

Merging on Multiple Keys

You can merge on more than one column by passing a list to on:

import pandas as pd

left = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Year': [2023, 2023, 2024],
    'Score': [88, 92, 95]
})

right = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Year': [2023, 2023, 2024],
    'Grade': ['A', 'A+', 'A+']
})

result = pd.merge(left, right, on=['Name', 'Year'])
print(result)

Output:

    Name  Year  Score Grade
Alice  2023     88     A
  Bob  2023     92    A+
Alice  2024     95    A+

Joining DataFrames with `join()`

The join() method combines DataFrames based on their index values. It is a convenient shortcut for index-based merging:

import pandas as pd

left = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
}, index=['K0', 'K1', 'K2', 'K3'])

right = pd.DataFrame({
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=['K0', 'K1', 'K2', 'K3'])

result = left.join(right)
print(result)

Output:

     A   B   C   D
K0  A0  B0  C0  D0
K1  A1  B1  C1  D1
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3

Joining with Mismatched Indices

When indices don't fully overlap, join() performs a left join by default:

import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'B': [4, 5, 6]}, index=['b', 'c', 'd'])

print("Left join (default):")
print(left.join(right))

print("\nOuter join:")
print(left.join(right, how='outer'))

Output:

Left join (default):
   A    B
a  1  NaN
b  2  4.0
c  3  5.0

Outer join:
     A    B
a  1.0  NaN
b  2.0  4.0
c  3.0  5.0
d  NaN  6.0

Joining Multiple DataFrames at Once

A key advantage of join() is that it accepts a list of DataFrames:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])
df3 = pd.DataFrame({'C': [5, 6]}, index=['x', 'y'])

result = df1.join([df2, df3])
print(result)

Output:

   A  B  C
x  1  3  5
y  2  4  6

Common Mistake: Unexpected Duplicates After Merging

A frequent issue is getting more rows than expected after a merge due to duplicate keys in one or both DataFrames. Merge produces a row for every combination of matching keys:

import pandas as pd

left = pd.DataFrame({'Key': ['A', 'A', 'B'], 'Value': [1, 2, 3]})
right = pd.DataFrame({'Key': ['A', 'A', 'B'], 'Score': [10, 20, 30]})

result = pd.merge(left, right, on='Key')
print(result)
print(f"\nLeft rows: {len(left)}, Right rows: {len(right)}, Merged rows: {len(result)}")

Output:

  Key  Value  Score
 A      1     10
 A      1     20
 A      2     10
 A      2     20
 B      3     30

Left rows: 3, Right rows: 3, Merged rows: 5

Key 'A' appears twice in both DataFrames, creating a 2×2 = 4 row cartesian product for that key alone.

To prevent this, use the validate parameter to enforce key uniqueness:

try:
    result = pd.merge(left, right, on='Key', validate='one_to_one')
except pd.errors.MergeError as e:
    print(f"MergeError: {e}")

Output:

MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge

danger

Always check for duplicate keys before merging. Unexpected duplicates can silently multiply your rows, leading to inflated counts, incorrect aggregations, and misleading analysis results. Use validate='one_to_one', 'one_to_many', or 'many_to_one' to catch these issues early.

Quick Reference

Operation	Function	Combines On	Supports Multiple Join Types	Accepts Multiple DataFrames
Concatenate	`pd.concat()`	Position	Inner, Outer	Yes (list)
Merge	`pd.merge()`	Column values	Inner, Left, Right, Outer	No (two at a time)
Join	`df.join()`	Index	Inner, Left, Right, Outer	Yes (list)

Use concat() when stacking DataFrames with the same structure, merge() when combining on shared column values, and join() when aligning DataFrames by their index.

For merging more than two DataFrames on a key column, chain merge() calls or use functools.reduce() for cleaner code.

Overview: When to Use Each Method​

Concatenating DataFrames with concat()​

Vertical Concatenation (Stacking Rows)​

Resetting the Index After Concatenation​

Horizontal Concatenation (Adding Columns Side by Side)​

Handling Mismatched Columns​

Merging DataFrames with merge()​

Inner Merge (Default)​

Left Merge​

Right Merge​

Outer Merge​

Merging on Multiple Keys​

Joining DataFrames with join()​

Joining with Mismatched Indices​

Joining Multiple DataFrames at Once​

Common Mistake: Unexpected Duplicates After Merging​

Quick Reference​

Table of Contents

Overview: When to Use Each Method

Concatenating DataFrames with `concat()`

Vertical Concatenation (Stacking Rows)

Resetting the Index After Concatenation

Horizontal Concatenation (Adding Columns Side by Side)

Handling Mismatched Columns

Merging DataFrames with `merge()`

Inner Merge (Default)

Left Merge

Right Merge

Outer Merge

Merging on Multiple Keys

Joining DataFrames with `join()`

Joining with Mismatched Indices

Joining Multiple DataFrames at Once

Common Mistake: Unexpected Duplicates After Merging

Quick Reference