Skip to main content

Python Pandas: How to Merge Multiple TSV Files by a Common Key Using Python Pandas

When working with real-world data, the information you need is often scattered across multiple files. For instance, customer details might be in one TSV file, account information in another, and transaction records in a third. To perform meaningful analysis, you need to merge these files into a single unified DataFrame using a common key column.

In this guide, you'll learn how to merge multiple TSV (Tab-Separated Values) files in Python using Pandas, handle different join types, and build a scalable approach that works whether you have 2 files or 200.

The General Approach

The strategy for merging multiple TSV files is straightforward:

  1. Read the first two TSV files into DataFrames.
  2. Merge them on a common key column using pd.merge().
  3. Iterate over any remaining files, merging each one into the accumulated result.
  4. Save the final merged DataFrame to a new TSV file.

This iterative approach scales well because each merge builds upon the previous result.

Merging TSV Files with an Inner Join

Suppose you have four TSV files: Customer.tsv, Account.tsv, Branch.tsv, and Loan.tsv, all sharing a common ID column, as follow:

Customer.tsv
ID	Name	Age
1 Alice 30
2 Bob 25
3 Charlie 28
4 David 40
Account.tsv
ID	Account_No	Balance
1 100234 5000
2 100567 3000
3 100891 4500
5 100999 7000
Branch.tsv
ID	Branch
1 Downtown
2 Midtown
3 Uptown
4 Suburb
6 Airport
Loan.tsv
ID	Loan_Amount
1 50000
2 30000
3 45000
7 60000
Course.tsv
Course_ID	Course_Name
101 Physics
102 Chemistry
103 Biology
104 Mathematics
Teacher.tsv
Course_ID	Teacher
101 Smith
102 Jones
104 Brown
105 Taylor
Credits.tsv
Course_ID	Credits
101 4
102 3
103 3
106 2
Marks.tsv
Course_ID	Marks
101 85
103 78
104 90
107 88

An inner join keeps only the rows where the key column has matching values in all files. This is useful when you want a complete dataset with no missing values.

import pandas as pd

# Read the first two TSV files
tsv1 = pd.read_csv("Customer.tsv", sep='\t')
tsv2 = pd.read_csv("Account.tsv", sep='\t')

# Merge on the common 'ID' column using an inner join
merged_df = pd.merge(tsv1, tsv2, on='ID', how='inner')

# List of remaining TSV files to merge
remaining_files = ["Branch.tsv", "Loan.tsv"]

# Iteratively merge each file into the result
for filename in remaining_files:
tsv = pd.read_csv(filename, sep='\t')
merged_df = pd.merge(merged_df, tsv, on='ID', how='inner')

# Save the final result
merged_df.to_csv("Output.tsv", sep="\t", header=True, index=False)

print(merged_df)

Output (example):

   ID     Name  Age  Account_No  Balance    Branch  Loan_Amount
0 1 Alice 30 100234 5000 Downtown 50000
1 2 Bob 25 100567 3000 Midtown 30000
2 3 Charlie 28 100891 4500 Uptown 45000

With an inner join, only rows where the ID exists in all four files appear in the result.

Merging TSV Files with an Outer Join

An outer join keeps all rows from all files. Where a key doesn't exist in a particular file, the missing values are filled with NaN. This is useful when you want to preserve every record, even if some data is incomplete.

import pandas as pd

# Read the first two TSV files
tsv1 = pd.read_csv("Course.tsv", sep='\t')
tsv2 = pd.read_csv("Teacher.tsv", sep='\t')

# Merge on the common 'Course_ID' column using an outer join
merged_df = pd.merge(tsv1, tsv2, on='Course_ID', how='outer')

# List of remaining TSV files to merge
remaining_files = ["Credits.tsv", "Marks.tsv"]

# Iteratively merge each file into the result
for filename in remaining_files:
tsv = pd.read_csv(filename, sep='\t')
merged_df = pd.merge(merged_df, tsv, on='Course_ID', how='outer')

# Save the result, replacing NaN with "NA" for readability
merged_df.to_csv("Output_outer.tsv", sep="\t", header=True, index=False, na_rep="NA")

print(merged_df)

Output (example):

   Course_ID  Course_Name Teacher  Credits  Marks
0 101 Physics Smith 4.0 85.0
1 102 Chemistry Jones 3.0 NaN
2 103 Biology NaN 3.0 78.0
3 104 Mathematics Brown NaN 90.0
4 105 NaN Taylor NaN NaN
5 106 NaN NaN 2.0 NaN
6 107 NaN NaN NaN 88.0
info

When saving with to_csv(), use na_rep="NA" to replace NaN values with a readable placeholder in the output file. This makes the TSV file easier to inspect manually.

A Scalable Function for Merging Any Number of TSV Files

For cleaner, reusable code, wrap the merge logic in a function:

import pandas as pd
from typing import List

def merge_tsv_files(
file_paths: List[str],
key_column: str,
join_type: str = 'inner'
) -> pd.DataFrame:
"""
Merge multiple TSV files on a common key column.

Args:
file_paths: List of paths to TSV files.
key_column: The column name shared across all files.
join_type: Type of join, 'inner', 'outer', 'left', or 'right'.

Returns:
A single merged DataFrame.
"""
if len(file_paths) < 2:
raise ValueError("At least two files are required for merging.")

# Read and merge the first two files
merged_df = pd.merge(
pd.read_csv(file_paths[0], sep='\t'),
pd.read_csv(file_paths[1], sep='\t'),
on=key_column,
how=join_type
)

# Merge remaining files iteratively
for path in file_paths[2:]:
tsv = pd.read_csv(path, sep='\t')
merged_df = pd.merge(merged_df, tsv, on=key_column, how=join_type)

return merged_df


# Usage
files = ["Customer.tsv", "Account.tsv", "Branch.tsv", "Loan.tsv"]
result = merge_tsv_files(files, key_column='ID', join_type='inner')

result.to_csv("Output.tsv", sep="\t", header=True, index=False)
print(result)

Output:

   ID     Name  Age  Account_No  Balance    Branch  Loan_Amount
0 1 Alice 30 100234 5000 Downtown 50000
1 2 Bob 25 100567 3000 Midtown 30000
2 3 Charlie 28 100891 4500 Uptown 45000

This function works with any number of TSV files and any join type, making it easy to reuse across projects.

Using glob to Automatically Discover TSV Files

If you have many TSV files in a directory and don't want to list them manually, use the glob module to find them automatically:

import glob
import pandas as pd
import os

# Find all TSV files in the current directory
tsv_files = sorted(glob.glob("*.tsv"))

if not tsv_files:
raise FileNotFoundError("No TSV files found in the current directory.")

print(f"Found {len(tsv_files)} files: {tsv_files}")

# Read the first file as the base DataFrame
merged_df = pd.read_csv(tsv_files[0], sep='\t')

# Merge remaining files iteratively
for filepath in tsv_files[1:]:
tsv = pd.read_csv(filepath, sep='\t')

# Determine common column(s) to merge on
common_cols = list(set(merged_df.columns).intersection(tsv.columns))

if 'ID' in common_cols:
key = 'ID'
merged_df = pd.merge(merged_df, tsv, on=key, how='outer')
elif len(common_cols) == 1:
key = common_cols[0] # If only one common column, use it
merged_df = pd.merge(merged_df, tsv, on=key, how='outer')
else:
print(f"No suitable key column to merge {os.path.basename(filepath)} with base DataFrame.")


print("Merged DataFrame:")
print(merged_df)

Output:

Found 8 files: ['Account.tsv', 'Branch.tsv', 'Course.tsv', 'Credits.tsv', 'Customer.tsv', 'Loan.tsv', 'Marks.tsv', 'Teacher.tsv']
No suitable key column to merge Course.tsv with base DataFrame.
No suitable key column to merge Credits.tsv with base DataFrame.
No suitable key column to merge Marks.tsv with base DataFrame.
No suitable key column to merge Teacher.tsv with base DataFrame.
Merged DataFrame:
ID Account_No Balance Branch Name Age Loan_Amount
0 1 100234.0 5000.0 Downtown Alice 30.0 50000.0
1 2 100567.0 3000.0 Midtown Bob 25.0 30000.0
2 3 100891.0 4500.0 Uptown Charlie 28.0 45000.0
3 4 NaN NaN Suburb David 40.0 NaN
4 5 100999.0 7000.0 NaN NaN NaN NaN
5 6 NaN NaN Airport NaN NaN NaN
6 7 NaN NaN NaN NaN NaN 60000.0
tip

Use sorted() on glob.glob() results to ensure files are processed in a consistent order. File system ordering can vary across operating systems.

Common Mistake: Mismatched Key Column Names

A frequent error occurs when the common key column has slightly different names across files (e.g., ID vs Id vs id):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ❌ This raises a KeyError because column names don't match exactly
try:
result = pd.merge(df1, df2, on='ID', how='inner')
except KeyError as e:
print(f"Error: {e}")

Output:

Error: 'ID'

The fix is to use left_on and right_on when column names differ:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ✅ Specify different column names for each DataFrame
result = pd.merge(df1, df2, left_on='ID', right_on='Id', how='inner')
result = result.drop(columns=['Id']) # Remove the duplicate key column
print(result)

Output:

   ID   Name  Age
0 1 Alice 25
1 2 Bob 30

Alternatively, rename the columns before merging to keep your code consistent:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ✅ Standardize column names first
df2 = df2.rename(columns={'Id': 'ID'})
result = pd.merge(df1, df2, on='ID', how='inner')
print(result)

Output:

   ID   Name  Age
0 1 Alice 25
1 2 Bob 30

Choosing the Right Join Type

Join TypeKeepsUse When
innerOnly rows with matching keys in all filesYou need complete records with no missing values
outerAll rows from all filesYou want to preserve every record, even with gaps
leftAll rows from the first fileYour base file is the "source of truth"
rightAll rows from the last merged fileThe newest file should drive the result

Summary

Merging multiple TSV files in Pandas follows a simple iterative pattern: read the first two files, merge them on a common key, and then loop through remaining files to merge each one into the accumulated result.

Use how='inner' when you need only complete matches, or how='outer' when you want to preserve all records regardless of gaps.

For scalable code, wrap the logic in a reusable function and consider using glob to automatically discover files in a directory.

Always verify that your key column names are consistent across files to avoid merge errors.