Python Pandas: How to Merge Multiple TSV Files by a Common Key Using Python Pandas

When working with real-world data, the information you need is often scattered across multiple files. For instance, customer details might be in one TSV file, account information in another, and transaction records in a third. To perform meaningful analysis, you need to merge these files into a single unified DataFrame using a common key column.

In this guide, you'll learn how to merge multiple TSV (Tab-Separated Values) files in Python using Pandas, handle different join types, and build a scalable approach that works whether you have 2 files or 200.

The General Approach

The strategy for merging multiple TSV files is straightforward:

Read the first two TSV files into DataFrames.
Merge them on a common key column using pd.merge().
Iterate over any remaining files, merging each one into the accumulated result.
Save the final merged DataFrame to a new TSV file.

This iterative approach scales well because each merge builds upon the previous result.

Merging TSV Files with an Inner Join

Suppose you have four TSV files: Customer.tsv, Account.tsv, Branch.tsv, and Loan.tsv, all sharing a common ID column, as follow:

Customer.tsv
ID	Name	Age
1	Alice	30
2	Bob	25
3	Charlie	28
4	David	40

Account.tsv
ID	Account_No	Balance
1	100234	5000
2	100567	3000
3	100891	4500
5	100999	7000

Branch.tsv
ID	Branch
Downtown
Midtown
Uptown
Suburb
Airport

Loan.tsv
ID	Loan_Amount
1	50000
2	30000
3	45000
7	60000

Course.tsv
Course_ID	Course_Name
101	Physics
102	Chemistry
103	Biology
104	Mathematics

Teacher.tsv
Course_ID	Teacher
101	Smith
102	Jones
104	Brown
105	Taylor

Credits.tsv
Course_ID	Credits
101	4
102	3
103	3
106	2

Marks.tsv
Course_ID	Marks
101	85
103	78
104	90
107	88

An inner join keeps only the rows where the key column has matching values in all files. This is useful when you want a complete dataset with no missing values.

import pandas as pd

# Read the first two TSV files
tsv1 = pd.read_csv("Customer.tsv", sep='\t')
tsv2 = pd.read_csv("Account.tsv", sep='\t')

# Merge on the common 'ID' column using an inner join
merged_df = pd.merge(tsv1, tsv2, on='ID', how='inner')

# List of remaining TSV files to merge
remaining_files = ["Branch.tsv", "Loan.tsv"]

# Iteratively merge each file into the result
for filename in remaining_files:
    tsv = pd.read_csv(filename, sep='\t')
    merged_df = pd.merge(merged_df, tsv, on='ID', how='inner')

# Save the final result
merged_df.to_csv("Output.tsv", sep="\t", header=True, index=False)

print(merged_df)

Output (example):

   ID     Name  Age  Account_No  Balance    Branch  Loan_Amount
 1    Alice   30      100234     5000  Downtown        50000
 2      Bob   25      100567     3000   Midtown        30000
 3  Charlie   28      100891     4500    Uptown        45000

With an inner join, only rows where the ID exists in all four files appear in the result.

Merging TSV Files with an Outer Join

An outer join keeps all rows from all files. Where a key doesn't exist in a particular file, the missing values are filled with NaN. This is useful when you want to preserve every record, even if some data is incomplete.

import pandas as pd

# Read the first two TSV files
tsv1 = pd.read_csv("Course.tsv", sep='\t')
tsv2 = pd.read_csv("Teacher.tsv", sep='\t')

# Merge on the common 'Course_ID' column using an outer join
merged_df = pd.merge(tsv1, tsv2, on='Course_ID', how='outer')

# List of remaining TSV files to merge
remaining_files = ["Credits.tsv", "Marks.tsv"]

# Iteratively merge each file into the result
for filename in remaining_files:
    tsv = pd.read_csv(filename, sep='\t')
    merged_df = pd.merge(merged_df, tsv, on='Course_ID', how='outer')

# Save the result, replacing NaN with "NA" for readability
merged_df.to_csv("Output_outer.tsv", sep="\t", header=True, index=False, na_rep="NA")

print(merged_df)

Output (example):

   Course_ID  Course_Name Teacher  Credits  Marks
      101      Physics   Smith      4.0   85.0
      102    Chemistry   Jones      3.0    NaN
      103      Biology     NaN      3.0   78.0
      104  Mathematics   Brown      NaN   90.0
      105          NaN  Taylor      NaN    NaN
      106          NaN     NaN      2.0    NaN
      107          NaN     NaN      NaN   88.0

info

When saving with to_csv(), use na_rep="NA" to replace NaN values with a readable placeholder in the output file. This makes the TSV file easier to inspect manually.

A Scalable Function for Merging Any Number of TSV Files

For cleaner, reusable code, wrap the merge logic in a function:

import pandas as pd
from typing import List

def merge_tsv_files(
    file_paths: List[str],
    key_column: str,
    join_type: str = 'inner'
) -> pd.DataFrame:
    """
    Merge multiple TSV files on a common key column.

    Args:
        file_paths: List of paths to TSV files.
        key_column: The column name shared across all files.
        join_type: Type of join, 'inner', 'outer', 'left', or 'right'.

    Returns:
        A single merged DataFrame.
    """
    if len(file_paths) < 2:
        raise ValueError("At least two files are required for merging.")

    # Read and merge the first two files
    merged_df = pd.merge(
        pd.read_csv(file_paths[0], sep='\t'),
        pd.read_csv(file_paths[1], sep='\t'),
        on=key_column,
        how=join_type
    )

    # Merge remaining files iteratively
    for path in file_paths[2:]:
        tsv = pd.read_csv(path, sep='\t')
        merged_df = pd.merge(merged_df, tsv, on=key_column, how=join_type)

    return merged_df


# Usage
files = ["Customer.tsv", "Account.tsv", "Branch.tsv", "Loan.tsv"]
result = merge_tsv_files(files, key_column='ID', join_type='inner')

result.to_csv("Output.tsv", sep="\t", header=True, index=False)
print(result)

Output:

   ID     Name  Age  Account_No  Balance    Branch  Loan_Amount
 1    Alice   30      100234     5000  Downtown        50000
 2      Bob   25      100567     3000   Midtown        30000
 3  Charlie   28      100891     4500    Uptown        45000

This function works with any number of TSV files and any join type, making it easy to reuse across projects.

Using `glob` to Automatically Discover TSV Files

If you have many TSV files in a directory and don't want to list them manually, use the glob module to find them automatically:

import glob
import pandas as pd
import os

# Find all TSV files in the current directory
tsv_files = sorted(glob.glob("*.tsv"))

if not tsv_files:
    raise FileNotFoundError("No TSV files found in the current directory.")

print(f"Found {len(tsv_files)} files: {tsv_files}")

# Read the first file as the base DataFrame
merged_df = pd.read_csv(tsv_files[0], sep='\t')

# Merge remaining files iteratively
for filepath in tsv_files[1:]:
    tsv = pd.read_csv(filepath, sep='\t')
    
    # Determine common column(s) to merge on
    common_cols = list(set(merged_df.columns).intersection(tsv.columns))
    
    if 'ID' in common_cols:
        key = 'ID'
        merged_df = pd.merge(merged_df, tsv, on=key, how='outer')
    elif len(common_cols) == 1:
        key = common_cols[0]  # If only one common column, use it
        merged_df = pd.merge(merged_df, tsv, on=key, how='outer')
    else:
        print(f"No suitable key column to merge {os.path.basename(filepath)} with base DataFrame.")

    
print("Merged DataFrame:")
print(merged_df)

Output:

Found 8 files: ['Account.tsv', 'Branch.tsv', 'Course.tsv', 'Credits.tsv', 'Customer.tsv', 'Loan.tsv', 'Marks.tsv', 'Teacher.tsv']
No suitable key column to merge Course.tsv with base DataFrame.
No suitable key column to merge Credits.tsv with base DataFrame.
No suitable key column to merge Marks.tsv with base DataFrame.
No suitable key column to merge Teacher.tsv with base DataFrame.
Merged DataFrame:
   ID  Account_No  Balance    Branch     Name   Age  Loan_Amount
0   1    100234.0   5000.0  Downtown    Alice  30.0      50000.0
1   2    100567.0   3000.0   Midtown      Bob  25.0      30000.0
2   3    100891.0   4500.0    Uptown  Charlie  28.0      45000.0
3   4         NaN      NaN    Suburb    David  40.0          NaN
4   5    100999.0   7000.0       NaN      NaN   NaN          NaN
5   6         NaN      NaN   Airport      NaN   NaN          NaN
6   7         NaN      NaN       NaN      NaN   NaN      60000.0

tip

Use sorted() on glob.glob() results to ensure files are processed in a consistent order. File system ordering can vary across operating systems.

Common Mistake: Mismatched Key Column Names

A frequent error occurs when the common key column has slightly different names across files (e.g., ID vs Id vs id):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ❌ This raises a KeyError because column names don't match exactly
try:
    result = pd.merge(df1, df2, on='ID', how='inner')
except KeyError as e:
    print(f"Error: {e}")

Output:

Error: 'ID'

The fix is to use left_on and right_on when column names differ:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ✅ Specify different column names for each DataFrame
result = pd.merge(df1, df2, left_on='ID', right_on='Id', how='inner')
result = result.drop(columns=['Id'])  # Remove the duplicate key column
print(result)

Output:

   ID   Name  Age
0   1  Alice   25
1   2    Bob   30

Alternatively, rename the columns before merging to keep your code consistent:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Id': [1, 2], 'Age': [25, 30]})

# ✅ Standardize column names first
df2 = df2.rename(columns={'Id': 'ID'})
result = pd.merge(df1, df2, on='ID', how='inner')
print(result)

Output:

   ID   Name  Age
0   1  Alice   25
1   2    Bob   30

Choosing the Right Join Type

Join Type	Keeps	Use When
`inner`	Only rows with matching keys in all files	You need complete records with no missing values
`outer`	All rows from all files	You want to preserve every record, even with gaps
`left`	All rows from the first file	Your base file is the "source of truth"
`right`	All rows from the last merged file	The newest file should drive the result

Summary

Merging multiple TSV files in Pandas follows a simple iterative pattern: read the first two files, merge them on a common key, and then loop through remaining files to merge each one into the accumulated result.

Use how='inner' when you need only complete matches, or how='outer' when you want to preserve all records regardless of gaps.

For scalable code, wrap the logic in a reusable function and consider using glob to automatically discover files in a directory.

Always verify that your key column names are consistent across files to avoid merge errors.

The General Approach​

Merging TSV Files with an Inner Join​

Merging TSV Files with an Outer Join​

A Scalable Function for Merging Any Number of TSV Files​

Using glob to Automatically Discover TSV Files​

Common Mistake: Mismatched Key Column Names​

Choosing the Right Join Type​

Summary​

Table of Contents

The General Approach

Merging TSV Files with an Inner Join

Merging TSV Files with an Outer Join

A Scalable Function for Merging Any Number of TSV Files

Using `glob` to Automatically Discover TSV Files

Common Mistake: Mismatched Key Column Names

Choosing the Right Join Type

Summary