How to Speed Up Pandas with Modin in Python
Pandas is the go-to library for data manipulation in Python, but it was designed to run on a single CPU core. When working with large datasets (tens of gigabytes or more), operations like reading files, filling missing values, and aggregating data can become painfully slow. Modin is a drop-in replacement for Pandas that distributes operations across all available CPU cores, often delivering significant speedups with a single-line change. This guide explains how to set up Modin, demonstrates its performance advantages, and covers important considerations.
What Is Modin and How Does It Work?
Modin is a Python library that provides the same API as Pandas but parallelizes operations behind the scenes. Instead of processing data on a single core, Modin partitions the DataFrame across multiple cores and executes operations concurrently.
Pandas: [Single Core] -> processes entire DataFrame sequentially

Modin:  [Core 1] -> partition 1 \
        [Core 2] -> partition 2  \
        [Core 3] -> partition 3  / -> combined result
        [Core 4] -> partition 4 /
The key benefit is that you don't need to learn a new API. Modin aims for full compatibility with the Pandas API, so switching requires minimal code changes.
Installation
Install Modin with your preferred parallel execution backend:
# Using Ray as the backend (recommended)
pip install "modin[ray]"
# Using Dask as the backend
pip install "modin[dask]"
# Install all backends
pip install "modin[all]"
The One-Line Change
The core idea is replacing your Pandas import with Modin's Pandas module. Everything else stays the same:
# Before (standard Pandas)
import pandas as pd
# After (Modin - parallelized Pandas)
import modin.pandas as pd
That's it. All your existing Pandas code, including read_csv(), fillna(), groupby(), and merge(), works with this single import change.
Example 1: Speeding Up DataFrame Concatenation
This example demonstrates the performance difference when repeatedly concatenating a DataFrame:
import time
import pandas as pd
import modin.pandas as mpd
# Sample data
data = {
'Name': ['Tom', 'Nick', 'Krish', 'Jack', 'Ash', 'Singh', 'Shilpa', 'Nav'],
'Age': [20, 21, 19, 18, 6, 12, 18, 20]
}
# --- Pandas ---
df = pd.DataFrame(data)
start = time.time()
frames = [df] * 1024 # Create 1024 copies
result_pd = pd.concat(frames, ignore_index=True)
pandas_time = time.time() - start
print(f"Pandas concat time: {pandas_time:.4f} seconds")
print(f"Result shape: {result_pd.shape}")
# --- Modin ---
modin_df = mpd.DataFrame(data)
start = time.time()
frames_modin = [modin_df] * 1024
result_modin = mpd.concat(frames_modin, ignore_index=True)
modin_time = time.time() - start
print(f"Modin concat time: {modin_time:.4f} seconds")
print(f"Speedup: {pandas_time / modin_time:.1f}x")
Output (approximate - varies by hardware):
Pandas concat time: 0.6829 seconds
Result shape: (8192, 2)
Modin concat time: 0.0277 seconds
Speedup: 24.7x
Note that the append() method was deprecated in Pandas 1.4 and removed in Pandas 2.0. Use pd.concat() instead; it works identically in both Pandas and Modin.
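For reference, here is a minimal sketch of migrating a removed append() call to pd.concat(); the column names are illustrative, and the same pattern works unchanged with modin.pandas:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [20, 21]})
new_row = pd.DataFrame({"Name": ["Krish"], "Age": [19]})

# Old (removed in Pandas 2.0): df = df.append(new_row, ignore_index=True)
# New: concatenate the frames instead
df = pd.concat([df, new_row], ignore_index=True)
print(df.shape)  # (3, 2)
```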
Example 2: Speeding Up fillna() on a Large Dataset
Operations that scan the entire DataFrame benefit significantly from parallelization. Here, fillna() replaces all NaN values across a large CSV file:
import time
import pandas as pd
import modin.pandas as mpd
# --- Pandas ---
df = pd.read_csv("large_dataset.csv") # ~600 MB file
start = time.time()
df = df.fillna(value=0)
pandas_time = time.time() - start
print(f"Pandas fillna: {pandas_time:.2f} seconds")
# --- Modin ---
modin_df = mpd.read_csv("large_dataset.csv")
start = time.time()
modin_df = modin_df.fillna(value=0)
modin_time = time.time() - start
print(f"Modin fillna: {modin_time:.2f} seconds")
print(f"Speedup: {pandas_time / modin_time:.1f}x")
Output (approximate):
Pandas fillna: 1.20 seconds
Modin fillna: 0.27 seconds
Speedup: 4.4x
Example 3: Speeding Up read_csv()
Reading large CSV files is often the first bottleneck. Modin parallelizes the file reading process itself:
import time
import pandas as pd
import modin.pandas as mpd
# --- Pandas ---
start = time.time()
df = pd.read_csv("large_dataset.csv")
print(f"Pandas read_csv: {time.time() - start:.2f} seconds")
# --- Modin ---
start = time.time()
modin_df = mpd.read_csv("large_dataset.csv")
print(f"Modin read_csv: {time.time() - start:.2f} seconds")
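If you don't have a large CSV on hand, you can generate a synthetic one to experiment with. This sketch uses plain Pandas and NumPy; the file name matches the examples above, and the row count is illustrative (scale it up for a realistically large file):

```python
import numpy as np
import pandas as pd

rows = 100_000  # increase by 100x or more for a multi-GB benchmark file
df = pd.DataFrame({
    "id": np.arange(rows),
    "value": np.random.rand(rows),
    "flag": np.random.choice(["a", "b", "c"], size=rows),
})
df.to_csv("large_dataset.csv", index=False)
```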
Configuring Modin
Limiting CPU Usage
By default, Modin uses all available cores. To limit it (for example, to leave some cores free for other processes), set the MODIN_CPUS environment variable:
import os
os.environ["MODIN_CPUS"] = "4" # Use only 4 cores
import modin.pandas as pd
The MODIN_CPUS environment variable must be set before importing Modin. Setting it afterward has no effect.
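Modin also exposes a programmatic configuration API in modin.config, which avoids string-typed environment variables. A sketch, assuming Modin is installed:

```python
import modin.config as cfg

# Equivalent to setting MODIN_CPUS; must run before the first
# Modin operation initializes the execution engine.
cfg.CpuCount.put(4)

import modin.pandas as pd
```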
Choosing the Backend
Modin supports two main execution backends (called "engines" in Modin's terminology): Ray and Dask. Select one via the MODIN_ENGINE environment variable:

import os
os.environ["MODIN_ENGINE"] = "ray" # or "dask"
import modin.pandas as pd

Ray is generally the recommended engine and is the default when it is installed. Dask may be preferred if you're already using the Dask ecosystem.
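The engine can also be selected programmatically through Modin's modin.config module; a sketch, assuming Modin and Dask are installed:

```python
import modin.config as cfg

# Must run before the first Modin operation starts the engine.
cfg.Engine.put("dask")

import modin.pandas as pd
```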
When Modin Helps Most
Modin provides the biggest speedups in specific scenarios:
| Scenario | Expected Speedup | Why |
|---|---|---|
| Large CSV file reading | 2-5x | File is read in parallel chunks |
| Element-wise operations (fillna, apply, replace) | 3-10x | Work is distributed across cores |
| Concatenation of many DataFrames | 10-25x | Partitions are combined in parallel |
| GroupBy and aggregation | 2-5x | Groups are processed concurrently |
| Small DataFrames (< 1 MB) | None or slower | Parallelization overhead exceeds benefit |
When Modin May Not Help
Modin is not always faster. In some cases, the overhead of distributing work outweighs the benefits:
import modin.pandas as pd
# Small DataFrames - Modin overhead makes it slower than Pandas
small_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = small_df.sum() # Parallelization overhead > computation time
Modin shines with large datasets (hundreds of MBs to GBs) and operations that touch many cells. For small DataFrames or quick one-off calculations, standard Pandas is often faster due to lower overhead.
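A simple way to decide is to time the operation yourself before switching. This sketch uses only plain Pandas and the standard library, so it runs whether or not Modin is installed; the DataFrame size and operation are illustrative:

```python
import time
import pandas as pd

def time_op(fn, repeats=3):
    """Return the best wall-clock time of fn over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

small_df = pd.DataFrame({"A": range(1000), "B": range(1000)})
elapsed = time_op(lambda: small_df.fillna(0))
print(f"fillna on 1,000 rows: {elapsed:.6f} s")
```

If the Pandas timing is already a small fraction of a second, the overhead of distributing the work is unlikely to pay off.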
Falling Back to Pandas Automatically
If Modin encounters an operation it hasn't implemented yet, it automatically falls back to Pandas and displays a warning:
UserWarning: ... defaulting to pandas implementation.
This means your code won't break: it will simply run at Pandas speed for that specific operation. Over time, Modin's coverage of the Pandas API continues to improve.
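If these fallback notices clutter your logs, they can be silenced with the standard warnings module. A sketch; the message pattern is based on the warning text shown above:

```python
import warnings

# Hide Modin's "defaulting to pandas" notices; other UserWarnings
# are unaffected because the filter matches on the message text.
warnings.filterwarnings(
    "ignore",
    message=".*defaulting to pandas.*",
    category=UserWarning,
)
```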
Complete Migration Example
Here is a typical data processing script converted from Pandas to Modin:
# Before: Standard Pandas
# import pandas as pd
# After: Modin (one-line change)
import modin.pandas as pd
# Everything else stays exactly the same
df = pd.read_csv("sales_data.csv")
# Clean the data
df = df.fillna(0)
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Amount'] > 0]
# Aggregate
summary = df.groupby('Region')['Amount'].agg(['sum', 'mean', 'count'])
summary = summary.sort_values('sum', ascending=False)
print(summary.head(10))
# Save results
summary.to_csv("sales_summary.csv")
No other changes are needed. The same code runs on all available cores.
Quick Reference
| Task | Code |
|---|---|
| Install Modin | pip install "modin[ray]" |
| Switch from Pandas to Modin | import modin.pandas as pd |
| Limit CPU cores | os.environ["MODIN_CPUS"] = "4" |
| Choose backend | os.environ["MODIN_ENGINE"] = "ray" |
| Convert Modin DF to Pandas DF | pandas_df = modin_df._to_pandas() |
Modin offers one of the simplest ways to accelerate Pandas workflows. By changing a single import line, you can leverage all CPU cores on your machine and achieve significant speedups on large datasets, without learning a new API or rewriting any of your existing code.