How to Speed Up Pandas with Modin in Python

Pandas is the go-to library for data manipulation in Python, but it was designed to run on a single CPU core. When working with large datasets (tens of gigabytes or more), operations like reading files, filling missing values, and aggregating data can become painfully slow. Modin is a drop-in replacement for Pandas that distributes operations across all available CPU cores, often delivering significant speedups with just a single line change. This guide explains how to set up Modin, demonstrates its performance advantages, and covers important considerations.

What Is Modin and How Does It Work?

Modin is a Python library that provides the same API as Pandas but parallelizes operations behind the scenes. Instead of processing data on a single core, Modin partitions the DataFrame across multiple cores and executes operations concurrently.

Pandas:  [Single Core] -> processes entire DataFrame sequentially

Modin:   [Core 1] -> partition 1 \
         [Core 2] -> partition 2  \
         [Core 3] -> partition 3  /-> combines results
         [Core 4] -> partition 4 /

The key benefit is that you don't need to learn a new API: Modin aims for full compatibility with the Pandas API, so switching requires minimal code changes.
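The split-apply-combine idea behind the diagram can be sketched with nothing but the standard library. This is a toy illustration of the concept only, not Modin's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_missing(chunk):
    """Replace None with 0 in one partition (a stand-in for fillna)."""
    return [0 if v is None else v for v in chunk]

data = [1, None, 3, None, 5, 6, None, 8]

# Split the data into partitions, process each concurrently, combine results
partitions = [data[i:i + 2] for i in range(0, len(data), 2)]
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = pool.map(fill_missing, partitions)
combined = [v for chunk in processed for v in chunk]
print(combined)  # [1, 0, 3, 0, 5, 6, 0, 8]
```

Modin applies the same pattern to DataFrame partitions, using Ray or Dask to schedule the work across cores.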

Installation

Install Modin with your preferred parallel execution backend:

# Using Ray as the backend (recommended)
pip install "modin[ray]"

# Using Dask as the backend
pip install "modin[dask]"

# Install all backends
pip install "modin[all]"

The One-Line Change

The core idea is replacing your Pandas import with Modin's Pandas module. Everything else stays the same:

# Before (standard Pandas)
import pandas as pd

# After (Modin - parallelized Pandas)
import modin.pandas as pd

That's it. All your existing Pandas code, including read_csv(), fillna(), groupby(), and merge(), works with this single import change.

Example 1: Speeding Up DataFrame Concatenation

This example demonstrates the performance difference when repeatedly concatenating a DataFrame:

import time
import pandas as pd
import modin.pandas as mpd

# Sample data
data = {
'Name': ['Tom', 'Nick', 'Krish', 'Jack', 'Ash', 'Singh', 'Shilpa', 'Nav'],
'Age': [20, 21, 19, 18, 6, 12, 18, 20]
}

# --- Pandas ---
df = pd.DataFrame(data)
start = time.time()

frames = [df] * 1024 # Create 1024 copies
result_pd = pd.concat(frames, ignore_index=True)

pandas_time = time.time() - start
print(f"Pandas concat time: {pandas_time:.4f} seconds")
print(f"Result shape: {result_pd.shape}")

# --- Modin ---
modin_df = mpd.DataFrame(data)
start = time.time()

frames_modin = [modin_df] * 1024
result_modin = mpd.concat(frames_modin, ignore_index=True)

modin_time = time.time() - start
print(f"Modin concat time: {modin_time:.4f} seconds")
print(f"Speedup: {pandas_time / modin_time:.1f}x")

Output (approximate - varies by hardware):

Pandas concat time: 0.6829 seconds
Result shape: (8192, 2)
Modin concat time: 0.0277 seconds
Speedup: 24.7x

Info: The append() method was deprecated in Pandas 1.4 and removed in Pandas 2.0. Use pd.concat() instead, which works identically in both Pandas and Modin.
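A minimal before/after for that migration (plain Pandas shown; the same pd.concat() call works unchanged after switching the import to Modin):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})
extra = pd.DataFrame({"A": [3]})

# Old style, no longer available in Pandas 2.0+:
# df = df.append(extra, ignore_index=True)

# Current style, identical in Pandas and Modin:
df = pd.concat([df, extra], ignore_index=True)
print(df["A"].tolist())  # [1, 2, 3]
```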

Example 2: Speeding Up fillna() on a Large Dataset

Operations that scan the entire DataFrame benefit significantly from parallelization. Here, fillna() replaces all NaN values across a large CSV file:

import time
import pandas as pd
import modin.pandas as mpd

# --- Pandas ---
df = pd.read_csv("large_dataset.csv") # ~600 MB file
start = time.time()
df = df.fillna(value=0)
pandas_time = time.time() - start
print(f"Pandas fillna: {pandas_time:.2f} seconds")

# --- Modin ---
modin_df = mpd.read_csv("large_dataset.csv")
start = time.time()
modin_df = modin_df.fillna(value=0)
modin_time = time.time() - start
print(f"Modin fillna: {modin_time:.2f} seconds")
print(f"Speedup: {pandas_time / modin_time:.1f}x")

Output (approximate):

Pandas fillna: 1.20 seconds
Modin fillna: 0.27 seconds
Speedup: 4.4x

Example 3: Speeding Up read_csv()

Reading large CSV files is often the first bottleneck. Modin parallelizes the file reading process itself:

import time
import pandas as pd
import modin.pandas as mpd

# --- Pandas ---
start = time.time()
df = pd.read_csv("large_dataset.csv")
print(f"Pandas read_csv: {time.time() - start:.2f} seconds")

# --- Modin ---
start = time.time()
modin_df = mpd.read_csv("large_dataset.csv")
print(f"Modin read_csv: {time.time() - start:.2f} seconds")

Configuring Modin

Limiting CPU Usage

By default, Modin uses all available cores. To limit it (for example, to leave cores free for other processes), set the MODIN_CPUS environment variable:

import os
os.environ["MODIN_CPUS"] = "4" # Use only 4 cores

import modin.pandas as pd

Warning: The MODIN_CPUS environment variable must be set before importing Modin. Setting it afterward has no effect.

Choosing the Backend

Modin supports two execution backends: Ray and Dask. Set the backend via the MODIN_ENGINE environment variable, again before importing Modin:

import os
os.environ["MODIN_ENGINE"] = "ray" # or "dask"

import modin.pandas as pd

Ray is the default and generally recommended backend. Dask may be preferred if you're already using the Dask ecosystem.
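If you prefer configuring in code rather than via environment variables, Modin also exposes a configuration API. A sketch, assuming Modin's modin.config module with its Engine and CpuCount options:

```python
import modin.config as cfg

cfg.Engine.put("ray")  # or "dask"
cfg.CpuCount.put(4)    # limit Modin to 4 cores

import modin.pandas as pd
```

As with the environment variables, this must run before the first Modin operation executes.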

When Modin Helps Most

Modin provides the biggest speedups in specific scenarios:

| Scenario | Expected Speedup | Why |
| --- | --- | --- |
| Large CSV file reading | 2-5x | File is read in parallel chunks |
| Element-wise operations (fillna, apply, replace) | 3-10x | Work is distributed across cores |
| Concatenation of many DataFrames | 10-25x | Partitions are combined in parallel |
| GroupBy and aggregation | 2-5x | Groups are processed concurrently |
| Small DataFrames (< 1 MB) | None or slower | Parallelization overhead exceeds benefit |
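To check which rows of this table your own workloads fall into, a small standard-library timing helper is enough. This is an illustrative sketch; replace the demo computation with your own Pandas and Modin calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print how long the enclosed block of code took to run."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.4f} seconds")

# Demo on a cheap stand-in; swap in e.g. pd.read_csv(...) vs
# modin.pandas.read_csv(...) on your own data to compare backends.
with timer("sum of a million ints"):
    total = sum(range(1_000_000))
```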

When Modin May Not Help

Modin is not always faster. In some cases, the overhead of distributing work outweighs the benefits:

import modin.pandas as pd

# Small DataFrames - Modin overhead makes it slower than Pandas
small_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = small_df.sum() # Parallelization overhead > computation time

Tip: Modin shines with large datasets (hundreds of MBs to GBs) and operations that touch many cells. For small DataFrames or quick one-off calculations, standard Pandas is often faster due to lower overhead.

Falling Back to Pandas Automatically

If Modin encounters an operation it hasn't implemented yet, it automatically falls back to Pandas and displays a warning:

UserWarning: ... defaulting to pandas implementation.

This means your code won't break: it will simply run at Pandas speed for that specific operation. Over time, Modin's coverage of the Pandas API continues to improve.
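One way to make these fallbacks visible (for example, in a test suite) is to escalate the warning to an error with the standard warnings module. The fake_modin_op below is a hypothetical stand-in for a Modin call that falls back, not a real Modin API:

```python
import warnings

def run_strict(func):
    """Call func, turning 'defaulting to pandas' UserWarnings into errors."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "error", message=".*defaulting to pandas.*", category=UserWarning
        )
        return func()

# Hypothetical stand-in for a Modin operation that falls back to Pandas:
def fake_modin_op():
    warnings.warn("... defaulting to pandas implementation.", UserWarning)
    return 42

try:
    run_strict(fake_modin_op)
except UserWarning as exc:
    print(f"Caught fallback: {exc}")
```

Outside the strict context the warning is merely printed, so production code keeps running at Pandas speed for that operation.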

Complete Migration Example

Here is a typical data processing script converted from Pandas to Modin:

# Before: Standard Pandas
# import pandas as pd

# After: Modin (one-line change)
import modin.pandas as pd

# Everything else stays exactly the same
df = pd.read_csv("sales_data.csv")

# Clean the data
df = df.fillna(0)
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Amount'] > 0]

# Aggregate
summary = df.groupby('Region')['Amount'].agg(['sum', 'mean', 'count'])
summary = summary.sort_values('sum', ascending=False)

print(summary.head(10))

# Save results
summary.to_csv("sales_summary.csv")

No other changes are needed. The same code runs on all available cores.

Quick Reference

| Task | Code |
| --- | --- |
| Install Modin | pip install "modin[ray]" |
| Switch from Pandas to Modin | import modin.pandas as pd |
| Limit CPU cores | os.environ["MODIN_CPUS"] = "4" |
| Choose backend | os.environ["MODIN_ENGINE"] = "ray" |
| Convert Modin DF to Pandas DF | pandas_df = modin_df._to_pandas() |

Modin offers one of the simplest ways to accelerate Pandas workflows. By changing a single import line, you can leverage all CPU cores on your machine and achieve significant speedups on large datasets, without learning a new API or rewriting any of your existing code.