Python Pandas: How to Split a DataFrame into Chunks
When working with large Pandas DataFrames, it's often necessary to split them into smaller, more manageable chunks. This can be for batch processing, distributing work, or simply for easier inspection. Pandas, often in conjunction with NumPy, provides several effective ways to divide a DataFrame into multiple smaller DataFrames.
This guide explains how to split a Pandas DataFrame into a specific number of chunks or into chunks of a specific number of rows, using methods like numpy.array_split and DataFrame slicing.
Why Split a DataFrame?
- Memory Management: Processing very large DataFrames can consume significant memory. Splitting allows you to process data in smaller, memory-friendly pieces.
- Batch Processing: Many operations (e.g., writing to a database, making API calls) are more efficient in batches, or must be done in batches.
- Parallel Processing: You can distribute chunks to different processes or threads for parallel computation (though libraries like Dask are often better suited for large-scale parallelism).
- Sampling/Subsetting: Creating smaller representative samples or subsets for testing or focused analysis.
- Iteration: When you need to perform an operation on sequential blocks of rows.
Example DataFrame: We'll use the following DataFrame for our examples:
import pandas as pd
import numpy as np # For array_split
data = {
    'id': range(1, 11),  # 10 rows
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.random.randint(10, 100, 10) * 1.99
}
df = pd.DataFrame(data)
print("Original DataFrame (first 5 rows):")
print(df.head())
Output (example, prices will vary):
Original DataFrame (first 5 rows):
id product_name category price
0 1 Product A Elec 47.76
1 2 Product B Book 89.55
2 3 Product C Home 183.08
3 4 Product D Elec 71.64
4 5 Product E Book 61.69
Method 1: Splitting into N Equal(ish) Chunks using numpy.array_split() (Recommended)
The numpy.array_split(ary, indices_or_sections) function splits a NumPy array into a specific number of nearly equal sub-arrays. It also accepts a Pandas DataFrame, in which case it splits along the rows and returns sub-DataFrames.
Installation
Ensure you have Pandas and NumPy installed:
pip install pandas numpy
# Or
pip3 install pandas numpy
How It Works
- ary: The array or DataFrame to be split.
- indices_or_sections:
  - If an integer N, the array will be divided into N sub-arrays. If the array doesn't divide evenly, the first few sub-arrays will be slightly larger.
  - If a 1-D array of sorted integers, these integers indicate the points at which the array is split.
np.array_split() returns a list of DataFrames (when a DataFrame is passed in).
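To make the two forms of indices_or_sections concrete, here is a minimal sketch on a plain NumPy array of ten integers. The array is an illustrative stand-in for the row positions of a 10-row DataFrame, and the variable names are hypothetical.

```python
import numpy as np

# Ten positions, standing in for the rows of a 10-row DataFrame.
positions = np.arange(10)

# Integer form: ask for 3 sections; the sizes come out as 4, 3, 3.
by_count = np.array_split(positions, 3)
print([part.tolist() for part in by_count])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Sequence form: split before positions 3 and 7.
by_points = np.array_split(positions, [3, 7])
print([part.tolist() for part in by_points])
# [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

The integer form controls how many pieces you get, while the sequence form controls exactly where the cuts fall.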
Example Usage
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)
print(f"Splitting into {num_chunks} chunks using np.array_split():")
for i, chunk_df in enumerate(list_of_dfs):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)
Output (prices will vary):
Splitting into 3 chunks using np.array_split():
--- Chunk 1 (shape: (4, 4)) ---
id product_name category price
0 1 Product A Elec 27.08
1 2 Product B Book 5.67
2 3 Product C Home 29.66
3 4 Product D Elec 90.35
--- Chunk 2 (shape: (3, 4)) ---
id product_name category price
4 5 Product E Book 75.51
5 6 Product F Home 10.64
6 7 Product G Elec 80.14
--- Chunk 3 (shape: (3, 4)) ---
id product_name category price
7 8 Product H Book 55.35
8 9 Product I Home 10.87
9 10 Product J Elec 60.72
When the number of rows is not evenly divisible by num_chunks, np.array_split distributes the rows as evenly as possible: the first len(df) % num_chunks chunks each get one extra row. Here 10 % 3 = 1, so only the first chunk has 4 rows and the other two have 3.
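The size rule can be checked with divmod, and the same split can be produced without handing the DataFrame itself to NumPy. The sketch below is one possible alternative, not the method used above: it splits the row positions with np.array_split and then selects the rows with .iloc. The helper name split_into_n_chunks is hypothetical. One reason to consider it is that recent Pandas 2.x releases may emit a FutureWarning mentioning DataFrame.swapaxes when np.array_split receives a DataFrame directly; splitting positions sidesteps that.

```python
import numpy as np
import pandas as pd


def split_into_n_chunks(dataframe, num_chunks):
    """Split `dataframe` into `num_chunks` row-wise pieces of near-equal size."""
    # Split the positions 0..len-1, then pick the matching rows with .iloc.
    position_groups = np.array_split(np.arange(len(dataframe)), num_chunks)
    return [dataframe.iloc[group] for group in position_groups]


df = pd.DataFrame({"id": range(1, 11)})  # 10 rows, as in the example above

# base rows per chunk, plus how many chunks receive one extra row
base, extra = divmod(len(df), 3)  # base=3, extra=1
expected_sizes = [base + 1] * extra + [base] * (3 - extra)

chunks = split_into_n_chunks(df, 3)
print(expected_sizes)            # [4, 3, 3]
print([len(c) for c in chunks])  # [4, 3, 3]
```

Each chunk keeps its original index labels, just as in the output shown above.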
Accessing Individual Chunks
Since np.array_split returns a list of DataFrames, you can access them by index:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)
# accessing individual chunks
first_chunk = list_of_dfs[0]
second_chunk = list_of_dfs[1]
print("--- First Chunk ---")
print(first_chunk)
Output (prices will vary):
--- First Chunk ---
id product_name category price
0 1 Product A Elec 27.08
1 2 Product B Book 5.67
2 3 Product C Home 29.66
3 4 Product D Elec 90.35
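Because the result is an ordinary Python list, the usual list tools apply. As a small illustration, the sketch below builds three chunks with plain .iloc slices, using hard-coded boundaries chosen to match the 4/3/3 split above and leaving out the random price column so the output is deterministic. It then reads the last chunk, counts the rows per chunk, and reassembles the pieces with pd.concat.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 11),
    "category": ["Elec", "Book", "Home", "Elec", "Book",
                 "Home", "Elec", "Book", "Home", "Elec"],
})

# Hard-coded boundaries reproducing the 4/3/3 split shown above.
list_of_dfs = [df.iloc[0:4], df.iloc[4:7], df.iloc[7:10]]

last_chunk = list_of_dfs[-1]  # negative indexing works as for any list
rows_per_chunk = [len(chunk) for chunk in list_of_dfs]
rebuilt = pd.concat(list_of_dfs)  # original row labels are kept

print(last_chunk["id"].tolist())                    # [8, 9, 10]
print(rows_per_chunk)                               # [4, 3, 3]
print(rebuilt["id"].tolist() == df["id"].tolist())  # True
```

Concatenating the chunks and comparing against the original is a cheap sanity check that no rows were dropped or duplicated by the split.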
Method 2: Splitting Every N Rows (Creating Chunks of a Specific Size)
This approach splits the DataFrame into chunks where each chunk has a maximum of N rows (the last chunk might have fewer).
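The chunk count and the size of the final chunk follow directly from the row count and N. A minimal sketch of that arithmetic for the 10-row example with N = 3 (variable names are illustrative):

```python
import math

total_rows = 10
chunk_size = 3

# Round up so a final partial chunk is counted.
num_chunks = math.ceil(total_rows / chunk_size)

# Number of full chunks and the rows left over for the last one.
full_chunks, leftover = divmod(total_rows, chunk_size)
last_chunk_rows = leftover if leftover else chunk_size

print(num_chunks, full_chunks, last_chunk_rows)  # 4 3 1
```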
Using a for Loop and Slicing
You can iterate through the DataFrame's length with a step size and use standard DataFrame slicing.
import pandas as pd
import numpy as np
import math # For math.ceil
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
def split_df_every_n_rows_loop(dataframe, chunk_size):
    """Splits a DataFrame into chunks of `chunk_size` rows using a loop."""
    list_of_chunks = []
    # Round up so that a final, partial chunk is included
    total_chunks = math.ceil(len(dataframe) / chunk_size)
    for i in range(total_chunks):
        start_index = i * chunk_size
        end_index = start_index + chunk_size
        # Slicing handles the last chunk correctly (doesn't go out of bounds)
        list_of_chunks.append(dataframe[start_index:end_index])
    return list_of_chunks
chunk_size = 3 # Split into chunks of 3 rows
chunks_by_size_loop = split_df_every_n_rows_loop(df, chunk_size)
print(f"\nSplitting into chunks of size {chunk_size} (loop):")
for i, chunk_df in enumerate(chunks_by_size_loop):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)
Output (prices will vary):
Splitting into chunks of size 3 (loop):
--- Chunk 1 (shape: (3, 4)) ---
id product_name category price
0 1 Product A Elec 92.57
1 2 Product B Book 71.40
2 3 Product C Home 76.34
--- Chunk 2 (shape: (3, 4)) ---
id product_name category price
3 4 Product D Elec 85.83
4 5 Product E Book 10.12
5 6 Product F Home 98.13
--- Chunk 3 (shape: (3, 4)) ---
id product_name category price
6 7 Product G Elec 13.58
7 8 Product H Book 82.77
8 9 Product I Home 83.03
--- Chunk 4 (shape: (1, 4)) ---
id product_name category price
9 10 Product J Elec 12.24
dataframe[start_index:end_index]: Standard Python/Pandas slicing. It gracefully handles cases where end_index exceeds the DataFrame length, so the last chunk simply contains the remaining rows.
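When the chunks are only needed one at a time, the same slicing idea can be written as a generator so that the full list of chunks is never built up front. This is a sketch under a few assumptions rather than a drop-in replacement: the name iter_chunks is hypothetical, it uses .iloc to make the positional intent explicit, and the letter index is there only to show that the labels do not matter.

```python
import pandas as pd


def iter_chunks(dataframe, chunk_size):
    """Yield consecutive slices of at most `chunk_size` rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(dataframe), chunk_size):
        # .iloc slices by position, regardless of the index labels.
        yield dataframe.iloc[start:start + chunk_size]


# A non-numeric index, to show the split is purely positional.
df = pd.DataFrame({"id": range(1, 11)}, index=list("abcdefghij"))

sizes = [len(chunk) for chunk in iter_chunks(df, 3)]
print(sizes)  # [3, 3, 3, 1]
```

Stepping range() by chunk_size removes the need to compute the number of chunks in advance; the loop ends once the start position reaches the end of the DataFrame.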