Python Pandas: How to Split a DataFrame into Chunks

When working with large Pandas DataFrames, it's often necessary to split them into smaller, more manageable chunks. This can be for batch processing, distributing work, or simply for easier inspection. Pandas, often in conjunction with NumPy, provides several effective ways to divide a DataFrame into multiple smaller DataFrames.

This guide explains how to split a Pandas DataFrame into a specific number of chunks or into chunks of a specific number of rows, using methods like numpy.array_split and DataFrame slicing.

Why Split a DataFrame?

Memory Management: Processing very large DataFrames can consume significant memory. Splitting allows you to process data in smaller, memory-friendly pieces.
Batch Processing: Many operations (e.g., writing to a database, making API calls) are more efficient or required to be done in batches.
Parallel Processing: You can distribute chunks to different processes or threads for parallel computation (though libraries like Dask are often better suited for large-scale parallelism).
Sampling/Subsetting: Creating smaller representative samples or subsets for testing or focused analysis.
Iteration: When you need to perform an operation on sequential blocks of rows.

Example DataFrame: We'll use the following DataFrame for our examples:

import pandas as pd
import numpy as np # For array_split

data = {
    'id': range(1, 11), # 10 rows
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.random.randint(10, 100, 10) * 1.99
}
df = pd.DataFrame(data)
print("Original DataFrame (first 5 rows):")
print(df.head())

Output (example, prices will vary):

Original DataFrame (first 5 rows):
   id product_name category   price
0   1    Product A     Elec   47.76
1   2    Product B     Book   89.55
2   3    Product C     Home  183.08
3   4    Product D     Elec   71.64
4   5    Product E     Book   61.69

Method 1: Splitting into N Equal(ish) Chunks using `numpy.array_split()` (Recommended)

The numpy.array_split(ary, indices_or_sections) function splits an array into a given number of nearly equal sub-arrays. It also accepts a Pandas DataFrame, in which case it splits along the rows and returns sub-DataFrames.

Installation

Ensure you have Pandas and NumPy installed:

pip install pandas numpy
# Or 
pip3 install pandas numpy

How It Works

ary: The array or DataFrame to be split.
indices_or_sections:
- If an integer N, the array will be divided into N sub-arrays. If the number of rows doesn't divide evenly by N, the first few sub-arrays each get one extra row.
- If a 1-D array of sorted integers, these integers indicate the points at which the array is split.
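
The second form (explicit split points) can be sketched as follows. To keep the example independent of how a given Pandas version interacts with NumPy, this sketch splits an array of row positions and then selects the rows with iloc. The split points [3, 7] and the small one-column frame are illustrative assumptions, not part of the example DataFrame above.

```python
import numpy as np
import pandas as pd

# Illustrative 10-row frame (assumption: a single "id" column is enough here).
df = pd.DataFrame({"id": range(1, 11)})

# Split points 3 and 7 cut the row positions into [0:3], [3:7] and [7:10].
split_points = [3, 7]
position_groups = np.array_split(np.arange(len(df)), split_points)

# Select each group of row positions from the DataFrame.
chunks = [df.iloc[positions] for positions in position_groups]

for chunk in chunks:
    print(chunk["id"].tolist())
# [1, 2, 3]
# [4, 5, 6, 7]
# [8, 9, 10]
```

With split points, the chunk sizes are whatever the gaps between the points dictate, so they do not have to be near-equal.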

np.array_split() returns a list of DataFrames (when a DataFrame is passed in).

Example Usage

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})

# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)

print(f"Splitting into {num_chunks} chunks using np.array_split():")
for i, chunk_df in enumerate(list_of_dfs):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)

Output (prices will vary):

Splitting into 3 chunks using np.array_split():

--- Chunk 1 (shape: (4, 4)) ---
   id product_name category  price
0   1    Product A     Elec  27.08
1   2    Product B     Book   5.67
2   3    Product C     Home  29.66
3   4    Product D     Elec  90.35

--- Chunk 2 (shape: (3, 4)) ---
   id product_name category  price
4   5    Product E     Book  75.51
5   6    Product F     Home  10.64
6   7    Product G     Elec  80.14

--- Chunk 3 (shape: (3, 4)) ---
   id product_name category  price
7   8    Product H     Book  55.35
8   9    Product I     Home  10.87
9  10    Product J     Elec  60.72

Note: When the number of rows (10) is not evenly divisible by num_chunks (3), np.array_split distributes the rows as evenly as possible. The first len(df) % num_chunks chunks each get one extra row, which is why the sizes here are 4, 3, and 3.
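
The distribution rule can be checked by hand with divmod. The sketch below computes the chunk sizes and then builds the chunks with positional slicing; the helper name array_split_sizes is hypothetical and is not a NumPy or Pandas function. Because it never passes the DataFrame to NumPy, it may also be a useful fallback if your installed Pandas version emits a deprecation warning for np.array_split(df, N).

```python
import pandas as pd

def array_split_sizes(n_rows, n_chunks):
    """Return the chunk sizes of a near-equal split into n_chunks pieces."""
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks each get one additional row.
    return [base + 1] * extra + [base] * (n_chunks - extra)

df = pd.DataFrame({"id": range(1, 11)})
sizes = array_split_sizes(len(df), 3)
print(sizes)  # [4, 3, 3]

# Build the chunks by positional slicing with the computed sizes.
chunks = []
start = 0
for size in sizes:
    chunks.append(df.iloc[start:start + size])
    start += size

print([chunk["id"].tolist() for chunk in chunks])
# [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
```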

Accessing Individual Chunks

Since np.array_split returns a list of DataFrames, you can access them by index:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})

# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)

# accessing individual chunks
first_chunk = list_of_dfs[0]
second_chunk = list_of_dfs[1]

print("--- First Chunk ---")
print(first_chunk)

Output (prices will vary):

--- First Chunk ---
   id product_name category  price
0   1    Product A     Elec  27.08
1   2    Product B     Book   5.67
2   3    Product C     Home  29.66
3   4    Product D     Elec  90.35
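
Each chunk keeps the index labels of the rows it came from, as the outputs above show (the second chunk starts at label 4, not 0). A minimal sketch of renumbering a chunk and of reassembling the chunks follows; it assumes a simplified one-column frame and splits row positions rather than the DataFrame itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": range(1, 11)})

# Positional split into 3 near-equal chunks (sizes 4, 3, 3).
chunks = [df.iloc[pos] for pos in np.array_split(np.arange(len(df)), 3)]

second_chunk = chunks[1]
print(second_chunk.index.tolist())  # [4, 5, 6]: original labels are kept

# Renumber from 0 if the chunk should stand on its own.
second_chunk_renumbered = second_chunk.reset_index(drop=True)
print(second_chunk_renumbered.index.tolist())  # [0, 1, 2]

# Concatenating the chunks in order restores the original rows.
restored = pd.concat(chunks)
print(restored.equals(df))  # True
```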

Method 2: Splitting Every N Rows (Creating Chunks of a Specific Size)

This approach splits the DataFrame into chunks where each chunk has a maximum of N rows (the last chunk might have fewer).

Using a `for` Loop and Slicing

You can iterate through the DataFrame's length with a step size and use standard DataFrame slicing.

import pandas as pd
import numpy as np
import math # For math.ceil

df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})

def split_df_every_n_rows_loop(dataframe, chunk_size):
    """Splits a DataFrame into chunks of `chunk_size` rows using a loop."""
    list_of_chunks = []
    # Round up so that a final, partially filled chunk is included
    total_chunks = math.ceil(len(dataframe) / chunk_size)

    for i in range(total_chunks):
        start_index = i * chunk_size
        end_index = start_index + chunk_size
        # Slicing past the end is safe: the last chunk holds the remaining rows
        list_of_chunks.append(dataframe[start_index:end_index])
    return list_of_chunks

chunk_size = 3 # Split into chunks of 3 rows
chunks_by_size_loop = split_df_every_n_rows_loop(df, chunk_size)

print(f"\nSplitting into chunks of size {chunk_size} (loop):")
for i, chunk_df in enumerate(chunks_by_size_loop):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)

Output (prices will vary):

Splitting into chunks of size 3 (loop):

--- Chunk 1 (shape: (3, 4)) ---
   id product_name category  price
0   1    Product A     Elec  92.57
1   2    Product B     Book  71.40
2   3    Product C     Home  76.34

--- Chunk 2 (shape: (3, 4)) ---
   id product_name category  price
3   4    Product D     Elec  85.83
4   5    Product E     Book  10.12
5   6    Product F     Home  98.13

--- Chunk 3 (shape: (3, 4)) ---
   id product_name category  price
6   7    Product G     Elec  13.58
7   8    Product H     Book  82.77
8   9    Product I     Home  83.03

--- Chunk 4 (shape: (1, 4)) ---
   id product_name category  price
9  10    Product J     Elec  12.24

dataframe[start_index:end_index]: Standard Python/Pandas slicing. It gracefully handles cases where end_index exceeds the DataFrame length.

Using List Comprehension and Slicing (Concise)

A list comprehension can make the previous approach more compact.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})

def split_df_every_n_rows_comp(dataframe, chunk_size):
    """Splits a DataFrame into chunks of `chunk_size` rows using list comprehension."""
    return [dataframe[i:i + chunk_size] for i in range(0, len(dataframe), chunk_size)]

chunk_size = 4 # Split into chunks of 4 rows
chunks_by_size_comp = split_df_every_n_rows_comp(df, chunk_size)

print(f"\nSplitting into chunks of size {chunk_size} (list comprehension):")
for i, chunk_df in enumerate(chunks_by_size_comp):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)

Example output structure for chunk_size = 4 on 10 rows:

Splitting into chunks of size 4 (list comprehension):

--- Chunk 1 (shape: (4, 4)) ---
   id product_name category  price
0   1    Product A     Elec  68.42
1   2    Product B     Book  96.00
2   3    Product C     Home  56.66
3   4    Product D     Elec  48.59

--- Chunk 2 (shape: (4, 4)) ---
   id product_name category  price
4   5    Product E     Book   8.06
5   6    Product F     Home  98.00
6   7    Product G     Elec  49.23
7   8    Product H     Book  94.92

--- Chunk 3 (shape: (2, 4)) ---
   id product_name category  price
8   9    Product I     Home  29.33
9  10    Product J     Elec  95.62

range(0, len(dataframe), chunk_size): Generates start indices for each chunk.
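
When the chunks are consumed one at a time (for example, in a batch loop), the same range-based start indices can drive a generator instead of a list. This is a sketch under that assumption; the function name iter_chunks is hypothetical, and the explicit check on chunk_size is an added safeguard rather than something the examples above require.

```python
import pandas as pd

def iter_chunks(dataframe, chunk_size):
    """Yield successive chunks of at most `chunk_size` rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(dataframe), chunk_size):
        yield dataframe.iloc[start:start + chunk_size]

df = pd.DataFrame({"id": range(1, 11)})

for chunk in iter_chunks(df, 4):
    print(chunk["id"].tolist())
# [1, 2, 3, 4]
# [5, 6, 7, 8]
# [9, 10]
```

Because the function is a generator, the chunk_size check runs when iteration starts, not when iter_chunks is called.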

Note on `DataFrame.iloc` for Slicing

When you slice rows with integer bounds (df[start:end]), Pandas slices by position, much like df.iloc[start:end], so .iloc is not strictly required for chunking. Use .iloc when you want the positional intent to be explicit or when you combine row and column positional selection. For plain row slicing by position, df[start:end] is common and effective.
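
A small check of this behavior, assuming a toy frame whose integer index labels deliberately differ from the row positions:

```python
import pandas as pd

# The index labels (10, 20, 30, 40) differ from the row positions (0 to 3).
df = pd.DataFrame({"id": [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_plain_slice = df[1:3]   # integer slice: rows at positions 1 and 2
by_iloc = df.iloc[1:3]     # same rows, explicitly positional

print(by_plain_slice.index.tolist())   # [20, 30]
print(by_plain_slice.equals(by_iloc))  # True
```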

Choosing the Right Method

To split into a specific number (N) of roughly equal chunks: Use numpy.array_split(df, N). This is generally the easiest and most robust way for this scenario. It handles uneven divisions well.
To split into chunks of a specific maximum row size (chunk_size): Use the list comprehension approach ([df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]) or the equivalent for loop with slicing. This gives you control over the maximum size of each chunk.
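
The two approaches are not interchangeable, even when the chunk size is derived from the same target count. The sketch below compares them on a simplified 10-row frame; deriving chunk_size with math.ceil is an illustrative assumption, not a rule from either library.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": range(1, 11)})
n_chunks = 3

# Split by count: exactly n_chunks pieces, sizes differ by at most one row.
positions = np.array_split(np.arange(len(df)), n_chunks)
by_count = [df.iloc[pos] for pos in positions]
print([len(c) for c in by_count])  # [4, 3, 3]

# Split by size: a fixed maximum size derived from the same target count.
chunk_size = math.ceil(len(df) / n_chunks)  # 4
by_size = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
print([len(c) for c in by_size])  # [4, 4, 2]
```

Splitting by size fills every chunk except possibly the last, so the remainder lands in one short final chunk and the number of chunks is not guaranteed to equal N.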

Conclusion

Splitting a Pandas DataFrame into smaller chunks is a practical technique for managing large datasets or performing batch operations.

numpy.array_split(df, N) is excellent for dividing a DataFrame into N approximately equal parts.
Slicing within a loop or list comprehension (e.g., df[i:i + chunk_size]) is ideal when you need chunks of a specific maximum row count.

Both methods return a list of DataFrames, which you can then iterate over or access individually to perform your desired operations on each chunk.
