Python Pandas: How to Load a Massive File as Small Chunks in Pandas

When working with datasets that are too large to fit into memory, loading the entire file at once will either crash your program or drastically slow down your system. Pandas provides a built-in solution: the chunksize parameter, which lets you process large files in small, memory-friendly pieces.

In this guide, we'll show you how to load and process massive files in chunks, combine results, and optimize memory usage.

Quick Answer: Use chunksize in read_csv()

import pandas as pd

for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    print(f"Chunk shape: {chunk.shape}")
    # Process each chunk here

This reads the file 10,000 rows at a time instead of loading all rows into memory simultaneously.

Why Use Chunking?

Consider a CSV file with 10 million rows where each row is about 1 KB:

| Approach | Memory Used | Result |
|---|---|---|
| pd.read_csv('file.csv') | ~10 GB at once | Likely crashes or freezes |
| pd.read_csv('file.csv', chunksize=50000) | ~50 MB at a time | Processes smoothly |

Chunking is essential when:

  • Your file is larger than available RAM.
  • You need to transform and save data without holding everything in memory.
  • You want to aggregate statistics from a massive dataset.
  • You're processing data in a memory-constrained environment (e.g., small servers, containers).
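A natural follow-up question is how large each chunk should be. One heuristic: measure the per-row memory of a small sample, then size chunks against a memory budget. In this sketch, the in-memory CSV and the 100 MB budget are illustrative assumptions, not fixed recommendations:

```python
import io

import pandas as pd

# Hypothetical small dataset standing in for a large CSV on disk
csv_data = io.StringIO(
    "id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1000))
)

# Measure the approximate memory footprint per row from a small sample
sample = pd.read_csv(csv_data, nrows=100)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Pick a chunksize that targets an assumed budget of 100 MB per chunk
memory_budget = 100 * 1024 * 1024
chunk_size = max(1, int(memory_budget // bytes_per_row))
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize of about {chunk_size:,}")
```

Note that a parsed DataFrame usually occupies more memory than the raw bytes on disk, so sampling the actual in-memory footprint like this is more reliable than dividing the file size by the row count.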

How Chunking Works

When you pass chunksize to read_csv(), Pandas returns a TextFileReader iterator instead of a DataFrame. Each iteration yields a DataFrame containing the specified number of rows:

import pandas as pd

reader = pd.read_csv('large_file.csv', chunksize=10000)
print(type(reader)) # <class 'pandas.io.parsers.readers.TextFileReader'>

# Each iteration gives a DataFrame
chunk = next(reader)
print(type(chunk)) # <class 'pandas.core.frame.DataFrame'>
print(chunk.shape) # (10000, num_columns)
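The reader is also a context manager, and its get_chunk() method pulls an arbitrary number of rows on demand instead of a fixed-size chunk. A minimal, self-contained sketch (using an in-memory CSV so the example runs anywhere):

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk
buf = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(25)))

# Using the reader as a context manager ensures the underlying handle is closed
with pd.read_csv(buf, chunksize=10) as reader:
    first = reader.get_chunk()   # next chunk at the default size (10 rows)
    small = reader.get_chunk(3)  # or pull a custom number of rows
    print(first.shape)  # (10, 2)
    print(small.shape)  # (3, 2)
```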

Common Use Cases

1. Aggregating Statistics from a Large File

Compute the total, count, and average without loading the entire file:

import pandas as pd

total_sum = 0
total_count = 0

for chunk in pd.read_csv('sales_data.csv', chunksize=50000):
    total_sum += chunk['amount'].sum()
    total_count += len(chunk)

average = total_sum / total_count
print(f"Total records: {total_count:,}")
print(f"Total amount: ${total_sum:,.2f}")
print(f"Average amount: ${average:,.2f}")

2. Filtering and Saving Relevant Data

Extract only the rows you need from a massive file:

import pandas as pd

# Filter and save rows where amount > 1000
first_chunk = True

for chunk in pd.read_csv('transactions.csv', chunksize=25000):
    filtered = chunk[chunk['amount'] > 1000]

    if not filtered.empty:
        filtered.to_csv(
            'high_value_transactions.csv',
            mode='a',
            index=False,
            header=first_chunk  # Write header only for the first chunk
        )
        first_chunk = False

print("Filtered file created successfully.")

Tip: Setting header=first_chunk ensures the column names are written only once at the top of the output file. Subsequent chunks append data without repeating the header.

3. Combining All Chunks into One DataFrame

If the filtered or processed data is small enough to fit in memory, collect all chunks and concatenate:

import pandas as pd

chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=50000):
    # Process each chunk (e.g., filter, transform)
    processed = chunk[chunk['status'] == 'completed']
    chunks.append(processed)

# Combine all processed chunks
result = pd.concat(chunks, ignore_index=True)
print(f"Result shape: {result.shape}")

Caution: Only use pd.concat() if the combined result fits in memory. If you're filtering out most of the data, this works well. If you're keeping most rows, stick with chunk-by-chunk processing.

4. Computing Value Counts Across Chunks

import pandas as pd
from collections import Counter

category_counts = Counter()

for chunk in pd.read_csv('products.csv', chunksize=20000):
    chunk_counts = chunk['category'].value_counts().to_dict()
    category_counts.update(chunk_counts)

# Print the top 10 categories by count
for category, count in sorted(category_counts.items(), key=lambda x: -x[1])[:10]:
    print(f"  {category}: {count:,}")

5. Inspecting a Large File Without Loading It

Quickly check column names and data types without loading any data rows:

import pandas as pd

# Read only the header (0 data rows)
header = pd.read_csv('large_file.csv', nrows=0)
print(f"Columns: {header.columns.tolist()}")

# Read a small sample to check data types
sample = pd.read_csv('large_file.csv', nrows=5)
print(f"\nData types:\n{sample.dtypes}")
print(f"\nSample:\n{sample}")
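Along the same lines, you can get a row count without parsing the file at all: streaming it line by line in plain Python is cheaper than any pandas call. A self-contained sketch (writing a small temp file as a stand-in for a real multi-gigabyte one); note that counting physical lines overcounts if any quoted field contains embedded newlines:

```python
import csv
import tempfile

# Create a small stand-in CSV (a real file would be gigabytes)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows([i, i * 10] for i in range(5000))
    path = f.name

# Count data rows by streaming the file, subtracting 1 for the header
with open(path, "rb") as f:
    row_count = sum(1 for _ in f) - 1

print(f"Data rows: {row_count:,}")  # Data rows: 5,000
```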

Using a Generator for Reusable Chunk Processing

Wrap chunk reading in a generator function for cleaner, reusable code:

import pandas as pd


def read_in_chunks(filepath, chunk_size=10000, **kwargs):
    """Generator that yields DataFrame chunks from a CSV file."""
    for chunk in pd.read_csv(filepath, chunksize=chunk_size, **kwargs):
        yield chunk


def process_large_file(filepath):
    """Example: compute statistics from a large file."""
    total_rows = 0
    column_sums = None

    for chunk in read_in_chunks(filepath, chunk_size=50000):
        total_rows += len(chunk)

        numeric_chunk = chunk.select_dtypes(include='number')
        if column_sums is None:
            column_sums = numeric_chunk.sum()
        else:
            column_sums += numeric_chunk.sum()

    print(f"Total rows: {total_rows:,}")
    print(f"\nColumn sums:\n{column_sums}")
    print(f"\nColumn means:\n{column_sums / total_rows}")


process_large_file('large_file.csv')

Optimizing Memory Further

Specify Data Types

By default, Pandas infers column types, which often means memory-heavy defaults such as int64, float64, and object strings. Specifying types explicitly reduces memory usage per chunk:

import pandas as pd

dtypes = {
    'id': 'int32',
    'name': 'category',  # Much smaller than 'object' for repeated strings
    'amount': 'float32',
    'status': 'category',
}

for chunk in pd.read_csv('data.csv', chunksize=50000, dtype=dtypes):
    print(f"Memory usage: {chunk.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    # Process chunk

Load Only Needed Columns

import pandas as pd

# Only load 2 columns instead of all 20
for chunk in pd.read_csv('data.csv', chunksize=50000, usecols=['id', 'amount']):
    print(chunk.shape)  # (50000, 2) instead of (50000, 20)

Combine Multiple Optimizations

import pandas as pd

for chunk in pd.read_csv(
    'massive_file.csv',
    chunksize=100000,
    usecols=['user_id', 'action', 'timestamp', 'value'],
    dtype={'user_id': 'int32', 'action': 'category', 'value': 'float32'},
    parse_dates=['timestamp'],
):
    # Highly optimized chunk processing
    daily = chunk.groupby(chunk['timestamp'].dt.date)['value'].sum()
    print(daily.head())

Works with Other File Formats Too

Chunking isn't limited to CSV files:

import pandas as pd

# JSON Lines format
for chunk in pd.read_json('data.jsonl', lines=True, chunksize=10000):
print(chunk.shape)

# Fixed-width format
for chunk in pd.read_fwf('data.txt', chunksize=10000):
print(chunk.shape)

Info: For Excel files, read_excel() does not support chunksize. For Parquet or HDF5 files, use their native partitioning capabilities instead.

Quick Reference

| Task | Code |
|---|---|
| Read in chunks | pd.read_csv('file.csv', chunksize=10000) |
| Inspect columns only | pd.read_csv('file.csv', nrows=0).columns |
| Load specific columns | pd.read_csv('file.csv', usecols=['col1', 'col2']) |
| Reduce memory with types | pd.read_csv('file.csv', dtype={'col': 'int32'}) |
| Combine processed chunks | pd.concat(chunks, ignore_index=True) |
| Append to output file | chunk.to_csv('out.csv', mode='a', header=False) |

Conclusion

Loading massive files in Pandas is straightforward with the chunksize parameter in read_csv().

  • Instead of loading millions of rows at once, you process manageable pieces: computing aggregations, filtering rows, or writing transformed output incrementally.
  • Combine chunking with usecols to load only needed columns and dtype specifications to minimize memory per chunk.
  • For reusable code, wrap chunk reading in a generator function.

This approach lets you work with files of virtually any size on machines with limited memory.