Python Pandas: How to Load a Massive File as Small Chunks in Pandas
When working with datasets that are too large to fit into memory, loading the entire file at once will either crash your program or drastically slow down your system. Pandas provides a built-in solution: the chunksize parameter, which lets you process large files in small, memory-friendly pieces.
In this guide, we'll show you how to load and process massive files in chunks, combine results, and optimize memory usage.
Quick Answer: Use chunksize in read_csv()
import pandas as pd
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    print(f"Chunk shape: {chunk.shape}")
    # Process each chunk here
This reads the file 10,000 rows at a time instead of loading all rows into memory simultaneously.
Why Use Chunking?
Consider a CSV file with 10 million rows where each row is about 1 KB:
| Approach | Memory Used | Result |
|---|---|---|
| pd.read_csv('file.csv') | ~10 GB at once | Likely crashes or freezes |
| pd.read_csv('file.csv', chunksize=50000) | ~50 MB at a time | Processes smoothly |
Chunking is essential when:
- Your file is larger than available RAM.
- You need to transform and save data without holding everything in memory.
- You want to aggregate statistics from a massive dataset.
- You're processing data in a memory-constrained environment (e.g., small servers, containers).
How Chunking Works
When you pass chunksize to read_csv(), Pandas returns a TextFileReader iterator instead of a DataFrame. Each iteration yields a DataFrame containing the specified number of rows:
import pandas as pd
reader = pd.read_csv('large_file.csv', chunksize=10000)
print(type(reader)) # <class 'pandas.io.parsers.readers.TextFileReader'>
# Each iteration gives a DataFrame
chunk = next(reader)
print(type(chunk)) # <class 'pandas.core.frame.DataFrame'>
print(chunk.shape) # (10000, num_columns)
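Beyond plain iteration, the reader can also hand out chunks on demand via its get_chunk() method, and it works as a context manager that closes the underlying file handle when you're done. A minimal sketch, using a small in-memory CSV as a stand-in for a large file:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file (20 data rows)
csv_data = io.StringIO("id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(20)))

# As a context manager, the reader closes its handle automatically
with pd.read_csv(csv_data, chunksize=10) as reader:
    first = reader.get_chunk()   # next chunk of the default size (10 rows)
    small = reader.get_chunk(3)  # or request a specific number of rows
    rest = reader.get_chunk()    # only 7 rows remain, so that's what you get

print(first.shape, small.shape, rest.shape)  # (10, 2) (3, 2) (7, 2)
```

When fewer rows remain than requested, get_chunk() returns whatever is left; once the file is exhausted, it raises StopIteration.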
Common Use Cases
1. Aggregating Statistics from a Large File
Compute the total, count, and average without loading the entire file:
import pandas as pd
total_sum = 0
total_count = 0
for chunk in pd.read_csv('sales_data.csv', chunksize=50000):
    total_sum += chunk['amount'].sum()
    total_count += len(chunk)
average = total_sum / total_count
print(f"Total records: {total_count:,}")
print(f"Total amount: ${total_sum:,.2f}")
print(f"Average amount: ${average:,.2f}")
2. Filtering and Saving Relevant Data
Extract only the rows you need from a massive file:
import pandas as pd
# Filter and save rows where amount > 1000
first_chunk = True
for chunk in pd.read_csv('transactions.csv', chunksize=25000):
    filtered = chunk[chunk['amount'] > 1000]
    if not filtered.empty:
        filtered.to_csv(
            'high_value_transactions.csv',
            mode='a',
            index=False,
            header=first_chunk  # Write header only for the first chunk
        )
        first_chunk = False  # Flip only after a header has actually been written
print("Filtered file created successfully.")
Setting header=first_chunk writes the column names only once, at the top of the output file; subsequent chunks append rows without repeating the header. Note that mode='a' appends to any existing file, so delete a leftover output file from a previous run before starting.
3. Combining All Chunks into One DataFrame
If the filtered or processed data is small enough to fit in memory, collect all chunks and concatenate:
import pandas as pd
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=50000):
    # Process each chunk (e.g., filter, transform)
    processed = chunk[chunk['status'] == 'completed']
    chunks.append(processed)
# Combine all processed chunks
result = pd.concat(chunks, ignore_index=True)
print(f"Result shape: {result.shape}")
Only use pd.concat() if the combined result fits in memory. If you're filtering out most of the data, this works well. If you're keeping most rows, stick with chunk-by-chunk processing.
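If you're unsure whether the filtered result will fit, one defensive pattern, sketched here with a toy in-memory CSV and an arbitrary byte budget, is to track the accumulated size as you go and stop before concatenating:

```python
import io
import pandas as pd

# Toy stand-in for a large CSV: 30 rows, every third row 'completed'
csv_data = io.StringIO(
    "status,amount\n" + "\n".join(
        ("completed" if i % 3 == 0 else "pending") + f",{i}" for i in range(30)
    )
)

kept = []
kept_bytes = 0
budget = 500 * 1024**2  # arbitrary in-memory budget (~500 MB)

for chunk in pd.read_csv(csv_data, chunksize=10):
    filtered = chunk[chunk['status'] == 'completed']
    kept_bytes += filtered.memory_usage(deep=True).sum()
    if kept_bytes > budget:
        raise MemoryError("filtered data no longer fits the in-memory budget")
    kept.append(filtered)

result = pd.concat(kept, ignore_index=True)
print(result.shape)  # only the 'completed' rows survive
```

memory_usage(deep=True) accounts for the actual size of string contents, so the running total is a reasonable estimate of what pd.concat() will have to hold.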
4. Computing Value Counts Across Chunks
import pandas as pd
from collections import Counter
category_counts = Counter()
for chunk in pd.read_csv('products.csv', chunksize=20000):
    chunk_counts = chunk['category'].value_counts().to_dict()
    category_counts.update(chunk_counts)
# Print the ten most common categories
for category, count in category_counts.most_common(10):
    print(f"  {category}: {count:,}")
5. Inspecting a Large File Without Loading It
Quickly check column names and data types without loading any data rows:
import pandas as pd
# Read only the header (0 data rows)
header = pd.read_csv('large_file.csv', nrows=0)
print(f"Columns: {header.columns.tolist()}")
# Read a small sample to check data types
sample = pd.read_csv('large_file.csv', nrows=5)
print(f"\nData types:\n{sample.dtypes}")
print(f"\nSample:\n{sample}")
Using a Generator for Reusable Chunk Processing
Wrap chunk reading in a generator function for cleaner, reusable code:
import pandas as pd
def read_in_chunks(filepath, chunk_size=10000, **kwargs):
    """Generator that yields DataFrame chunks from a CSV file."""
    for chunk in pd.read_csv(filepath, chunksize=chunk_size, **kwargs):
        yield chunk

def process_large_file(filepath):
    """Example: compute statistics from a large file."""
    total_rows = 0
    column_sums = None
    for chunk in read_in_chunks(filepath, chunk_size=50000):
        total_rows += len(chunk)
        numeric_chunk = chunk.select_dtypes(include='number')
        if column_sums is None:
            column_sums = numeric_chunk.sum()
        else:
            column_sums += numeric_chunk.sum()
    print(f"Total rows: {total_rows:,}")
    print(f"\nColumn sums:\n{column_sums}")
    print(f"\nColumn means:\n{column_sums / total_rows}")

process_large_file('large_file.csv')
Optimizing Memory Further
Specify Data Types
By default, Pandas infers column types, which can be memory-heavy (64-bit numbers, Python-object strings). Specify types explicitly to reduce memory usage per chunk:
import pandas as pd
dtypes = {
    'id': 'int32',
    'name': 'category',  # Much smaller than 'object' for repeated strings
    'amount': 'float32',
    'status': 'category',
}
for chunk in pd.read_csv('data.csv', chunksize=50000, dtype=dtypes):
    print(f"Memory usage: {chunk.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    # Process chunk
Load Only Needed Columns
import pandas as pd
# Only load 2 columns instead of all 20
for chunk in pd.read_csv('data.csv', chunksize=50000, usecols=['id', 'amount']):
    print(chunk.shape)  # (50000, 2) instead of (50000, 20)
Combine Multiple Optimizations
import pandas as pd
for chunk in pd.read_csv(
    'massive_file.csv',
    chunksize=100000,
    usecols=['user_id', 'action', 'timestamp', 'value'],
    dtype={'user_id': 'int32', 'action': 'category', 'value': 'float32'},
    parse_dates=['timestamp'],
):
    # Highly optimized chunk processing
    daily = chunk.groupby(chunk['timestamp'].dt.date)['value'].sum()
    print(daily.head())
Works with Other File Formats Too
Chunking isn't limited to CSV files:
import pandas as pd
# JSON Lines format
for chunk in pd.read_json('data.jsonl', lines=True, chunksize=10000):
    print(chunk.shape)

# Fixed-width format
for chunk in pd.read_fwf('data.txt', chunksize=10000):
    print(chunk.shape)
For Excel files, read_excel() does not support chunksize. read_hdf() does accept chunksize for table-format HDF5 stores. For Parquet, read_parquet() has no chunksize parameter, so stream batches through the underlying engine (e.g., PyArrow) instead.
Quick Reference
| Task | Code |
|---|---|
| Read in chunks | pd.read_csv('file.csv', chunksize=10000) |
| Inspect columns only | pd.read_csv('file.csv', nrows=0).columns |
| Load specific columns | pd.read_csv('file.csv', usecols=['col1', 'col2']) |
| Reduce memory with types | pd.read_csv('file.csv', dtype={'col': 'int32'}) |
| Combine processed chunks | pd.concat(chunks, ignore_index=True) |
| Append to output file | chunk.to_csv('out.csv', mode='a', header=False) |
Conclusion
Loading massive files in Pandas is straightforward with the chunksize parameter in read_csv().
- Instead of loading millions of rows at once, you process manageable pieces: computing aggregations, filtering data, or writing transformed output incrementally.
- Combine chunking with usecols to load only needed columns and dtype specifications to minimize memory per chunk.
- For reusable code, wrap chunk reading in a generator function.
This approach lets you work with files of virtually any size on machines with limited memory.