Python Pandas: How to Create a Pandas DataFrame from a Generator

Generators are a powerful Python feature that produce data on demand rather than storing everything in memory at once. When working with large datasets, streaming APIs, or log files, generators let you process millions of records without exhausting your system's memory. Pandas can consume generators directly, making it straightforward to build DataFrames from data sources that would be impractical to load entirely into a list.

In this guide, you will learn how to create DataFrames from generator functions and expressions, process large files efficiently, and avoid common mistakes that negate the memory benefits of generators.

Creating a DataFrame from a Generator Function

Pass a generator directly to the pd.DataFrame() constructor. When the generator yields dictionaries, Pandas automatically uses the keys as column names:

import pandas as pd

def student_generator():
    """Yield student records one at a time."""
    yield {'Name': 'Alice', 'Score': 85}
    yield {'Name': 'Bob', 'Score': 92}
    yield {'Name': 'Charlie', 'Score': 78}

df = pd.DataFrame(student_generator())

print(df)

Output:

      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

Each yield statement produces one row. Pandas collects all the yielded values and assembles them into a DataFrame.

Using Generator Expressions

For simple transformations, generator expressions provide a concise syntax. Use parentheses instead of square brackets to create a generator rather than a list:

import pandas as pd

# Generator expression (memory efficient, uses parentheses)
data_gen = ({'Num': x, 'Square': x**2} for x in range(5))

df = pd.DataFrame(data_gen)

print(df)

Output:

   Num  Square
0    0       0
1    1       1
2    2       4
3    3       9
4    4      16

The key difference from a list comprehension is that the generator does not create all dictionaries upfront. Each one is produced only when Pandas requests the next value.
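You can see this difference directly by comparing the size of the iterable objects themselves. Note that sys.getsizeof reports only the container object, not the rows it will eventually produce, so the generator stays tiny no matter how many records it represents:

```python
import sys

# A list comprehension materializes every dictionary upfront...
as_list = [{'Num': x, 'Square': x**2} for x in range(100_000)]

# ...while a generator expression is a small object of constant size.
as_gen = ({'Num': x, 'Square': x**2} for x in range(100_000))

print(sys.getsizeof(as_list))  # hundreds of kilobytes for the list alone
print(sys.getsizeof(as_gen))   # a few hundred bytes at most
```

The list's size here covers only its internal pointer array; the dictionaries it references consume even more memory on top of that.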

Yielding Tuples with Explicit Column Names

When yielding tuples instead of dictionaries, specify column names using the columns parameter:

import pandas as pd

def number_gen():
    for i in range(3):
        yield (i, i * 10, i * 100)

df = pd.DataFrame(number_gen(), columns=['A', 'B', 'C'])

print(df)

Output:

   A   B    C
0  0   0    0
1  1  10  100
2  2  20  200

Dictionaries are generally preferred because they make the column mapping explicit within the generator itself, but tuples can be more efficient when processing speed matters.
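A middle ground worth knowing about is yielding namedtuples: Pandas can read the field names from a namedtuple, so you get tuple-like construction without passing columns explicitly. A brief sketch (the Row type here is an illustrative name, not part of any library):

```python
from collections import namedtuple
import pandas as pd

Row = namedtuple('Row', ['A', 'B', 'C'])

def number_gen():
    for i in range(3):
        yield Row(i, i * 10, i * 100)

# Pandas picks up the column names from the namedtuple's fields
df = pd.DataFrame(number_gen())
print(df.columns.tolist())  # ['A', 'B', 'C']
```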

Processing Large Files Line by Line

One of the most practical uses of generators is reading large files without loading the entire contents into memory. The file is processed one line at a time:

import pandas as pd

def parse_log_file(filepath):
    """Parse log entries one at a time."""
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) >= 3:
                yield {
                    'timestamp': parts[0],
                    'level': parts[1],
                    'message': parts[2]
                }

df = pd.DataFrame(parse_log_file('app.log'))

This approach works well for CSV-like files, server logs, and any text-based data source where each line represents a record. The file handle is opened, lines are yielded one by one, and the file is closed automatically when the generator is exhausted.
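To try the pattern end to end without needing an existing app.log, the sketch below writes a small sample file to a temporary location first. The log contents are invented for illustration; note how the length check silently skips a malformed line:

```python
import os
import tempfile
import pandas as pd

def parse_log_file(filepath):
    """Parse comma-separated log entries one at a time."""
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) >= 3:
                yield {
                    'timestamp': parts[0],
                    'level': parts[1],
                    'message': parts[2]
                }

# Write a small sample log so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as f:
    f.write('2024-01-01T10:00:00,INFO,service started\n')
    f.write('2024-01-01T10:00:05,ERROR,disk full\n')
    f.write('malformed line\n')  # fewer than 3 fields, so it is skipped
    path = f.name

df = pd.DataFrame(parse_log_file(path))
os.remove(path)
print(df)  # two rows; the malformed line was filtered out
```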

Streaming Data from Paginated APIs

APIs that return data across multiple pages are a natural fit for generators. The generator handles the pagination logic while the calling code simply receives a stream of records:

import pandas as pd

def fetch_paginated_data(api_client):
    """Yield records from a paginated API endpoint."""
    page = 1
    while True:
        response = api_client.get(page=page)
        if not response['data']:
            break
        for record in response['data']:
            yield record
        page += 1

# Collects all pages into a single DataFrame
df = pd.DataFrame(fetch_paginated_data(client))

The generator fetches one page at a time, yields each record individually, and stops when an empty page signals the end of the data. This keeps only one page in memory at any point during the fetching process.
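Since the api_client above is left abstract, here is a runnable sketch with a hypothetical stand-in client that serves records in pages of two. FakeClient exists only for illustration; a real client would issue HTTP requests with the same get(page=...) shape:

```python
import pandas as pd

class FakeClient:
    """Hypothetical in-memory stand-in for a paginated API client."""
    def __init__(self, records, page_size=2):
        self._pages = [records[i:i + page_size]
                       for i in range(0, len(records), page_size)]

    def get(self, page):
        # Pages are 1-indexed; out-of-range pages return empty data
        data = self._pages[page - 1] if page <= len(self._pages) else []
        return {'data': data}

def fetch_paginated_data(api_client):
    """Yield records from a paginated API endpoint."""
    page = 1
    while True:
        response = api_client.get(page=page)
        if not response['data']:
            break
        for record in response['data']:
            yield record
        page += 1

client = FakeClient([{'id': i} for i in range(5)])
df = pd.DataFrame(fetch_paginated_data(client))
print(len(df))  # 5 records collected across 3 pages
```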

Common Mistake: Converting to a List First

A frequent error is converting the generator to a list before passing it to Pandas. This loads all the data into memory at once, completely defeating the purpose of using a generator:

import pandas as pd

def data_gen():
    for i in range(1000000):
        yield {'x': i, 'y': i * 2}

# Wrong: loads all 1 million records into a list first
df = pd.DataFrame(list(data_gen()))

# Correct: Pandas consumes the generator directly
df = pd.DataFrame(data_gen())
Warning: Wrapping a generator in list() before passing it to pd.DataFrame() forces all values into memory simultaneously. Pass the generator directly to preserve its memory efficiency. Note that Pandas does eventually hold all the data in the resulting DataFrame, but avoiding the intermediate list prevents having two copies in memory during construction.

Important: Generators Are Single-Use

A generator can only be consumed once. After Pandas has read all the values, the generator is exhausted and cannot be reused:

import pandas as pd

def simple_gen():
    yield {'A': 1}
    yield {'A': 2}

gen = simple_gen()

# First use: works fine
df1 = pd.DataFrame(gen)
print(f"First DataFrame: {len(df1)} rows")

# Second use: generator is exhausted, produces empty DataFrame
df2 = pd.DataFrame(gen)
print(f"Second DataFrame: {len(df2)} rows")

Output:

First DataFrame: 2 rows
Second DataFrame: 0 rows

If you need to create multiple DataFrames from the same data, either call the generator function again to create a new generator, or store the result in a variable after the first conversion.
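Both workarounds look like this in practice: calling the generator function again produces a fresh generator each time, and copying an already-built DataFrame avoids touching the generator at all:

```python
import pandas as pd

def simple_gen():
    yield {'A': 1}
    yield {'A': 2}

# Option 1: each call to simple_gen() returns a brand-new generator
df1 = pd.DataFrame(simple_gen())
df2 = pd.DataFrame(simple_gen())

# Option 2: reuse the first DataFrame instead of the exhausted generator
df3 = df1.copy()

print(len(df1), len(df2), len(df3))  # each has 2 rows
```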

Key Considerations

Aspect        Detail
Single-use    Generators are exhausted after one consumption
Memory        Much lower than lists during the generation phase
Row format    Yield dictionaries for automatic column names
Tuple format  Requires an explicit columns parameter
Performance   Slightly slower than lists for small datasets

Quick Reference

Pattern               Example
Generator function    def gen(): yield {'col': val}
Generator expression  ({'col': x} for x in iterable)
Create DataFrame      pd.DataFrame(gen())
With column names     pd.DataFrame(gen(), columns=['A', 'B'])

Generators are ideal for processing large files, streaming API responses, and any data source that would be impractical to load entirely into memory. Yield dictionaries to automatically define column names, and pass the generator directly to pd.DataFrame() without converting to a list first. Remember that generators are single-use, so call the generator function again if you need to create another DataFrame from the same data.