Python Pandas: How to Create a Pandas DataFrame from a Generator

Generators are a powerful Python feature that produce data on demand rather than storing everything in memory at once. When working with large datasets, streaming APIs, or log files, generators let you process millions of records without exhausting your system's memory. Pandas can consume generators directly, making it straightforward to build DataFrames from data sources that would be impractical to load entirely into a list.

In this guide, you will learn how to create DataFrames from generator functions and expressions, process large files efficiently, and avoid common mistakes that negate the memory benefits of generators.

Creating a DataFrame from a Generator Function

Pass a generator directly to the pd.DataFrame() constructor. When the generator yields dictionaries, Pandas automatically uses the keys as column names:

import pandas as pd

def student_generator():
    """Yield student records one at a time."""
    yield {'Name': 'Alice', 'Score': 85}
    yield {'Name': 'Bob', 'Score': 92}
    yield {'Name': 'Charlie', 'Score': 78}

df = pd.DataFrame(student_generator())

print(df)

Output:

      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

Each yield statement produces one row. Pandas collects all the yielded values and assembles them into a DataFrame.

Using Generator Expressions

For simple transformations, generator expressions provide a concise syntax. Use parentheses instead of square brackets to create a generator rather than a list:

import pandas as pd

# Generator expression (memory efficient, uses parentheses)
data_gen = ({'Num': x, 'Square': x**2} for x in range(5))

df = pd.DataFrame(data_gen)

print(df)

Output:

   Num  Square
0    0       0
1    1       1
2    2       4
3    3       9
4    4      16

The key difference from a list comprehension is that the generator does not create all dictionaries upfront. Each one is produced only when Pandas requests the next value.
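You can see this difference directly by comparing the size of the iterable objects themselves. Note that sys.getsizeof reports only the container object, not the rows it will eventually produce, so the generator stays tiny no matter how many records it represents:

```python
import sys

# A list comprehension materializes every dictionary upfront...
as_list = [{'Num': x, 'Square': x**2} for x in range(100_000)]

# ...while a generator expression is a small object of constant size.
as_gen = ({'Num': x, 'Square': x**2} for x in range(100_000))

print(sys.getsizeof(as_list))  # hundreds of kilobytes for the list alone
print(sys.getsizeof(as_gen))   # a few hundred bytes at most
```

The list's size here covers only its internal pointer array; the dictionaries it references consume even more memory on top of that.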

Yielding Tuples with Explicit Column Names

When yielding tuples instead of dictionaries, specify column names using the columns parameter:

import pandas as pd

def number_gen():
    for i in range(3):
        yield (i, i * 10, i * 100)

df = pd.DataFrame(number_gen(), columns=['A', 'B', 'C'])

print(df)

Output:

   A   B    C
0  0   0    0
1  1  10  100
2  2  20  200

Dictionaries are generally preferred because they make the column mapping explicit within the generator itself, but tuples can be more efficient when processing speed matters.
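A middle ground worth knowing about is yielding namedtuples: Pandas can read the field names from a namedtuple, so you get tuple-like construction without passing columns explicitly. A brief sketch (the Row type here is an illustrative name, not part of any library):

```python
from collections import namedtuple
import pandas as pd

Row = namedtuple('Row', ['A', 'B', 'C'])

def number_gen():
    for i in range(3):
        yield Row(i, i * 10, i * 100)

# Pandas picks up the column names from the namedtuple's fields
df = pd.DataFrame(number_gen())
print(df.columns.tolist())  # ['A', 'B', 'C']
```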

Processing Large Files Line by Line

One of the most practical uses of generators is reading large files without loading the entire contents into memory. The file is processed one line at a time:

import pandas as pd

def parse_log_file(filepath):
    """Parse log entries one at a time."""
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) >= 3:
                yield {
                    'timestamp': parts[0],
                    'level': parts[1],
                    'message': parts[2]
                }

df = pd.DataFrame(parse_log_file('app.log'))

This approach works well for CSV-like files, server logs, and any text-based data source where each line represents a record. The file handle is opened, lines are yielded one by one, and the file is closed automatically when the generator is exhausted.
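To try the pattern end to end without needing an existing app.log, the sketch below writes a small sample file to a temporary location first. The log contents are invented for illustration; note how the length check silently skips a malformed line:

```python
import os
import tempfile
import pandas as pd

def parse_log_file(filepath):
    """Parse comma-separated log entries one at a time."""
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) >= 3:
                yield {
                    'timestamp': parts[0],
                    'level': parts[1],
                    'message': parts[2]
                }

# Write a small sample log so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as f:
    f.write('2024-01-01T10:00:00,INFO,service started\n')
    f.write('2024-01-01T10:00:05,ERROR,disk full\n')
    f.write('malformed line\n')  # fewer than 3 fields, so it is skipped
    path = f.name

df = pd.DataFrame(parse_log_file(path))
os.remove(path)
print(df)  # two rows; the malformed line was filtered out
```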

Streaming Data from Paginated APIs

APIs that return data across multiple pages are a natural fit for generators. The generator handles the pagination logic while the calling code simply receives a stream of records:

import pandas as pd

def fetch_paginated_data(api_client):
    """Yield records from a paginated API endpoint."""
    page = 1
    while True:
        response = api_client.get(page=page)
        if not response['data']:
            break
        for record in response['data']:
            yield record
        page += 1

# Collects all pages into a single DataFrame
df = pd.DataFrame(fetch_paginated_data(client))

The generator fetches one page at a time, yields each record individually, and stops when an empty page signals the end of the data. This keeps only one page in memory at any point during the fetching process.
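Since the api_client above is left abstract, here is a runnable sketch with a hypothetical stand-in client that serves records in pages of two. FakeClient exists only for illustration; a real client would issue HTTP requests with the same get(page=...) shape:

```python
import pandas as pd

class FakeClient:
    """Hypothetical in-memory stand-in for a paginated API client."""
    def __init__(self, records, page_size=2):
        self._pages = [records[i:i + page_size]
                       for i in range(0, len(records), page_size)]

    def get(self, page):
        # Pages are 1-indexed; out-of-range pages return empty data
        data = self._pages[page - 1] if page <= len(self._pages) else []
        return {'data': data}

def fetch_paginated_data(api_client):
    """Yield records from a paginated API endpoint."""
    page = 1
    while True:
        response = api_client.get(page=page)
        if not response['data']:
            break
        for record in response['data']:
            yield record
        page += 1

client = FakeClient([{'id': i} for i in range(5)])
df = pd.DataFrame(fetch_paginated_data(client))
print(len(df))  # 5 records collected across 3 pages
```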

Common Mistake: Converting to a List First

A frequent error is converting the generator to a list before passing it to Pandas. This loads all the data into memory at once, completely defeating the purpose of using a generator:

import pandas as pd

def data_gen():
    for i in range(1000000):
        yield {'x': i, 'y': i * 2}

# Wrong: loads all 1 million records into a list first
df = pd.DataFrame(list(data_gen()))

# Correct: Pandas consumes the generator directly
df = pd.DataFrame(data_gen())
Warning: Wrapping a generator in list() before passing it to pd.DataFrame() forces all values into memory simultaneously. Pass the generator directly to preserve its memory efficiency. Note that Pandas does eventually hold all the data in the resulting DataFrame, but avoiding the intermediate list prevents having two copies in memory during construction.

Important: Generators Are Single-Use

A generator can only be consumed once. After Pandas has read all the values, the generator is exhausted and cannot be reused:

import pandas as pd

def simple_gen():
    yield {'A': 1}
    yield {'A': 2}

gen = simple_gen()

# First use: works fine
df1 = pd.DataFrame(gen)
print(f"First DataFrame: {len(df1)} rows")

# Second use: generator is exhausted, produces empty DataFrame
df2 = pd.DataFrame(gen)
print(f"Second DataFrame: {len(df2)} rows")

Output:

First DataFrame: 2 rows
Second DataFrame: 0 rows

If you need to create multiple DataFrames from the same data, either call the generator function again to create a new generator, or store the result in a variable after the first conversion.
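Both workarounds look like this in practice: calling the generator function again produces a fresh generator each time, and copying an already-built DataFrame avoids touching the generator at all:

```python
import pandas as pd

def simple_gen():
    yield {'A': 1}
    yield {'A': 2}

# Option 1: each call to simple_gen() returns a brand-new generator
df1 = pd.DataFrame(simple_gen())
df2 = pd.DataFrame(simple_gen())

# Option 2: reuse the first DataFrame instead of the exhausted generator
df3 = df1.copy()

print(len(df1), len(df2), len(df3))  # each has 2 rows
```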

Key Considerations

Aspect        Detail
Single-use    Generators are exhausted after one consumption
Memory        Much lower than lists during the generation phase
Row format    Yield dictionaries for automatic column names
Tuple format  Requires an explicit columns parameter
Performance   Slightly slower than lists for small datasets

Quick Reference

Pattern               Example
Generator function    def gen(): yield {'col': val}
Generator expression  ({'col': x} for x in iterable)
Create DataFrame      pd.DataFrame(gen())
With column names     pd.DataFrame(gen(), columns=['A', 'B'])

Generators are ideal for processing large files, streaming API responses, and any data source that would be impractical to load entirely into memory. Yield dictionaries to automatically define column names, and pass the generator directly to pd.DataFrame() without converting to a list first. Remember that generators are single-use, so call the generator function again if you need to create another DataFrame from the same data.