How to Add a Prefix or Suffix to All Columns in PySpark
Managing column names is a critical part of PySpark data engineering. When joining DataFrames that share column names or tracking data lineage through ETL pipelines, naming collisions can cause ambiguous references and hard-to-debug errors. Adding uniform prefixes like source_ or suffixes like _raw to all columns prevents these issues and keeps your data clearly organized.
This guide demonstrates efficient, scalable approaches for bulk column renaming in PySpark, explains why some common patterns should be avoided, and provides reusable utility functions you can drop into any project.
Using select() with alias() (Recommended)
The most performant approach uses a list comprehension inside .select(). This creates a single projection in Spark's execution plan, regardless of how many columns you rename:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Add prefix to all columns
prefix = "src_"
prefixed_df = df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])
prefixed_df.show()
Output:
+------+--------+---------+
|src_id|src_name|src_score|
+------+--------+---------+
|     1|   Alice|      100|
|     2|     Bob|      200|
+------+--------+---------+
Adding a suffix follows the same pattern with a small adjustment to the f-string:
# Add suffix to all columns
suffix = "_raw"
suffixed_df = df.select([col(c).alias(f"{c}{suffix}") for c in df.columns])
suffixed_df.show()
Output:
+------+--------+---------+
|id_raw|name_raw|score_raw|
+------+--------+---------+
|     1|   Alice|      100|
|     2|     Bob|      200|
+------+--------+---------+
The select() approach generates a single projection in Spark's logical plan. This keeps it efficient even for DataFrames with hundreds of columns, since the Catalyst optimizer processes the rename as one step rather than many.
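The rename itself is just a one-pass mapping over the column names. As a minimal plain-Python sketch (no Spark session needed), the mapping that the single projection applies looks like this, with a literal list standing in for df.columns:

```python
# Plain-Python sketch of the one-pass rename mapping that a single
# select() + alias() projection expresses.
columns = ["id", "name", "score"]  # stand-in for df.columns
prefix = "src_"

# One comprehension produces the full old-name -> new-name mapping
rename_map = {c: f"{prefix}{c}" for c in columns}
print(rename_map)
# {'id': 'src_id', 'name': 'src_name', 'score': 'src_score'}
```

However many columns the DataFrame has, this stays one mapping and one projection.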
Quick Schema Overwrite with toDF()
When you want to replace all column names at once, toDF() provides the most concise syntax. You pass the full list of new names as arguments:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Generate new names with prefix
new_names = [f"tbl_{c}" for c in df.columns]
# Apply all names at once
renamed_df = df.toDF(*new_names)
renamed_df.printSchema()
Output:
root
 |-- tbl_id: long (nullable = true)
 |-- tbl_name: string (nullable = true)
 |-- tbl_score: long (nullable = true)
Suffixes are commonly used for data versioning. For example, df.toDF(*[f"{c}_v2" for c in df.columns]) clearly marks a transformed version of your data, making it easy to distinguish from the original during debugging.
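As a quick sanity check, the name list that toDF() receives can be built and inspected in plain Python before touching any DataFrame; the literal list here stands in for df.columns:

```python
# Build the versioned name list that would be unpacked into toDF()
columns = ["id", "name", "score"]  # stand-in for df.columns
versioned = [f"{c}_v2" for c in columns]
print(versioned)  # ['id_v2', 'name_v2', 'score_v2']
```

Inspecting the list first is a cheap way to catch typos in the affix before they land in a schema.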
Building Reusable Utility Functions
Wrapping the renaming logic in helper functions ensures consistent usage across your entire codebase and reduces the chance of typos or formatting inconsistencies:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
def add_prefix(df: DataFrame, prefix: str) -> DataFrame:
    """Add a prefix to all column names."""
    return df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])

def add_suffix(df: DataFrame, suffix: str) -> DataFrame:
    """Add a suffix to all column names."""
    return df.select([col(c).alias(f"{c}{suffix}") for c in df.columns])

def add_affix(df: DataFrame, prefix: str = "", suffix: str = "") -> DataFrame:
    """Add a prefix and/or suffix to all column names."""
    return df.select([col(c).alias(f"{prefix}{c}{suffix}") for c in df.columns])
Using these functions is straightforward:
staging_df = add_prefix(df, "stg_")
raw_df = add_suffix(df, "_raw")
combined_df = add_affix(df, prefix="src_", suffix="_v1")
combined_df.show()
Output:
+---------+-----------+------------+
|src_id_v1|src_name_v1|src_score_v1|
+---------+-----------+------------+
|        1|      Alice|         100|
|        2|        Bob|         200|
+---------+-----------+------------+
Selective Column Renaming
Sometimes you only need to rename specific columns while leaving others unchanged. A conditional expression inside the list comprehension handles this cleanly:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Only prefix certain columns
columns_to_prefix = ["id", "score"]
renamed_cols = [
    col(c).alias(f"new_{c}") if c in columns_to_prefix else col(c)
    for c in df.columns
]
selective_df = df.select(renamed_cols)
selective_df.show()
Output:
+------+-----+---------+
|new_id| name|new_score|
+------+-----+---------+
|     1|Alice|      100|
|     2|  Bob|      200|
+------+-----+---------+
The name column remains unchanged because it is not in the columns_to_prefix list.
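The conditional logic is easiest to verify on the names alone. A plain-Python sketch of the same rule, with a literal list standing in for df.columns:

```python
columns = ["id", "name", "score"]   # stand-in for df.columns
columns_to_prefix = ["id", "score"]

# Same conditional as the alias() comprehension, applied to names only
new_names = [f"new_{c}" if c in columns_to_prefix else c for c in columns]
print(new_names)  # ['new_id', 'name', 'new_score']
```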
Why You Should Avoid Loop-Based Renaming
A common but problematic approach is to loop through columns and call withColumnRenamed on each one:
# Do NOT do this for bulk renaming
result = df
for c in df.columns:
    result = result.withColumnRenamed(c, f"new_{c}")
This works functionally, but each call to withColumnRenamed adds a new layer to Spark's logical plan. For a DataFrame with a handful of columns, this is barely noticeable. For DataFrames with 100 or more columns, it creates deeply nested plans that cause serious problems.
Iterative withColumnRenamed calls on wide DataFrames can cause:
- StackOverflowError during plan analysis due to excessive recursion depth
- Significantly longer query optimization time, as the Catalyst optimizer must process each nested layer
- Increased memory pressure on the driver node
Always use select() with alias() or toDF() for bulk column renaming.
A single withColumnRenamed call is perfectly fine when you need to rename just one or two specific columns. The problem only arises when it is used in a loop for bulk operations.
Method Comparison
| Method | Performance | Best Use Case |
|---|---|---|
| df.select([col(c).alias(...)]) | Excellent | Large schemas, production ETL pipelines |
| df.toDF(*new_names) | Excellent | Complete schema replacement |
| Single withColumnRenamed | Good | Renaming one or two specific columns |
| withColumnRenamed in a loop | Poor | Avoid for bulk operations |
Practical Example: Preventing Column Collisions Before a Join
One of the most practical uses of bulk prefixing is preparing DataFrames for a join. When two DataFrames share column names, joining them without renaming first creates ambiguous column references that are difficult to resolve after the fact:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
def add_prefix(df, prefix):
    return df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])
# Two DataFrames with overlapping column names
orders = spark.createDataFrame([(1, "2024-01-01")], ["id", "date"])
customers = spark.createDataFrame([(1, "Alice")], ["id", "name"])
# Add distinct prefixes to avoid ambiguity
orders_prefixed = add_prefix(orders, "order_")
customers_prefixed = add_prefix(customers, "cust_")
# Join without any column name collisions
joined = orders_prefixed.join(
    customers_prefixed,
    orders_prefixed.order_id == customers_prefixed.cust_id
)
joined.show()
Output:
+--------+----------+-------+---------+
|order_id|order_date|cust_id|cust_name|
+--------+----------+-------+---------+
|       1|2024-01-01|      1|    Alice|
+--------+----------+-------+---------+
Without the prefixes, both DataFrames would have a column called id, and any subsequent reference to id after the join would be ambiguous. Prefixing before the join eliminates this problem entirely.
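Before relying on prefixes, it can help to confirm which names actually collide. A small hypothetical helper (overlapping_columns is not part of PySpark; it operates on plain column-name lists):

```python
def overlapping_columns(left_cols, right_cols):
    """Return the column names that appear in both schemas, sorted."""
    return sorted(set(left_cols) & set(right_cols))

# Stand-ins for orders.columns and customers.columns
orders_cols = ["id", "date"]
customers_cols = ["id", "name"]
print(overlapping_columns(orders_cols, customers_cols))  # ['id']
```

An empty result means the join can proceed without renaming; anything else is a collision worth prefixing away first.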
Summary
When you need to add prefixes or suffixes to all columns in a PySpark DataFrame, use df.select() with col().alias() for the best combination of performance, readability, and flexibility.
- Use df.toDF() when you want a concise one-liner for complete schema replacement.
- Wrap either approach in reusable utility functions to maintain consistency across your codebase.
- Avoid calling withColumnRenamed in a loop for bulk operations, as it creates deeply nested execution plans that degrade performance and can cause errors on wide DataFrames.
- For selective renaming of just one or two columns, a single withColumnRenamed call remains a clean and efficient choice.