How to Add a Prefix or Suffix to All Columns in PySpark
Managing column names is a critical part of PySpark data engineering. When joining DataFrames that share column names or tracking data lineage through ETL pipelines, naming collisions can cause ambiguous references and hard-to-debug errors. Adding uniform prefixes like source_ or suffixes like _raw to all columns prevents these issues and keeps your data clearly organized.
This guide demonstrates efficient, scalable approaches for bulk column renaming in PySpark, explains why some common patterns should be avoided, and provides reusable utility functions you can drop into any project.
Using select() with alias() (Recommended)
The most performant approach uses a list comprehension inside .select(). This creates a single projection in Spark's execution plan, regardless of how many columns you rename:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Add prefix to all columns
prefix = "src_"
prefixed_df = df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])
prefixed_df.show()
Output:
+------+--------+---------+
|src_id|src_name|src_score|
+------+--------+---------+
|     1|   Alice|      100|
|     2|     Bob|      200|
+------+--------+---------+
Adding a suffix follows the same pattern with a small adjustment to the f-string:
# Add suffix to all columns
suffix = "_raw"
suffixed_df = df.select([col(c).alias(f"{c}{suffix}") for c in df.columns])
suffixed_df.show()
Output:
+------+--------+---------+
|id_raw|name_raw|score_raw|
+------+--------+---------+
|     1|   Alice|      100|
|     2|     Bob|      200|
+------+--------+---------+
The select() approach generates a single projection in Spark's logical plan. This keeps it efficient even for DataFrames with hundreds of columns, since the Catalyst optimizer processes the rename as one step rather than many.
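The rename itself is just a one-pass mapping over the column names. As a minimal plain-Python sketch (no Spark session needed), the mapping that the single projection applies looks like this, with a literal list standing in for df.columns:

```python
# Plain-Python sketch of the one-pass rename mapping that a single
# select() + alias() projection expresses.
columns = ["id", "name", "score"]  # stand-in for df.columns
prefix = "src_"

# One comprehension produces the full old-name -> new-name mapping
rename_map = {c: f"{prefix}{c}" for c in columns}
print(rename_map)
# {'id': 'src_id', 'name': 'src_name', 'score': 'src_score'}
```

However many columns the DataFrame has, this stays one mapping and one projection.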
Quick Schema Overwrite with toDF()
When you want to replace all column names at once, toDF() provides the most concise syntax. You pass the full list of new names as arguments:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Generate new names with prefix
new_names = [f"tbl_{c}" for c in df.columns]
# Apply all names at once
renamed_df = df.toDF(*new_names)
renamed_df.printSchema()
Output:
root
 |-- tbl_id: long (nullable = true)
 |-- tbl_name: string (nullable = true)
 |-- tbl_score: long (nullable = true)
Suffixes are commonly used for data versioning. For example, df.toDF(*[f"{c}_v2" for c in df.columns]) clearly marks a transformed version of your data, making it easy to distinguish from the original during debugging.
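As a quick sanity check, the name list that toDF() receives can be built and inspected in plain Python before touching any DataFrame; the literal list here stands in for df.columns:

```python
# Build the versioned name list that would be unpacked into toDF()
columns = ["id", "name", "score"]  # stand-in for df.columns
versioned = [f"{c}_v2" for c in columns]
print(versioned)  # ['id_v2', 'name_v2', 'score_v2']
```

Inspecting the list first is a cheap way to catch typos in the affix before they land in a schema.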
Building Reusable Utility Functions
Wrapping the renaming logic in helper functions ensures consistent usage across your entire codebase and reduces the chance of typos or formatting inconsistencies:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
def add_prefix(df: DataFrame, prefix: str) -> DataFrame:
    """Add a prefix to all column names."""
    return df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])

def add_suffix(df: DataFrame, suffix: str) -> DataFrame:
    """Add a suffix to all column names."""
    return df.select([col(c).alias(f"{c}{suffix}") for c in df.columns])

def add_affix(df: DataFrame, prefix: str = "", suffix: str = "") -> DataFrame:
    """Add a prefix and/or suffix to all column names."""
    return df.select([col(c).alias(f"{prefix}{c}{suffix}") for c in df.columns])
Using these functions is straightforward:
staging_df = add_prefix(df, "stg_")
raw_df = add_suffix(df, "_raw")
combined_df = add_affix(df, prefix="src_", suffix="_v1")
combined_df.show()
Output:
+---------+-----------+------------+
|src_id_v1|src_name_v1|src_score_v1|
+---------+-----------+------------+
|        1|      Alice|         100|
|        2|        Bob|         200|
+---------+-----------+------------+
Selective Column Renaming
Sometimes you only need to rename specific columns while leaving others unchanged. A conditional expression inside the list comprehension handles this cleanly:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "score"])
# Only prefix certain columns
columns_to_prefix = ["id", "score"]
renamed_cols = [
    col(c).alias(f"new_{c}") if c in columns_to_prefix else col(c)
    for c in df.columns
]
selective_df = df.select(renamed_cols)
selective_df.show()
Output:
+------+-----+---------+
|new_id| name|new_score|
+------+-----+---------+
|     1|Alice|      100|
|     2|  Bob|      200|
+------+-----+---------+
The name column remains unchanged because it is not in the columns_to_prefix list.
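The conditional logic is easiest to verify on the names alone. A plain-Python sketch of the same rule, with a literal list standing in for df.columns:

```python
columns = ["id", "name", "score"]   # stand-in for df.columns
columns_to_prefix = ["id", "score"]

# Same conditional as the alias() comprehension, applied to names only
new_names = [f"new_{c}" if c in columns_to_prefix else c for c in columns]
print(new_names)  # ['new_id', 'name', 'new_score']
```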
Why You Should Avoid Loop-Based Renaming
A common but problematic approach is to loop through columns and call withColumnRenamed on each one:
# Do NOT do this for bulk renaming
result = df
for c in df.columns:
    result = result.withColumnRenamed(c, f"new_{c}")
This works functionally, but each call to withColumnRenamed adds a new layer to Spark's logical plan. For a DataFrame with a handful of columns, this is barely noticeable. For DataFrames with 100 or more columns, it creates deeply nested plans that cause serious problems.
Iterative withColumnRenamed calls on wide DataFrames can cause:
- StackOverflowError during plan analysis due to excessive recursion depth
- Significantly longer query optimization time, as the Catalyst optimizer must process each nested layer
- Increased memory pressure on the driver node
Always use select() with alias() or toDF() for bulk column renaming.
A single withColumnRenamed call is perfectly fine when you need to rename just one or two specific columns. The problem only arises when it is used in a loop for bulk operations.
Method Comparison
| Method | Performance | Best Use Case |
|---|---|---|
| df.select([col(c).alias(...)]) | Excellent | Large schemas, production ETL pipelines |
| df.toDF(*new_names) | Excellent | Complete schema replacement |
| Single withColumnRenamed | Good | Renaming one or two specific columns |
| withColumnRenamed in a loop | Poor | Avoid for bulk operations |
Practical Example: Preventing Column Collisions Before a Join
One of the most practical uses of bulk prefixing is preparing DataFrames for a join. When two DataFrames share column names, joining them without renaming first creates ambiguous column references that are difficult to resolve after the fact:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ColumnRename").getOrCreate()
def add_prefix(df, prefix):
    return df.select([col(c).alias(f"{prefix}{c}") for c in df.columns])
# Two DataFrames with overlapping column names
orders = spark.createDataFrame([(1, "2024-01-01")], ["id", "date"])
customers = spark.createDataFrame([(1, "Alice")], ["id", "name"])
# Add distinct prefixes to avoid ambiguity
orders_prefixed = add_prefix(orders, "order_")
customers_prefixed = add_prefix(customers, "cust_")
# Join without any column name collisions
joined = orders_prefixed.join(
    customers_prefixed,
    orders_prefixed.order_id == customers_prefixed.cust_id
)
joined.show()
Output:
+--------+----------+-------+---------+
|order_id|order_date|cust_id|cust_name|
+--------+----------+-------+---------+
|       1|2024-01-01|      1|    Alice|
+--------+----------+-------+---------+
Without the prefixes, both DataFrames would have a column called id, and any subsequent reference to id after the join would be ambiguous. Prefixing before the join eliminates this problem entirely.
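Before relying on prefixes, it can help to confirm which names actually collide. A small hypothetical helper (overlapping_columns is not part of PySpark; it operates on plain column-name lists):

```python
def overlapping_columns(left_cols, right_cols):
    """Return the column names that appear in both schemas, sorted."""
    return sorted(set(left_cols) & set(right_cols))

# Stand-ins for orders.columns and customers.columns
orders_cols = ["id", "date"]
customers_cols = ["id", "name"]
print(overlapping_columns(orders_cols, customers_cols))  # ['id']
```

An empty result means the join can proceed without renaming; anything else is a collision worth prefixing away first.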
Summary
When you need to add prefixes or suffixes to all columns in a PySpark DataFrame, use df.select() with col().alias() for the best combination of performance, readability, and flexibility.
- Use df.toDF() when you want a concise one-liner for complete schema replacement.
- Wrap either approach in reusable utility functions to maintain consistency across your codebase.
- Avoid calling withColumnRenamed in a loop for bulk operations, as it creates deeply nested execution plans that degrade performance and can cause errors on wide DataFrames.
- For selective renaming of just one or two columns, a single withColumnRenamed call remains a clean and efficient choice.