How to Duplicate a Row N Times in a PySpark DataFrame
When working with PySpark DataFrames, you may need to duplicate rows, whether for data augmentation, testing with larger datasets, generating repeated records based on a column value, or creating weighted samples. PySpark provides several approaches to replicate rows efficiently across distributed data.
In this guide, you'll learn multiple methods to duplicate rows in a PySpark DataFrame, from column-value-based repetition to fixed N-time duplication.
Setting Up the Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
data = [
    ("Alice", 2, "Engineering"),
    ("Bob", 3, "Marketing"),
    ("Charlie", 1, "Sales"),
    ("Diana", 4, "Engineering")
]
columns = ["Name", "Repeat_Count", "Department"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
+-------+------------+-----------+
Method 1: Repeating Rows Based on a Column Value
When you want each row duplicated a number of times specified by a column value, use array_repeat() with explode():
from pyspark.sql.functions import expr
# Repeat each row based on the 'Repeat_Count' column
df_repeated = df.withColumn(
    "Repeat_Count",
    expr("explode(array_repeat(Repeat_Count, int(Repeat_Count)))")
)
df_repeated.show()
Output:
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+
How it works:
- array_repeat(Repeat_Count, int(Repeat_Count)) creates an array containing the value repeated N times (e.g., [2, 2] for Alice). The int() cast is needed because array_repeat expects an integer count.
- explode() converts each array element into a separate row.
- The result is each row appearing as many times as specified by its Repeat_Count value.
Method 2: Duplicating All Rows a Fixed N Times
To duplicate every row the same number of times, create an array of size N and explode it:
from pyspark.sql.functions import explode, array_repeat, lit
n = 3 # Number of times to duplicate each row
df_duplicated = df.withColumn("temp", explode(array_repeat(lit(1), n)))
df_duplicated = df_duplicated.drop("temp")
df_duplicated.show()
Output:
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+
Each row appears exactly 3 times. The temporary column created by explode() is dropped since it's no longer needed.
Using lit(1) creates a dummy value to repeat. The actual value doesn't matter; explode() simply generates N output rows per input row. You could use lit("x") or any other constant.
Method 3: Using union() in a Loop
A straightforward approach is to union the DataFrame with itself N times:
n = 3 # Number of total copies (original + duplicates)
df_result = df
for _ in range(n - 1):
    df_result = df_result.union(df)
print(f"Original rows: {df.count()}")
print(f"After duplication: {df_result.count()}")
df_result.orderBy("Name").show()
Output:
Original rows: 4
After duplication: 12
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+
The union() loop approach creates a chain of DataFrames in the execution plan. For small values of N (2-5), this is fine. For large N, the plan becomes deeply nested, which can degrade performance or even trigger a StackOverflowError during plan analysis. Prefer the explode(array_repeat()) method for large N.
Method 4: Using flatMap() on RDD
For maximum flexibility, convert to RDD and use flatMap() to replicate rows:
n = 3
# Convert to RDD, replicate, and convert back
rdd_repeated = df.rdd.flatMap(lambda row: [row] * n)
df_result = spark.createDataFrame(rdd_repeated, df.schema)
df_result.show()
Output:
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+
This approach is simple and works well, but converting between DataFrame and RDD adds serialization overhead and bypasses the Catalyst optimizer, so prefer the DataFrame-native methods when performance matters.
Method 5: Duplicating a Specific Row
To duplicate a specific row (not all rows), filter it first and then union:
from functools import reduce
from pyspark.sql import DataFrame
# Select the row to duplicate
target_row = df.filter(df.Name == "Alice")
n = 4 # Number of copies
# Create N copies and union them
copies = [target_row] * n
duplicated_row = reduce(DataFrame.union, copies)
# Combine with original DataFrame
df_result = df.union(duplicated_row)
df_result.show()
Output:
+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
+-------+------------+-----------+
Alice's row now appears 5 times total (1 original + 4 copies).
Adding a Copy Identifier
To track which copy each row belongs to, add an index column during duplication:
from pyspark.sql.functions import explode, sequence, lit
n = 3
df_with_copy_id = df.withColumn(
    "copy_id",
    explode(sequence(lit(1), lit(n)))
)
df_with_copy_id.orderBy("Name", "copy_id").show()
Output:
+-------+------------+-----------+-------+
| Name|Repeat_Count| Department|copy_id|
+-------+------------+-----------+-------+
| Alice| 2|Engineering| 1|
| Alice| 2|Engineering| 2|
| Alice| 2|Engineering| 3|
| Bob| 3| Marketing| 1|
| Bob| 3| Marketing| 2|
| Bob| 3| Marketing| 3|
|Charlie| 1| Sales| 1|
|Charlie| 1| Sales| 2|
|Charlie| 1| Sales| 3|
| Diana| 4|Engineering| 1|
| Diana| 4|Engineering| 2|
| Diana| 4|Engineering| 3|
+-------+------------+-----------+-------+
The copy_id column identifies each copy (1 = original, 2 = first copy, etc.).
Quick Comparison of Methods
| Method | Fixed N | Column-Based N | Specific Row | Performance |
|---|---|---|---|---|
| explode(array_repeat()) | ✅ | ✅ | ❌ | ⚡ Best |
| union() loop | ✅ | ❌ | ✅ | 🔶 OK for small N |
| flatMap() on RDD | ✅ | ✅ | ❌ | 🔶 Moderate (RDD overhead) |
| Filter + union | ❌ | ❌ | ✅ | 🔶 Moderate |
Conclusion
PySpark provides several ways to duplicate rows in a DataFrame:
- Use explode(array_repeat()) for the most efficient and scalable approach. It works natively with Spark's Catalyst optimizer and handles both fixed N and column-based repetition.
- Use union() in a loop for simple cases with small N values or when duplicating specific rows.
- Use flatMap() on RDD when you need maximum flexibility in the replication logic.
- Add a copy_id column with sequence() when you need to track which copy each row represents.
For most production use cases, the explode(array_repeat()) approach is the recommended method due to its performance and simplicity.