
How to Duplicate a Row N Times in a PySpark DataFrame

When working with PySpark DataFrames, you may need to duplicate rows, whether for data augmentation, testing with larger datasets, generating repeated records based on a column value, or creating weighted samples. PySpark provides several approaches to replicate rows efficiently across distributed data.

In this guide, you'll learn multiple methods to duplicate rows in a PySpark DataFrame, from column-value-based repetition to fixed N-time duplication.

Setting Up the Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()

data = [
    ("Alice", 2, "Engineering"),
    ("Bob", 3, "Marketing"),
    ("Charlie", 1, "Sales"),
    ("Diana", 4, "Engineering")
]

columns = ["Name", "Repeat_Count", "Department"]
df = spark.createDataFrame(data, columns)

df.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
+-------+------------+-----------+

Method 1: Repeating Rows Based on a Column Value

When you want each row duplicated a number of times specified by a column value, use array_repeat() with explode():

from pyspark.sql.functions import expr

# Repeat each row based on the 'Repeat_Count' column
df_repeated = df.withColumn(
    "Repeat_Count",
    expr("explode(array_repeat(Repeat_Count, int(Repeat_Count)))")
)

df_repeated.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

How it works:

  1. array_repeat(Repeat_Count, int(Repeat_Count)) creates an array containing the value repeated N times (e.g., [2, 2] for Alice).
  2. explode() converts each array element into a separate row.
  3. The result is each row appearing as many times as specified by its Repeat_Count value.

Method 2: Duplicating All Rows a Fixed N Times

To duplicate every row the same number of times, create an array of size N and explode it:

from pyspark.sql.functions import explode, array_repeat, lit, col

n = 3 # Number of times to duplicate each row

df_duplicated = df.withColumn("temp", explode(array_repeat(lit(1), n)))
df_duplicated = df_duplicated.drop("temp")

df_duplicated.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

Each row appears exactly 3 times. The temporary column created by explode() is dropped since it's not needed.

tip

Using lit(1) creates a dummy value to repeat. The actual value doesn't matter; explode() simply generates N rows. You could use lit("x") or any constant.

Method 3: Using union() in a Loop

A straightforward approach is to union the DataFrame with itself N times:

n = 3  # Number of total copies (original + duplicates)

df_result = df
for _ in range(n - 1):
    df_result = df_result.union(df)

print(f"Original rows: {df.count()}")
print(f"After duplication: {df_result.count()}")

df_result.orderBy("Name").show()

Output:

Original rows: 4
After duplication: 12

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

Performance Note

The union() loop approach creates a chain of DataFrames in the execution plan. For small values of N (2-5), this is fine. For large N values, the execution plan becomes deeply nested and can cause performance issues or stack overflow errors. Prefer the explode(array_repeat()) method for large N.

Method 4: Using flatMap() on RDD

For maximum flexibility, convert to RDD and use flatMap() to replicate rows:

n = 3

# Convert to RDD, replicate, and convert back
rdd_repeated = df.rdd.flatMap(lambda row: [row] * n)
df_result = spark.createDataFrame(rdd_repeated, df.schema)

df_result.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

This approach is simple and works well, but converting between DataFrame and RDD adds some overhead.

Method 5: Duplicating a Specific Row

To duplicate a specific row (not all rows), filter it first and then union:

from functools import reduce
from pyspark.sql import DataFrame

# Select the row to duplicate
target_row = df.filter(df.Name == "Alice")
n = 4 # Number of copies

# Create N copies and union them
copies = [target_row] * n
duplicated_row = reduce(DataFrame.union, copies)

# Combine with original DataFrame
df_result = df.union(duplicated_row)

df_result.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
+-------+------------+-----------+

Alice's row now appears 5 times total (1 original + 4 copies).

Adding a Copy Identifier

To track which copy each row belongs to, add an index column during duplication:

from pyspark.sql.functions import explode, sequence, lit

n = 3

df_with_copy_id = df.withColumn(
    "copy_id",
    explode(sequence(lit(1), lit(n)))
)

df_with_copy_id.orderBy("Name", "copy_id").show()

Output:

+-------+------------+-----------+-------+
| Name|Repeat_Count| Department|copy_id|
+-------+------------+-----------+-------+
| Alice| 2|Engineering| 1|
| Alice| 2|Engineering| 2|
| Alice| 2|Engineering| 3|
| Bob| 3| Marketing| 1|
| Bob| 3| Marketing| 2|
| Bob| 3| Marketing| 3|
|Charlie| 1| Sales| 1|
|Charlie| 1| Sales| 2|
|Charlie| 1| Sales| 3|
| Diana| 4|Engineering| 1|
| Diana| 4|Engineering| 2|
| Diana| 4|Engineering| 3|
+-------+------------+-----------+-------+

The copy_id column identifies each copy (1 = original, 2 = first copy, etc.).

Quick Comparison of Methods

Method                   | Fixed N | Column-Based N | Specific Row | Performance
-------------------------|---------|----------------|--------------|---------------------------
explode(array_repeat())  | Yes     | Yes            | No           | ⚡ Best
union() loop             | Yes     | No             | Yes          | 🔶 OK for small N
flatMap() on RDD         | Yes     | Yes            | No           | 🔶 Moderate (RDD overhead)
Filter + union           | No      | No             | Yes          | 🔶 Moderate

Conclusion

PySpark provides several ways to duplicate rows in a DataFrame:

  • Use explode(array_repeat()) for the most efficient and scalable approach. It works natively with Spark's catalyst optimizer and handles both fixed N and column-based repetition.
  • Use union() in a loop for simple cases with small N values or when duplicating specific rows.
  • Use flatMap() on RDD when you need maximum flexibility in the replication logic.
  • Add a copy_id column with sequence() when you need to track which copy each row represents.

For most production use cases, the explode(array_repeat()) approach is the recommended method due to its performance and simplicity.