
How to Duplicate a Row N Times in a PySpark DataFrame

When working with PySpark DataFrames, you may need to duplicate rows, whether for data augmentation, testing with larger datasets, generating repeated records based on a column value, or creating weighted samples. PySpark provides several approaches to replicate rows efficiently across distributed data.

In this guide, you'll learn multiple methods to duplicate rows in a PySpark DataFrame, from column-value-based repetition to fixed N-time duplication.

Setting Up the Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()

data = [
    ("Alice", 2, "Engineering"),
    ("Bob", 3, "Marketing"),
    ("Charlie", 1, "Sales"),
    ("Diana", 4, "Engineering")
]

columns = ["Name", "Repeat_Count", "Department"]
df = spark.createDataFrame(data, columns)

df.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
+-------+------------+-----------+

Method 1: Repeating Rows Based on a Column Value

When you want each row duplicated a number of times specified by a column value, use array_repeat() with explode():

from pyspark.sql.functions import expr

# Repeat each row based on the 'Repeat_Count' column
df_repeated = df.withColumn(
    "Repeat_Count",
    expr("explode(array_repeat(Repeat_Count, int(Repeat_Count)))")
)

df_repeated.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

How it works:

  1. array_repeat(Repeat_Count, int(Repeat_Count)) creates an array containing the value repeated N times (e.g., [2, 2] for Alice).
  2. explode() converts each array element into a separate row.
  3. The result is each row appearing as many times as specified by its Repeat_Count value.

Method 2: Duplicating All Rows a Fixed N Times

To duplicate every row the same number of times, create an array of size N and explode it:

from pyspark.sql.functions import explode, array_repeat, lit, col

n = 3 # Number of times to duplicate each row

df_duplicated = df.withColumn("temp", explode(array_repeat(lit(1), n)))
df_duplicated = df_duplicated.drop("temp")

df_duplicated.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

Each row appears exactly 3 times. The temporary column created by explode() is dropped since it's not needed.

tip

Using lit(1) creates a dummy value to repeat. The actual value doesn't matter; explode() simply generates N rows. You could use lit("x") or any constant.

Method 3: Using union() in a Loop

A straightforward approach is to union the DataFrame with itself N times:

n = 3  # Number of total copies (original + duplicates)

df_result = df
for _ in range(n - 1):
    df_result = df_result.union(df)

print(f"Original rows: {df.count()}")
print(f"After duplication: {df_result.count()}")

df_result.orderBy("Name").show()

Output:

Original rows: 4
After duplication: 12

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

Performance Note

The union() loop approach creates a chain of DataFrames in the execution plan. For small values of N (2-5), this is fine. For large N values, the execution plan becomes deeply nested and can cause performance issues or stack overflow errors. Prefer the explode(array_repeat()) method for large N.

Method 4: Using flatMap() on RDD

For maximum flexibility, convert to RDD and use flatMap() to replicate rows:

n = 3

# Convert to RDD, replicate, and convert back
rdd_repeated = df.rdd.flatMap(lambda row: [row] * n)
df_result = spark.createDataFrame(rdd_repeated, df.schema)

df_result.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
| Diana| 4|Engineering|
+-------+------------+-----------+

This approach is simple and works well, but converting between DataFrame and RDD adds some overhead.

Method 5: Duplicating a Specific Row

To duplicate a specific row (not all rows), filter it first and then union:

from functools import reduce
from pyspark.sql import DataFrame

# Select the row to duplicate
target_row = df.filter(df.Name == "Alice")
n = 4 # Number of copies

# Create N copies and union them
copies = [target_row] * n
duplicated_row = reduce(DataFrame.union, copies)

# Combine with original DataFrame
df_result = df.union(duplicated_row)

df_result.show()

Output:

+-------+------------+-----------+
| Name|Repeat_Count| Department|
+-------+------------+-----------+
| Alice| 2|Engineering|
| Bob| 3| Marketing|
|Charlie| 1| Sales|
| Diana| 4|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
| Alice| 2|Engineering|
+-------+------------+-----------+

Alice's row now appears 5 times total (1 original + 4 copies).

Adding a Copy Identifier

To track which copy each row belongs to, add an index column during duplication:

from pyspark.sql.functions import explode, sequence, lit

n = 3

df_with_copy_id = df.withColumn(
    "copy_id",
    explode(sequence(lit(1), lit(n)))
)

df_with_copy_id.orderBy("Name", "copy_id").show()

Output:

+-------+------------+-----------+-------+
| Name|Repeat_Count| Department|copy_id|
+-------+------------+-----------+-------+
| Alice| 2|Engineering| 1|
| Alice| 2|Engineering| 2|
| Alice| 2|Engineering| 3|
| Bob| 3| Marketing| 1|
| Bob| 3| Marketing| 2|
| Bob| 3| Marketing| 3|
|Charlie| 1| Sales| 1|
|Charlie| 1| Sales| 2|
|Charlie| 1| Sales| 3|
| Diana| 4|Engineering| 1|
| Diana| 4|Engineering| 2|
| Diana| 4|Engineering| 3|
+-------+------------+-----------+-------+

The copy_id column identifies each copy (1 = original, 2 = first copy, etc.).

Quick Comparison of Methods

Method                   | Fixed N | Column-Based N | Specific Row | Performance
-------------------------|---------|----------------|--------------|---------------------------
explode(array_repeat())  | Yes     | Yes            | No           | ⚡ Best
union() loop             | Yes     | No             | Yes          | 🔶 OK for small N
flatMap() on RDD         | Yes     | Yes            | No           | 🔶 Moderate (RDD overhead)
Filter + union           | No      | No             | Yes          | 🔶 Moderate

Conclusion

PySpark provides several ways to duplicate rows in a DataFrame:

  • Use explode(array_repeat()) for the most efficient and scalable approach. It works natively with Spark's catalyst optimizer and handles both fixed N and column-based repetition.
  • Use union() in a loop for simple cases with small N values or when duplicating specific rows.
  • Use flatMap() on RDD when you need maximum flexibility in the replication logic.
  • Add a copy_id column with sequence() when you need to track which copy each row represents.

For most production use cases, the explode(array_repeat()) approach is the recommended method due to its performance and simplicity.