Python PySpark: How to Check for a Substring in a PySpark DataFrame

Working with string data is extremely common in PySpark, especially when processing logs, identifiers, or semi-structured text. One frequent requirement is to check for or extract substrings from columns in a PySpark DataFrame - whether you're parsing composite fields, extracting codes from identifiers, or deriving new analytical columns.

In this guide, you'll learn multiple methods to extract and work with substrings in PySpark, including column-based APIs, SQL-style expressions, and filtering based on substring matches.

What Is a Substring?

A substring is a continuous sequence of characters within a larger string. For example:

  • Full string: "MH201411094334"
  • Substring at position 1, length 2: "MH" (the state code)
  • Substring at position 3, length 4: "2014" (the registration year)

In PySpark, substring operations are commonly used to:

  • Extract codes from composite identifiers
  • Parse dates stored as strings
  • Split fixed-width fields into meaningful columns
  • Filter rows based on partial string matches

Creating a Sample DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("substring_demo").getOrCreate()

columns = ["LicenseNo", "ExpiryDate"]

data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]

reg_df = spark.createDataFrame(data=data, schema=columns)
reg_df.show()

Output:

+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
|KN201822347800|2028-10-29|
+--------------+----------+

The LicenseNo column contains a composite identifier:

  • Characters 1–2: State code (e.g., MH)
  • Characters 3–6: Registration year (e.g., 2014)
  • Characters 7–14: Registration ID (e.g., 11094334)
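
For intuition, the same fixed-width layout can be sliced in plain Python first. Note that Python slices are 0-based, while the PySpark positions used in the rest of this guide are 1-based:

```python
# Sample value from the DataFrame above, sliced with Python's 0-based indexing
license_no = "MH201411094334"

state = license_no[0:2]     # characters 1-2 of the layout
reg_year = license_no[2:6]  # characters 3-6
reg_id = license_no[6:14]   # characters 7-14

print(state, reg_year, reg_id)  # MH 2014 11094334
```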

Method 1: Using withColumn() and substring()

The most common approach uses withColumn() to add new columns with extracted substrings:

from pyspark.sql.functions import substring

result = reg_df \
    .withColumn("State", substring("LicenseNo", 1, 2)) \
    .withColumn("RegYear", substring("LicenseNo", 3, 4)) \
    .withColumn("RegID", substring("LicenseNo", 7, 8))

result.show()

Output:

+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+

Syntax

substring(column, start_position, length)
Parameter        Description
column           Column name or Column object
start_position   Starting position (1-based, not 0-based)
length           Number of characters to extract

Important: 1-Based Indexing

PySpark substring positions start at 1, not 0. This is different from Python's string slicing:

# ❌ Misleading: position 0 does not refer to the first character
substring("LicenseNo", 0, 2)

# ✅ Correct: position 1 gets the first two characters
substring("LicenseNo", 1, 2)

This is a common source of bugs when transitioning from Python to PySpark.
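
A pure-Python helper makes the mapping explicit. This is only an illustration of the 1-based convention, not PySpark code, and it assumes a positive start position:

```python
def spark_style_substring(s: str, pos: int, length: int) -> str:
    """Mimic PySpark's 1-based substring(pos, length) with Python slicing.

    Assumes pos >= 1; PySpark also accepts negative positions,
    which count from the end of the string.
    """
    return s[pos - 1 : pos - 1 + length]

print(spark_style_substring("MH201411094334", 1, 2))  # MH
print(spark_style_substring("MH201411094334", 3, 4))  # 2014
```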

Extracting from Multiple Columns

You can extract substrings from any string column in the same chain:

from pyspark.sql.functions import substring

result = reg_df \
    .withColumn("State", substring("LicenseNo", 1, 2)) \
    .withColumn("RegYear", substring("LicenseNo", 3, 4)) \
    .withColumn("RegID", substring("LicenseNo", 7, 8)) \
    .withColumn("ExpYear", substring("ExpiryDate", 1, 4)) \
    .withColumn("ExpMonth", substring("ExpiryDate", 6, 2)) \
    .withColumn("ExpDay", substring("ExpiryDate", 9, 2))

result.show()

Output:

+--------------+----------+-----+-------+--------+-------+--------+------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+--------------+----------+-----+-------+--------+-------+--------+------+
|MH201411094334|2024-11-19| MH| 2014|11094334| 2024| 11| 19|
|AR202027563890|2030-03-16| AR| 2020|27563890| 2030| 03| 16|
|UP202010345567|2035-12-30| UP| 2020|10345567| 2035| 12| 30|
|KN201822347800|2028-10-29| KN| 2018|22347800| 2028| 10| 29|
+--------------+----------+-----+-------+--------+-------+--------+------+

Method 2: Using substr() on Column Objects

Instead of importing the substring() function, you can call substr() directly on a Column object. This provides a more object-oriented style:

from pyspark.sql.functions import col

result = reg_df \
    .withColumn("State", col("LicenseNo").substr(1, 2)) \
    .withColumn("RegYear", col("LicenseNo").substr(3, 4)) \
    .withColumn("RegID", col("LicenseNo").substr(7, 8))

result.show()

Output:

+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+

Both substring() and substr() produce identical results and have the same performance. Choose whichever fits your coding style.

Method 3: Using select() for Derived Columns Only

If you only need the extracted columns and don't want to keep the originals, use select():

from pyspark.sql.functions import substring

result = reg_df.select(
    substring("LicenseNo", 1, 2).alias("State"),
    substring("LicenseNo", 3, 4).alias("RegYear"),
    substring("LicenseNo", 7, 8).alias("RegID"),
    substring("ExpiryDate", 1, 4).alias("ExpYear"),
    substring("ExpiryDate", 6, 2).alias("ExpMonth"),
    substring("ExpiryDate", 9, 2).alias("ExpDay"),
)

result.show()

Output:

+-----+-------+--------+-------+--------+------+
|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+-----+-------+--------+-------+--------+------+
| MH| 2014|11094334| 2024| 11| 19|
| AR| 2020|27563890| 2030| 03| 16|
| UP| 2020|10345567| 2035| 12| 30|
| KN| 2018|22347800| 2028| 10| 29|
+-----+-------+--------+-------+--------+------+
Tip: Use .alias() to name your derived columns. Without it, PySpark generates names like substring(LicenseNo, 1, 2), which are hard to work with downstream.

Method 4: Using SQL with spark.sql()

PySpark supports full SQL queries. Register the DataFrame as a temporary view and use SUBSTR():

reg_df.createOrReplaceTempView("registrations")

result = spark.sql("""
    SELECT
        SUBSTR(LicenseNo, 1, 2) AS State,
        SUBSTR(LicenseNo, 3, 4) AS RegYear,
        SUBSTR(LicenseNo, 7, 8) AS RegID,
        SUBSTR(ExpiryDate, 1, 4) AS ExpYear,
        SUBSTR(ExpiryDate, 6, 2) AS ExpMonth,
        SUBSTR(ExpiryDate, 9, 2) AS ExpDay
    FROM registrations
""")

result.show()

This produces the same output as the previous methods. SQL is especially useful when migrating existing queries to Spark or when your team is more comfortable with SQL syntax.
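
Spark SQL's SUBSTR also accepts a negative start position that counts from the end of the string (worth verifying against your Spark version's documentation). In plain Python, that corresponds to a negative slice:

```python
license_no = "MH201411094334"

# SUBSTR(LicenseNo, -8) in Spark SQL would take the last 8 characters;
# the plain-Python equivalent is a negative slice:
print(license_no[-8:])  # 11094334
```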

Method 5: Using selectExpr()

selectExpr() is a hybrid approach that lets you write SQL expressions without creating a temporary view:

result = reg_df.selectExpr(
    "LicenseNo",
    "ExpiryDate",
    "substring(LicenseNo, 1, 2) AS State",
    "substring(LicenseNo, 3, 4) AS RegYear",
    "substring(LicenseNo, 7, 8) AS RegID"
)

result.show()

Output:

+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+

Filtering Rows by Substring Match

Beyond extracting substrings, you often need to filter rows based on whether a column contains a specific substring:

Using contains()

from pyspark.sql.functions import col

# Find all registrations from Maharashtra (MH)
result = reg_df.filter(col("LicenseNo").contains("MH"))
result.show()

Output:

+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
+--------------+----------+
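
One subtlety: contains() matches the substring anywhere in the value, not just at the start. Plain Python shows the difference (the second license string here is made up for illustration):

```python
real = "MH201411094334"
made_up = "AR20MH563890"  # hypothetical value with "MH" mid-string

# contains()-style check: matches anywhere in the string
print("MH" in real, "MH" in made_up)  # True True

# startswith()-style check: anchored to the beginning
print(real.startswith("MH"), made_up.startswith("MH"))  # True False
```

If you specifically want the state code, anchoring the match to the start of the string is safer.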

Using startswith() and endswith()

from pyspark.sql.functions import col

# Licenses starting with "AR"
reg_df.filter(col("LicenseNo").startswith("AR")).show()

# Expiry dates ending with "30"
reg_df.filter(col("ExpiryDate").endswith("30")).show()

Using like() for Pattern Matching

from pyspark.sql.functions import col

# Licenses from states starting with "M" or "K"
result = reg_df.filter(col("LicenseNo").like("M%") | col("LicenseNo").like("K%"))
result.show()

Output:

+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|KN201822347800|2028-10-29|
+--------------+----------+
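
In SQL LIKE patterns, % matches any sequence of characters and _ matches exactly one. Python's fnmatch module plays the same role with * and ?, so a LIKE filter can be sketched locally by translating the wildcards (fnmatchcase keeps the match case-sensitive, as LIKE is in Spark by default):

```python
from fnmatch import fnmatchcase

licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# SQL LIKE 'M%' / LIKE 'K%' roughly correspond to fnmatch patterns 'M*' / 'K*'
matches = [s for s in licenses if fnmatchcase(s, "M*") or fnmatchcase(s, "K*")]
print(matches)  # ['MH201411094334', 'KN201822347800']
```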

Using rlike() for Regex Matching

from pyspark.sql.functions import col

# Licenses registered in 2020
result = reg_df.filter(col("LicenseNo").rlike("^[A-Z]{2}2020"))
result.show()

Output:

+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
+--------------+----------+
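
rlike() returns true if the pattern matches anywhere in the string (search semantics, not a full-string match), which is why the pattern above anchors itself with ^. Python's re.search behaves the same way:

```python
import re

pattern = re.compile(r"^[A-Z]{2}2020")  # same pattern passed to rlike()
licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# re.search mirrors rlike(): a match anywhere in the string satisfies
# the filter, though this particular pattern is anchored to the start
matches = [s for s in licenses if pattern.search(s)]
print(matches)  # ['AR202027563890', 'UP202010345567']
```

Note that rlike() uses Java regular expression syntax, which differs from Python's re in some corner cases (e.g. possessive quantifiers), so complex patterns should be tested in Spark itself.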

Comparison of Methods

Method                       Syntax Style      Keeps Original Columns   Best For
withColumn() + substring()   DataFrame API     ✅                        Step-by-step transformations
col().substr()               Object-oriented   ✅                        Clean chaining with col()
select() + substring()       DataFrame API     ❌ (only selected)        Derived-only output
spark.sql()                  SQL               Configurable             SQL-heavy teams, complex logic
selectExpr()                 Hybrid SQL/API    Configurable             Quick SQL expressions without views

Summary

PySpark provides multiple flexible ways to extract and check for substrings in DataFrame columns:

  • Use withColumn() + substring() for the most readable, step-by-step approach.
  • Use col().substr() for a clean, object-oriented style.
  • Use select() when you only need derived columns.
  • Use spark.sql() or selectExpr() when SQL syntax is preferred.
  • Use contains(), startswith(), like(), or rlike() to filter rows based on substring matches.

Always remember that PySpark uses 1-based indexing for substring positions, and prefer built-in string functions over UDFs for better performance in distributed execution.