Python PySpark: How to Check for a Substring in a PySpark DataFrame
Working with string data is extremely common in PySpark, especially when processing logs, identifiers, or semi-structured text. One frequent requirement is to check for or extract substrings from columns in a PySpark DataFrame, whether you're parsing composite fields, extracting codes from identifiers, or deriving new analytical columns.
In this guide, you'll learn multiple methods to extract and work with substrings in PySpark, including column-based APIs, SQL-style expressions, and filtering based on substring matches.
What Is a Substring?
A substring is a contiguous sequence of characters within a larger string. For example:
- Full string: "MH201411094334"
- Substring at position 1, length 2: "MH" (the state code)
- Substring at position 3, length 4: "2014" (the registration year)
In PySpark, substring operations are commonly used to:
- Extract codes from composite identifiers
- Parse dates stored as strings
- Split fixed-width fields into meaningful columns
- Filter rows based on partial string matches
Creating a Sample DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("substring_demo").getOrCreate()
columns = ["LicenseNo", "ExpiryDate"]
data = [
("MH201411094334", "2024-11-19"),
("AR202027563890", "2030-03-16"),
("UP202010345567", "2035-12-30"),
("KN201822347800", "2028-10-29"),
]
reg_df = spark.createDataFrame(data=data, schema=columns)
reg_df.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
|KN201822347800|2028-10-29|
+--------------+----------+
The LicenseNo column contains a composite identifier:
- Characters 1–2: State code (e.g., MH)
- Characters 3–6: Registration year (e.g., 2014)
- Characters 7–14: Registration ID (e.g., 11094334)
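Before running the extraction on a cluster, it can help to sanity-check a fixed-width layout like this locally with plain Python slicing. The sketch below is not part of PySpark; note that Python slices are 0-based and end-exclusive, while PySpark's substring positions are 1-based:

```python
def parse_license(license_no):
    """Split a 14-character license number using the fixed-width layout above.

    Characters 1-2 of the spec become license_no[0:2], characters 3-6
    become license_no[2:6], and characters 7-14 become license_no[6:14].
    """
    return {
        "state": license_no[0:2],     # characters 1-2: state code
        "reg_year": license_no[2:6],  # characters 3-6: registration year
        "reg_id": license_no[6:14],   # characters 7-14: registration ID
    }

print(parse_license("MH201411094334"))
# {'state': 'MH', 'reg_year': '2014', 'reg_id': '11094334'}
```

Once the slice boundaries look right locally, translating them to 1-based substring() calls is mechanical.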
Method 1: Using withColumn() and substring()
The most common approach uses withColumn() to add new columns with extracted substrings:
from pyspark.sql.functions import substring
result = reg_df \
.withColumn("State", substring("LicenseNo", 1, 2)) \
.withColumn("RegYear", substring("LicenseNo", 3, 4)) \
.withColumn("RegID", substring("LicenseNo", 7, 8))
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Syntax
substring(column, start_position, length)
| Parameter | Description |
|---|---|
| column | Column name or Column object |
| start_position | Starting position (1-based, not 0-based) |
| length | Number of characters to extract |
PySpark substring positions start at 1, not 0. This is different from Python's string slicing:
# ❌ Misleading: Spark does not raise an error here — it treats position 0
# like position 1 — so this quietly returns the same result as below
substring("LicenseNo", 0, 2)
# ✅ Correct: position 1 gets the first two characters
substring("LicenseNo", 1, 2)
This is a common source of bugs when transitioning from Python to PySpark.
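One way to sidestep off-by-one mistakes is to compute Spark's (start, length) pair from a familiar Python slice. This small helper is a sketch for illustration, not part of PySpark's API:

```python
def slice_to_substring_args(start, stop):
    """Convert a 0-based, end-exclusive Python slice [start:stop] into the
    1-based (pos, len) pair expected by PySpark's substring()."""
    return start + 1, stop - start

# s[0:2] in Python corresponds to substring(col, 1, 2) in PySpark
print(slice_to_substring_args(0, 2))   # (1, 2)
# s[2:6] corresponds to substring(col, 3, 4)
print(slice_to_substring_args(2, 6))   # (3, 4)
```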
Extracting from Multiple Columns
You can extract substrings from any string column in the same chain:
from pyspark.sql.functions import substring
result = reg_df \
.withColumn("State", substring("LicenseNo", 1, 2)) \
.withColumn("RegYear", substring("LicenseNo", 3, 4)) \
.withColumn("RegID", substring("LicenseNo", 7, 8)) \
.withColumn("ExpYear", substring("ExpiryDate", 1, 4)) \
.withColumn("ExpMonth", substring("ExpiryDate", 6, 2)) \
.withColumn("ExpDay", substring("ExpiryDate", 9, 2))
result.show()
Output:
+--------------+----------+-----+-------+--------+-------+--------+------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+--------------+----------+-----+-------+--------+-------+--------+------+
|MH201411094334|2024-11-19| MH| 2014|11094334| 2024| 11| 19|
|AR202027563890|2030-03-16| AR| 2020|27563890| 2030| 03| 16|
|UP202010345567|2035-12-30| UP| 2020|10345567| 2035| 12| 30|
|KN201822347800|2028-10-29| KN| 2018|22347800| 2028| 10| 29|
+--------------+----------+-----+-------+--------+-------+--------+------+
Method 2: Using substr() on Column Objects
Instead of importing the substring() function, you can call substr() directly on a Column object. This provides a more object-oriented style:
from pyspark.sql.functions import col
result = reg_df \
.withColumn("State", col("LicenseNo").substr(1, 2)) \
.withColumn("RegYear", col("LicenseNo").substr(3, 4)) \
.withColumn("RegID", col("LicenseNo").substr(7, 8))
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Both substring() and substr() produce identical results and have the same performance. Choose whichever fits your coding style.
Method 3: Using select() for Derived Columns Only
If you only need the extracted columns and don't want to keep the originals, use select():
from pyspark.sql.functions import substring
result = reg_df.select(
substring("LicenseNo", 1, 2).alias("State"),
substring("LicenseNo", 3, 4).alias("RegYear"),
substring("LicenseNo", 7, 8).alias("RegID"),
substring("ExpiryDate", 1, 4).alias("ExpYear"),
substring("ExpiryDate", 6, 2).alias("ExpMonth"),
substring("ExpiryDate", 9, 2).alias("ExpDay"),
)
result.show()
Output:
+-----+-------+--------+-------+--------+------+
|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+-----+-------+--------+-------+--------+------+
| MH| 2014|11094334| 2024| 11| 19|
| AR| 2020|27563890| 2030| 03| 16|
| UP| 2020|10345567| 2035| 12| 30|
| KN| 2018|22347800| 2028| 10| 29|
+-----+-------+--------+-------+--------+------+
Use .alias() to name your derived columns. Without it, PySpark generates default names like substring(LicenseNo, 1, 2), which are awkward to reference downstream.
Method 4: Using SQL with spark.sql()
PySpark supports full SQL queries. Register the DataFrame as a temporary view and use SUBSTR():
reg_df.createOrReplaceTempView("registrations")
result = spark.sql("""
SELECT
SUBSTR(LicenseNo, 1, 2) AS State,
SUBSTR(LicenseNo, 3, 4) AS RegYear,
SUBSTR(LicenseNo, 7, 8) AS RegID,
SUBSTR(ExpiryDate, 1, 4) AS ExpYear,
SUBSTR(ExpiryDate, 6, 2) AS ExpMonth,
SUBSTR(ExpiryDate, 9, 2) AS ExpDay
FROM registrations
""")
result.show()
This produces the same output as the previous methods. In Spark SQL, SUBSTR and SUBSTRING are synonyms. SQL is especially useful when migrating existing queries to Spark or when your team is more comfortable with SQL syntax.
Method 5: Using selectExpr()
selectExpr() is a hybrid approach that lets you write SQL expressions without creating a temporary view:
result = reg_df.selectExpr(
"LicenseNo",
"ExpiryDate",
"substring(LicenseNo, 1, 2) AS State",
"substring(LicenseNo, 3, 4) AS RegYear",
"substring(LicenseNo, 7, 8) AS RegID"
)
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Filtering Rows by Substring Match
Beyond extracting substrings, you often need to filter rows based on whether a column contains a specific substring:
Using contains()
from pyspark.sql.functions import col
# Find all registrations from Maharashtra (MH)
result = reg_df.filter(col("LicenseNo").contains("MH"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
+--------------+----------+
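Keep in mind that contains() matches the substring anywhere in the value, much like Python's in operator, not just at the start. The sketch below illustrates this with plain Python; the value "KN2018MH347800" is a hypothetical example invented to show a mid-string match:

```python
# contains() behaves like Python's "in": it matches anywhere in the string.
licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]
print([s for s in licenses if "MH" in s])   # ['MH201411094334']

# Caveat: a hypothetical value with "MH" in the middle would also match,
# so for state-code prefixes a startswith-style check is safer.
print("MH" in "KN2018MH347800")             # True
print("KN2018MH347800".startswith("MH"))    # False
```

For prefix matches like state codes, prefer startswith(), shown next.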
Using startswith() and endswith()
from pyspark.sql.functions import col
# Licenses starting with "AR"
reg_df.filter(col("LicenseNo").startswith("AR")).show()
# Expiry dates ending with "30"
reg_df.filter(col("ExpiryDate").endswith("30")).show()
Using like() for Pattern Matching
from pyspark.sql.functions import col
# Licenses from states starting with "M" or "K"
result = reg_df.filter(col("LicenseNo").like("M%") | col("LicenseNo").like("K%"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|KN201822347800|2028-10-29|
+--------------+----------+
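In SQL LIKE patterns, % matches any sequence of characters and _ matches a single character. As a rough local analogue (a sketch for previewing patterns, not how Spark evaluates LIKE), Python's fnmatch module uses * and ? for the same roles:

```python
import fnmatch

licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# SQL LIKE 'M%' roughly corresponds to the shell-style pattern 'M*',
# and LIKE 'K%' to 'K*'
matches = [s for s in licenses
           if fnmatch.fnmatch(s, "M*") or fnmatch.fnmatch(s, "K*")]
print(matches)  # ['MH201411094334', 'KN201822347800']
```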
Using rlike() for Regex Matching
from pyspark.sql.functions import col
# Licenses registered in 2020
result = reg_df.filter(col("LicenseNo").rlike("^[A-Z]{2}2020"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
+--------------+----------+
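rlike() returns true when the regex matches anywhere in the string (like Python's re.search), which is why the pattern above anchors with ^. You can preview the same check locally with Python's re module before running it on a cluster:

```python
import re

licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# Two uppercase letters followed by the year 2020, anchored to the start
pattern = re.compile(r"^[A-Z]{2}2020")
matches = [s for s in licenses if pattern.search(s)]
print(matches)  # ['AR202027563890', 'UP202010345567']
```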
Comparison of Methods
| Method | Syntax Style | Keeps Original Columns | Best For |
|---|---|---|---|
| withColumn() + substring() | DataFrame API | ✅ | Step-by-step transformations |
| col().substr() | Object-oriented | ✅ | Clean chaining with col() |
| select() + substring() | DataFrame API | ❌ (only selected) | Derived-only output |
| spark.sql() | SQL | Configurable | SQL-heavy teams, complex logic |
| selectExpr() | Hybrid SQL/API | Configurable | Quick SQL expressions without views |
Summary
PySpark provides multiple flexible ways to extract and check for substrings in DataFrame columns:
- Use withColumn() + substring() for the most readable, step-by-step approach.
- Use col().substr() for a clean, object-oriented style.
- Use select() when you only need derived columns.
- Use spark.sql() or selectExpr() when SQL syntax is preferred.
- Use contains(), startswith(), like(), or rlike() to filter rows based on substring matches.
Always remember that PySpark uses 1-based indexing for substring positions, and prefer built-in string functions over UDFs for better performance in distributed execution.