Python PySpark: How to Check for a Substring in a PySpark DataFrame
Working with string data is extremely common in PySpark, especially when processing logs, identifiers, or semi-structured text. One frequent requirement is to check for or extract substrings from columns in a PySpark DataFrame, whether you're parsing composite fields, extracting codes from identifiers, or deriving new analytical columns.
In this guide, you'll learn multiple methods to extract and work with substrings in PySpark, including column-based APIs, SQL-style expressions, and filtering based on substring matches.
What Is a Substring?
A substring is a contiguous sequence of characters within a larger string. For example:
- Full string: "MH201411094334"
- Substring at position 1, length 2: "MH" (the state code)
- Substring at position 3, length 4: "2014" (the registration year)
In PySpark, substring operations are commonly used to:
- Extract codes from composite identifiers
- Parse dates stored as strings
- Split fixed-width fields into meaningful columns
- Filter rows based on partial string matches
Creating a Sample DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("substring_demo").getOrCreate()
columns = ["LicenseNo", "ExpiryDate"]
data = [
("MH201411094334", "2024-11-19"),
("AR202027563890", "2030-03-16"),
("UP202010345567", "2035-12-30"),
("KN201822347800", "2028-10-29"),
]
reg_df = spark.createDataFrame(data=data, schema=columns)
reg_df.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
|KN201822347800|2028-10-29|
+--------------+----------+
The LicenseNo column contains a composite identifier:
- Characters 1–2: State code (e.g., MH)
- Characters 3–6: Registration year (e.g., 2014)
- Characters 7–14: Registration ID (e.g., 11094334)
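Before running the extraction on a cluster, it can help to sanity-check a fixed-width layout like this locally with plain Python slicing. The sketch below is not part of PySpark; note that Python slices are 0-based and end-exclusive, while PySpark's substring positions are 1-based:

```python
def parse_license(license_no):
    """Split a 14-character license number using the fixed-width layout above.

    Characters 1-2 of the spec become license_no[0:2], characters 3-6
    become license_no[2:6], and characters 7-14 become license_no[6:14].
    """
    return {
        "state": license_no[0:2],     # characters 1-2: state code
        "reg_year": license_no[2:6],  # characters 3-6: registration year
        "reg_id": license_no[6:14],   # characters 7-14: registration ID
    }

print(parse_license("MH201411094334"))
# {'state': 'MH', 'reg_year': '2014', 'reg_id': '11094334'}
```

Once the slice boundaries look right locally, translating them to 1-based substring() calls is mechanical.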
Method 1: Using withColumn() and substring()
The most common approach uses withColumn() to add new columns with extracted substrings:
from pyspark.sql.functions import substring
result = reg_df \
.withColumn("State", substring("LicenseNo", 1, 2)) \
.withColumn("RegYear", substring("LicenseNo", 3, 4)) \
.withColumn("RegID", substring("LicenseNo", 7, 8))
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Syntax
substring(column, start_position, length)
| Parameter | Description |
|---|---|
| column | Column name or Column object |
| start_position | Starting position (1-based, not 0-based) |
| length | Number of characters to extract |
PySpark substring positions start at 1, not 0. This is different from Python's string slicing:
# ❌ Misleading: Spark does not raise an error here — it treats position 0
# like position 1 — so this quietly returns the same result as below
substring("LicenseNo", 0, 2)
# ✅ Correct: position 1 gets the first two characters
substring("LicenseNo", 1, 2)
This is a common source of bugs when transitioning from Python to PySpark.
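One way to sidestep off-by-one mistakes is to compute Spark's (start, length) pair from a familiar Python slice. This small helper is a sketch for illustration, not part of PySpark's API:

```python
def slice_to_substring_args(start, stop):
    """Convert a 0-based, end-exclusive Python slice [start:stop] into the
    1-based (pos, len) pair expected by PySpark's substring()."""
    return start + 1, stop - start

# s[0:2] in Python corresponds to substring(col, 1, 2) in PySpark
print(slice_to_substring_args(0, 2))   # (1, 2)
# s[2:6] corresponds to substring(col, 3, 4)
print(slice_to_substring_args(2, 6))   # (3, 4)
```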
Extracting from Multiple Columns
You can extract substrings from any string column in the same chain:
from pyspark.sql.functions import substring
result = reg_df \
.withColumn("State", substring("LicenseNo", 1, 2)) \
.withColumn("RegYear", substring("LicenseNo", 3, 4)) \
.withColumn("RegID", substring("LicenseNo", 7, 8)) \
.withColumn("ExpYear", substring("ExpiryDate", 1, 4)) \
.withColumn("ExpMonth", substring("ExpiryDate", 6, 2)) \
.withColumn("ExpDay", substring("ExpiryDate", 9, 2))
result.show()
Output:
+--------------+----------+-----+-------+--------+-------+--------+------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+--------------+----------+-----+-------+--------+-------+--------+------+
|MH201411094334|2024-11-19| MH| 2014|11094334| 2024| 11| 19|
|AR202027563890|2030-03-16| AR| 2020|27563890| 2030| 03| 16|
|UP202010345567|2035-12-30| UP| 2020|10345567| 2035| 12| 30|
|KN201822347800|2028-10-29| KN| 2018|22347800| 2028| 10| 29|
+--------------+----------+-----+-------+--------+-------+--------+------+
Method 2: Using substr() on Column Objects
Instead of importing the substring() function, you can call substr() directly on a Column object. This provides a more object-oriented style:
from pyspark.sql.functions import col
result = reg_df \
.withColumn("State", col("LicenseNo").substr(1, 2)) \
.withColumn("RegYear", col("LicenseNo").substr(3, 4)) \
.withColumn("RegID", col("LicenseNo").substr(7, 8))
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Both substring() and substr() produce identical results and have the same performance. Choose whichever fits your coding style.
Method 3: Using select() for Derived Columns Only
If you only need the extracted columns and don't want to keep the originals, use select():
from pyspark.sql.functions import substring
result = reg_df.select(
substring("LicenseNo", 1, 2).alias("State"),
substring("LicenseNo", 3, 4).alias("RegYear"),
substring("LicenseNo", 7, 8).alias("RegID"),
substring("ExpiryDate", 1, 4).alias("ExpYear"),
substring("ExpiryDate", 6, 2).alias("ExpMonth"),
substring("ExpiryDate", 9, 2).alias("ExpDay"),
)
result.show()
Output:
+-----+-------+--------+-------+--------+------+
|State|RegYear| RegID|ExpYear|ExpMonth|ExpDay|
+-----+-------+--------+-------+--------+------+
| MH| 2014|11094334| 2024| 11| 19|
| AR| 2020|27563890| 2030| 03| 16|
| UP| 2020|10345567| 2035| 12| 30|
| KN| 2018|22347800| 2028| 10| 29|
+-----+-------+--------+-------+--------+------+
Use .alias() to name your derived columns. Without it, PySpark generates default names like substring(LicenseNo, 1, 2), which are awkward to reference downstream.
Method 4: Using SQL with spark.sql()
PySpark supports full SQL queries. Register the DataFrame as a temporary view and use SUBSTR():
reg_df.createOrReplaceTempView("registrations")
result = spark.sql("""
SELECT
SUBSTR(LicenseNo, 1, 2) AS State,
SUBSTR(LicenseNo, 3, 4) AS RegYear,
SUBSTR(LicenseNo, 7, 8) AS RegID,
SUBSTR(ExpiryDate, 1, 4) AS ExpYear,
SUBSTR(ExpiryDate, 6, 2) AS ExpMonth,
SUBSTR(ExpiryDate, 9, 2) AS ExpDay
FROM registrations
""")
result.show()
This produces the same output as the previous methods. In Spark SQL, SUBSTR and SUBSTRING are synonyms. SQL is especially useful when migrating existing queries to Spark or when your team is more comfortable with SQL syntax.
Method 5: Using selectExpr()
selectExpr() is a hybrid approach that lets you write SQL expressions without creating a temporary view:
result = reg_df.selectExpr(
"LicenseNo",
"ExpiryDate",
"substring(LicenseNo, 1, 2) AS State",
"substring(LicenseNo, 3, 4) AS RegYear",
"substring(LicenseNo, 7, 8) AS RegID"
)
result.show()
Output:
+--------------+----------+-----+-------+--------+
| LicenseNo|ExpiryDate|State|RegYear| RegID|
+--------------+----------+-----+-------+--------+
|MH201411094334|2024-11-19| MH| 2014|11094334|
|AR202027563890|2030-03-16| AR| 2020|27563890|
|UP202010345567|2035-12-30| UP| 2020|10345567|
|KN201822347800|2028-10-29| KN| 2018|22347800|
+--------------+----------+-----+-------+--------+
Filtering Rows by Substring Match
Beyond extracting substrings, you often need to filter rows based on whether a column contains a specific substring:
Using contains()
from pyspark.sql.functions import col
# Find all registrations from Maharashtra (MH)
result = reg_df.filter(col("LicenseNo").contains("MH"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
+--------------+----------+
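Keep in mind that contains() matches the substring anywhere in the value, much like Python's in operator, not just at the start. The sketch below illustrates this with plain Python; the value "KN2018MH347800" is a hypothetical example invented to show a mid-string match:

```python
# contains() behaves like Python's "in": it matches anywhere in the string.
licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]
print([s for s in licenses if "MH" in s])   # ['MH201411094334']

# Caveat: a hypothetical value with "MH" in the middle would also match,
# so for state-code prefixes a startswith-style check is safer.
print("MH" in "KN2018MH347800")             # True
print("KN2018MH347800".startswith("MH"))    # False
```

For prefix matches like state codes, prefer startswith(), shown next.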
Using startswith() and endswith()
from pyspark.sql.functions import col
# Licenses starting with "AR"
reg_df.filter(col("LicenseNo").startswith("AR")).show()
# Expiry dates ending with "30"
reg_df.filter(col("ExpiryDate").endswith("30")).show()
Using like() for Pattern Matching
from pyspark.sql.functions import col
# Licenses from states starting with "M" or "K"
result = reg_df.filter(col("LicenseNo").like("M%") | col("LicenseNo").like("K%"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|MH201411094334|2024-11-19|
|KN201822347800|2028-10-29|
+--------------+----------+
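In SQL LIKE patterns, % matches any sequence of characters and _ matches a single character. As a rough local analogue (a sketch for previewing patterns, not how Spark evaluates LIKE), Python's fnmatch module uses * and ? for the same roles:

```python
import fnmatch

licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# SQL LIKE 'M%' roughly corresponds to the shell-style pattern 'M*',
# and LIKE 'K%' to 'K*'
matches = [s for s in licenses
           if fnmatch.fnmatch(s, "M*") or fnmatch.fnmatch(s, "K*")]
print(matches)  # ['MH201411094334', 'KN201822347800']
```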
Using rlike() for Regex Matching
from pyspark.sql.functions import col
# Licenses registered in 2020
result = reg_df.filter(col("LicenseNo").rlike("^[A-Z]{2}2020"))
result.show()
Output:
+--------------+----------+
| LicenseNo|ExpiryDate|
+--------------+----------+
|AR202027563890|2030-03-16|
|UP202010345567|2035-12-30|
+--------------+----------+
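rlike() returns true when the regex matches anywhere in the string (like Python's re.search), which is why the pattern above anchors with ^. You can preview the same check locally with Python's re module before running it on a cluster:

```python
import re

licenses = ["MH201411094334", "AR202027563890", "UP202010345567", "KN201822347800"]

# Two uppercase letters followed by the year 2020, anchored to the start
pattern = re.compile(r"^[A-Z]{2}2020")
matches = [s for s in licenses if pattern.search(s)]
print(matches)  # ['AR202027563890', 'UP202010345567']
```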
Comparison of Methods
| Method | Syntax Style | Keeps Original Columns | Best For |
|---|---|---|---|
| withColumn() + substring() | DataFrame API | ✅ | Step-by-step transformations |
| col().substr() | Object-oriented | ✅ | Clean chaining with col() |
| select() + substring() | DataFrame API | ❌ (only selected) | Derived-only output |
| spark.sql() | SQL | Configurable | SQL-heavy teams, complex logic |
| selectExpr() | Hybrid SQL/API | Configurable | Quick SQL expressions without views |
Summary
PySpark provides multiple flexible ways to extract and check for substrings in DataFrame columns:
- Use withColumn() + substring() for the most readable, step-by-step approach.
- Use col().substr() for a clean, object-oriented style.
- Use select() when you only need derived columns.
- Use spark.sql() or selectExpr() when SQL syntax is preferred.
- Use contains(), startswith(), like(), or rlike() to filter rows based on substring matches.
Always remember that PySpark uses 1-based indexing for substring positions, and prefer built-in string functions over UDFs for better performance in distributed execution.