Python PySpark: How to Use isNotNull() in PySpark to Filter Null Values
Null values are a common challenge in data processing - they can appear due to missing entries, data transfer errors, or schema mismatches. Identifying and filtering these null values is essential for accurate analysis. PySpark provides the isNotNull() method on the Column class to check for non-null values, making it straightforward to clean your DataFrames.
This guide explains how to use isNotNull() to filter rows, handle multiple columns, and understand its counterpart isNull().
Understanding isNotNull()
The isNotNull() method is called on a DataFrame column and returns a boolean expression:
- True if the value in that row is not null
- False if the value is null
This boolean expression is then passed to the filter() (or where()) method to keep only rows that meet the condition.
# Basic syntax
df.filter(df["column_name"].isNotNull())
# Equivalent alternative syntax
df.filter(df.column_name.isNotNull())
Setting Up a Sample DataFrame
All examples use the following DataFrame containing None values (Python's representation of null):
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName("NullFiltering").getOrCreate()
data = [
Row(name="Alpha", age=20, marks=54),
Row(name="Beta", age=None, marks=None),
Row(name="Omega", age=17, marks=85),
Row(name="Sigma", age=None, marks=62),
Row(name="Delta", age=22, marks=None)
]
df = spark.createDataFrame(data)
df.show()
Output:
+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
| Beta|null| null|
|Omega|  17|   85|
|Sigma|null|   62|
|Delta|  22| null|
+-----+----+-----+
Three rows contain null values in the age column, the marks column, or both.
Filtering Nulls from a Single Column
Keeping Non-Null Rows in the age Column
filtered_df = df.filter(df["age"].isNotNull())
filtered_df.show()
Output:
+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
|Delta| 22| null|
+-----+---+-----+
Rows where age is null (Beta and Sigma) are removed. Notice that Delta is kept because its age is not null, even though its marks value is null.
Keeping Non-Null Rows in the marks Column
filtered_df = df.filter(df["marks"].isNotNull())
filtered_df.show()
Output:
+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
|Omega|  17|   85|
|Sigma|null|   62|
+-----+----+-----+
Only rows where marks is not null are retained.
Filtering Nulls from Multiple Columns
Using the AND Operator (&)
To keep only rows where all specified columns are non-null, combine conditions with &:
filtered_df = df.filter(
df["age"].isNotNull() & df["marks"].isNotNull()
)
filtered_df.show()
Output:
+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
+-----+---+-----+
Only Alpha and Omega have non-null values in both age and marks.
Using the OR Operator (|)
To keep rows where at least one column is non-null:
filtered_df = df.filter(
df["age"].isNotNull() | df["marks"].isNotNull()
)
filtered_df.show()
Output:
+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
|Omega|  17|   85|
|Sigma|null|   62|
|Delta|  22| null|
+-----+----+-----+
Only Beta is removed because both its age and marks are null.
When combining multiple column conditions, always use & (AND) and | (OR) operators with parentheses around each condition. Python's and and or keywords do not work with PySpark column expressions.
Using isNull() - The Opposite
The isNull() method does the reverse - it returns True when a value is null. Use it to find rows with missing data:
# Find rows where age is null
null_rows = df.filter(df["age"].isNull())
null_rows.show()
Output:
+-----+----+-----+
| name| age|marks|
+-----+----+-----+
| Beta|null| null|
|Sigma|null|   62|
+-----+----+-----+
This is useful for investigating which records have missing data before deciding how to handle them.
Using dropna() as an Alternative
PySpark's dropna() method provides a more concise way to remove rows with null values:
# Drop rows where ANY column has a null value
clean_df = df.dropna()
clean_df.show()
Output:
+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
+-----+---+-----+
Targeting Specific Columns with dropna()
# Drop rows where 'age' column has a null value
clean_df = df.dropna(subset=["age"])
clean_df.show()
Output:
+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
|Delta| 22| null|
+-----+---+-----+
dropna() vs. isNotNull()
| Feature | isNotNull() | dropna() |
|---|---|---|
| Syntax | df.filter(df["col"].isNotNull()) | df.dropna(subset=["col"]) |
| Multiple columns | Manual with & / \| | df.dropna(subset=["col1", "col2"]) |
| Custom logic (AND/OR) | Yes | Limited (how='any' or how='all') |
| Combine with other conditions | Yes | No |
Use isNotNull() when you need to combine null checks with other filtering conditions. Use dropna() for straightforward null removal.
Common Mistake: Using Python's and Instead of &
A frequent error is using Python's and keyword to combine PySpark column conditions:
# WRONG: Python's 'and' doesn't work with PySpark columns
try:
filtered = df.filter(df["age"].isNotNull() and df["marks"].isNotNull())
except ValueError as e:
print(f"Error: {e}")
Output:
Error: Cannot convert column into bool...
The correct approach - use & with parentheses:
# CORRECT: use & operator with parentheses
filtered = df.filter(
(df["age"].isNotNull()) & (df["marks"].isNotNull())
)
filtered.show()
Never use Python's and, or, or not keywords with PySpark column expressions. Use & (AND), | (OR), and ~ (NOT) instead. Parentheses around each condition are required due to Python's operator precedence.
Practical Example: Cleaning Before Analysis
Here is a typical workflow that filters nulls before performing calculations:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import avg, count
spark = SparkSession.builder.appName("NullFiltering").getOrCreate()
data = [
Row(name="Alpha", age=20, marks=54),
Row(name="Beta", age=None, marks=None),
Row(name="Omega", age=17, marks=85),
Row(name="Sigma", age=None, marks=62),
Row(name="Delta", age=22, marks=None)
]
df = spark.createDataFrame(data)
# Filter out rows with null marks before calculating average
clean_df = df.filter(df["marks"].isNotNull())
clean_df.select(
count("marks").alias("count"),
avg("marks").alias("average_marks")
).show()
Output:
+-----+-------------+
|count|average_marks|
+-----+-------------+
|    3|         67.0|
+-----+-------------+
Quick Reference
| Operation | Code |
|---|---|
| Filter non-null in one column | df.filter(df["col"].isNotNull()) |
| Filter null in one column | df.filter(df["col"].isNull()) |
| Filter non-null in multiple columns (AND) | df.filter((df["c1"].isNotNull()) & (df["c2"].isNotNull())) |
| Filter non-null in any column (OR) | df.filter((df["c1"].isNotNull()) \| (df["c2"].isNotNull())) |
| Drop rows with any null | df.dropna() |
| Drop rows with null in specific columns | df.dropna(subset=["col1", "col2"]) |
| Count nulls in a column | df.filter(df["col"].isNull()).count() |
The isNotNull() method is a fundamental tool for data cleaning in PySpark. Combined with filter() and logical operators, it gives you precise control over which rows to keep or discard based on null values across one or more columns.