Python PySpark: How to Use isNotNull() in PySpark to Filter Null Values

Null values are a common challenge in data processing - they can appear due to missing entries, data transfer errors, or schema mismatches. Identifying and filtering these null values is essential for accurate analysis. PySpark provides the isNotNull() method on the Column class to check for non-null values, making it straightforward to clean your DataFrames.

This guide explains how to use isNotNull() to filter rows, handle multiple columns, and understand its counterpart isNull().

Understanding isNotNull()

The isNotNull() method is called on a DataFrame column and returns a boolean expression:

  • True if the value in that row is not null
  • False if the value is null

This boolean expression is then passed to the filter() (or where()) method to keep only rows that meet the condition.

# Basic syntax
df.filter(df["column_name"].isNotNull())

# Equivalent alternative syntax
df.filter(df.column_name.isNotNull())

Setting Up a Sample DataFrame

All examples use the following DataFrame containing None values (Python's representation of null):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("NullFiltering").getOrCreate()

data = [
    Row(name="Alpha", age=20, marks=54),
    Row(name="Beta", age=None, marks=None),
    Row(name="Omega", age=17, marks=85),
    Row(name="Sigma", age=None, marks=62),
    Row(name="Delta", age=22, marks=None)
]

df = spark.createDataFrame(data)
df.show()

Output:

+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
| Beta|null| null|
|Omega|  17|   85|
|Sigma|null|   62|
|Delta|  22| null|
+-----+----+-----+

Three rows contain null values in the age column, the marks column, or both.

Filtering Nulls from a Single Column

Keeping Non-Null Rows in the age Column

filtered_df = df.filter(df["age"].isNotNull())
filtered_df.show()

Output:

+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
|Delta| 22| null|
+-----+---+-----+

Rows where age is null (Beta and Sigma) are removed. Notice that Delta is kept because its age is not null - even though its marks value is null.

Keeping Non-Null Rows in the marks Column

filtered_df = df.filter(df["marks"].isNotNull())
filtered_df.show()

Output:

+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
|Omega|  17|   85|
|Sigma|null|   62|
+-----+----+-----+

Only rows where marks is not null are retained.

Filtering Nulls from Multiple Columns

Using the AND Operator (&)

To keep only rows where all specified columns are non-null, combine conditions with &:

filtered_df = df.filter(
    (df["age"].isNotNull()) & (df["marks"].isNotNull())
)
filtered_df.show()

Output:

+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
+-----+---+-----+

Only Alpha and Omega have non-null values in both age and marks.

Using the OR Operator (|)

To keep rows where at least one column is non-null:

filtered_df = df.filter(
    (df["age"].isNotNull()) | (df["marks"].isNotNull())
)
filtered_df.show()

Output:

+-----+----+-----+
| name| age|marks|
+-----+----+-----+
|Alpha|  20|   54|
|Omega|  17|   85|
|Sigma|null|   62|
|Delta|  22| null|
+-----+----+-----+

Only Beta is removed because both its age and marks are null.

tip

When combining multiple column conditions, always use & (AND) and | (OR) operators with parentheses around each condition. Python's and and or keywords do not work with PySpark column expressions.
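The reason is that Python's and/or keywords cannot be overloaded - they force each operand through bool(), which a lazy column expression cannot meaningfully answer - while & maps to the overloadable __and__ method. A simplified, hypothetical stand-in class (not the real PySpark Column) illustrates the mechanism:

```python
# Hypothetical, simplified stand-in for PySpark's Column class
class FakeColumn:
    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # '&' is overloadable, so a combined expression can be built lazily
        return FakeColumn(f"({self.expr} AND {other.expr})")

    def __bool__(self):
        # 'and'/'or' call bool() on the operand, which cannot be deferred;
        # the real Column raises a similar error here
        raise ValueError("Cannot convert column into bool")


a = FakeColumn("age IS NOT NULL")
b = FakeColumn("marks IS NOT NULL")

print((a & b).expr)  # builds "(age IS NOT NULL AND marks IS NOT NULL)"

try:
    a and b  # forces bool(a), which raises
except ValueError as e:
    print(f"Error: {e}")
```

The same logic explains why | must replace or, and ~ must replace not.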

Using isNull() - The Opposite

The isNull() method does the reverse - it returns True when a value is null. Use it to find rows with missing data:

# Find rows where age is null
null_rows = df.filter(df["age"].isNull())
null_rows.show()

Output:

+-----+----+-----+
| name| age|marks|
+-----+----+-----+
| Beta|null| null|
|Sigma|null|   62|
+-----+----+-----+

This is useful for investigating which records have missing data before deciding how to handle them.

Using dropna() as an Alternative

PySpark's dropna() method provides a more concise way to remove rows with null values:

# Drop rows where ANY column has a null value
clean_df = df.dropna()
clean_df.show()

Output:

+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
+-----+---+-----+

Targeting Specific Columns with dropna()

# Drop rows where 'age' column has a null value
clean_df = df.dropna(subset=["age"])
clean_df.show()

Output:

+-----+---+-----+
| name|age|marks|
+-----+---+-----+
|Alpha| 20|   54|
|Omega| 17|   85|
|Delta| 22| null|
+-----+---+-----+

dropna() vs. isNotNull()

| Feature                       | isNotNull()                      | dropna()                         |
|-------------------------------|----------------------------------|----------------------------------|
| Syntax                        | df.filter(df["col"].isNotNull()) | df.dropna(subset=["col"])        |
| Multiple columns              | Combine manually with & / \|     | Pass a list to subset=           |
| Custom logic (AND/OR)         | Yes                              | Limited (how='any' or how='all') |
| Combine with other conditions | Yes                              | No                               |

Use isNotNull() when you need to combine null checks with other filtering conditions. Use dropna() for straightforward null removal.

Common Mistake: Using Python's and Instead of &

A frequent error is using Python's and keyword to combine PySpark column conditions:

# WRONG: Python's 'and' doesn't work with PySpark columns
try:
    filtered = df.filter(df["age"].isNotNull() and df["marks"].isNotNull())
except ValueError as e:
    print(f"Error: {e}")

Output:

Error: Cannot convert column into bool...

The correct approach - use & with parentheses:

# CORRECT: use & operator with parentheses
filtered = df.filter(
    (df["age"].isNotNull()) & (df["marks"].isNotNull())
)
filtered.show()

danger

Never use Python's and, or, or not keywords with PySpark column expressions. Use & (AND), | (OR), and ~ (NOT) instead. Parentheses around each condition are required due to Python's operator precedence.

Practical Example: Cleaning Before Analysis

Here is a typical workflow that filters nulls before performing calculations:

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("NullFiltering").getOrCreate()

data = [
    Row(name="Alpha", age=20, marks=54),
    Row(name="Beta", age=None, marks=None),
    Row(name="Omega", age=17, marks=85),
    Row(name="Sigma", age=None, marks=62),
    Row(name="Delta", age=22, marks=None)
]

df = spark.createDataFrame(data)

# Filter out rows with null marks before calculating average
clean_df = df.filter(df["marks"].isNotNull())

clean_df.select(
    count("marks").alias("count"),
    avg("marks").alias("average_marks")
).show()

Output:

+-----+-------------+
|count|average_marks|
+-----+-------------+
|    3|         67.0|
+-----+-------------+

Quick Reference

| Operation                                 | Code                                                        |
|-------------------------------------------|-------------------------------------------------------------|
| Filter non-null in one column             | df.filter(df["col"].isNotNull())                            |
| Filter null in one column                 | df.filter(df["col"].isNull())                               |
| Filter non-null in multiple columns (AND) | df.filter((df["c1"].isNotNull()) & (df["c2"].isNotNull()))  |
| Filter non-null in any column (OR)        | df.filter((df["c1"].isNotNull()) \| (df["c2"].isNotNull())) |
| Drop rows with any null                   | df.dropna()                                                 |
| Drop rows with null in specific columns   | df.dropna(subset=["col1", "col2"])                          |
| Count nulls in a column                   | df.filter(df["col"].isNull()).count()                       |

The isNotNull() method is a fundamental tool for data cleaning in PySpark. Combined with filter() and logical operators, it gives you precise control over which rows to keep or discard based on null values across one or more columns.