
Python PySpark: How to Check if a PySpark DataFrame Is Empty

When building data pipelines with PySpark, it's common to encounter situations where a DataFrame might be empty: perhaps a filter removed all rows, a source table had no data, or an upstream process produced no output. Performing operations on an empty DataFrame can lead to unexpected errors or wasted computation.

In this guide, you'll learn multiple reliable methods to check if a PySpark DataFrame is empty, along with their performance trade-offs so you can choose the best approach for your use case.

Setting Up the Example

Let's start by creating both an empty and a non-empty DataFrame to test each method:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create Spark session
spark = SparkSession.builder.appName("CheckEmptyDF").getOrCreate()

# Define schema
schema = StructType([
    StructField('Country', StringType(), True),
    StructField('City', StringType(), True),
    StructField('Capital', StringType(), True)
])

# Create an empty DataFrame
empty_df = spark.createDataFrame([], schema)

# Create a non-empty DataFrame
data = [
    ("India", "Mumbai", "New Delhi"),
    ("USA", "New York", "Washington D.C."),
    ("UK", "Manchester", "London")
]
non_empty_df = spark.createDataFrame(data, schema)

print("Empty DataFrame:")
empty_df.show()

print("Non-empty DataFrame:")
non_empty_df.show()

Output:

Empty DataFrame:
+-------+----+-------+
|Country|City|Capital|
+-------+----+-------+
+-------+----+-------+

Non-empty DataFrame:
+-------+----------+---------------+
|Country| City| Capital|
+-------+----------+---------------+
| India| Mumbai| New Delhi|
| USA| New York|Washington D.C.|
| UK|Manchester| London|
+-------+----------+---------------+

Using rdd.isEmpty()

The rdd.isEmpty() method checks whether the underlying RDD of the DataFrame contains any elements. It returns True if the DataFrame is empty and False otherwise.

print(f"Empty DF:     {empty_df.rdd.isEmpty()}")
print(f"Non-empty DF: {non_empty_df.rdd.isEmpty()}")

Output:

Empty DF:     True
Non-empty DF: False
Note

rdd.isEmpty() internally calls take(1) on the RDD, which starts by scanning only the first partition and escalates to more partitions only if that one is empty. This makes it more efficient than count() for large DataFrames, but it still involves converting the DataFrame to an RDD, which has overhead.
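Since Spark 3.3, DataFrames also expose a native isEmpty() method that avoids the RDD conversion entirely. If your code may run on older Spark versions, a small compatibility helper can prefer the native method and fall back to head(1). This is a sketch (the helper name is_empty is illustrative):

```python
def is_empty(df):
    """Return True if the DataFrame has no rows.

    Prefers the native DataFrame.isEmpty() (available in Spark 3.3+);
    on older versions, falls back to fetching at most one row via head(1).
    """
    if hasattr(df, "isEmpty"):
        return df.isEmpty()
    return len(df.head(1)) == 0
```

Both branches stop after evaluating at most one row, so the helper is safe to call on large DataFrames.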

Using head() or first() with Exception Handling

Calling head(1) returns a list with the first row (or an empty list if the DataFrame has no rows). This is one of the fastest methods because PySpark only needs to evaluate a single row.

# head(1) returns a list (empty if the DataFrame has no rows)
is_empty = len(empty_df.head(1)) == 0
print(f"Empty DF: {is_empty}")

is_empty = len(non_empty_df.head(1)) == 0
print(f"Non-empty DF: {is_empty}")

Output:

Empty DF: True
Non-empty DF: False
Be Careful with first()

The first() method raises an exception on empty DataFrames instead of returning an empty result:

empty_df.first()

Output:

Py4JJavaError: An error occurred while calling o42.first.
java.util.NoSuchElementException: next on empty iterator

If you use first(), always wrap it in a try-except block:

try:
    empty_df.first()
    print("DataFrame is NOT empty.")
except Exception:
    print("DataFrame is empty.")

Output:

DataFrame is empty.

For safety and simplicity, prefer head(1) over first().

Using count()

The count() method returns the total number of rows in the DataFrame. Comparing it to zero is a straightforward way to check for emptiness.

print(f"Empty DF row count:     {empty_df.count()}")
print(f"Non-empty DF row count: {non_empty_df.count()}")

# Check emptiness
print(f"\nEmpty DF is empty: {empty_df.count() == 0}")
print(f"Non-empty DF is empty: {non_empty_df.count() == 0}")

Output:

Empty DF row count:     0
Non-empty DF row count: 3

Empty DF is empty: True
Non-empty DF is empty: False
Performance Warning

count() scans all partitions across all nodes to compute the exact total. For large DataFrames with millions of rows, this can be very expensive. If you only need to know whether the DataFrame is empty or not, use head(1) or limit(1).count() instead, since they stop after finding a single row.

# More efficient alternative
is_empty = empty_df.limit(1).count() == 0
print(f"Empty DF: {is_empty}")

Using take(1)

The take(n) method returns the first n rows as a list. Similar to head(1), it's efficient because it only needs to find one row to confirm the DataFrame isn't empty.

is_empty = len(empty_df.take(1)) == 0
print(f"Empty DF: {is_empty}")

is_empty = len(non_empty_df.take(1)) == 0
print(f"Non-empty DF: {is_empty}")

Output:

Empty DF: True
Non-empty DF: False

Creating a Reusable Utility Function

For production code, wrap the check in a reusable function using the most efficient method:

def is_dataframe_empty(df):
    """Check if a PySpark DataFrame is empty efficiently."""
    return len(df.head(1)) == 0


# Usage
print(f"Empty DF: {is_dataframe_empty(empty_df)}")
print(f"Non-empty DF: {is_dataframe_empty(non_empty_df)}")

Output:

Empty DF: True
Non-empty DF: False

You can also create a version that provides more context:

def check_dataframe(df, name="DataFrame"):
    """Check if a DataFrame is empty and print a summary."""
    if len(df.head(1)) == 0:
        print(f"⚠️ '{name}' is empty (schema: {df.columns})")
        return False
    else:
        row_count = df.count()
        print(f"✅ '{name}' has {row_count} rows and {len(df.columns)} columns.")
        return True


check_dataframe(empty_df, "Countries (empty)")
check_dataframe(non_empty_df, "Countries")

Output:

⚠️  'Countries (empty)' is empty (schema: ['Country', 'City', 'Capital'])
✅ 'Countries' has 3 rows and 3 columns.

Performance Comparison

| Method | Scans All Data | Speed on Large DataFrames | Safe on Empty DF |
|---|---|---|---|
| head(1) / len(df.head(1)) == 0 | ❌ (1 row only) | ⚡ Fastest | ✅ Yes |
| take(1) / len(df.take(1)) == 0 | ❌ (1 row only) | ⚡ Fastest | ✅ Yes |
| limit(1).count() == 0 | ❌ (1 row only) | ⚡ Fast | ✅ Yes |
| rdd.isEmpty() | ❌ (first partition) | 🔶 Moderate (RDD conversion) | ✅ Yes |
| count() == 0 | ✅ (all partitions) | 🐢 Slowest | ✅ Yes |
| first() | ❌ (1 row only) | ⚡ Fast | ❌ Raises exception |
Best Practice

For production PySpark pipelines, len(df.head(1)) == 0 is the recommended approach. It's fast, safe, and doesn't trigger a full data scan. Reserve count() for situations where you actually need the exact row count.
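In a pipeline, this check typically guards an expensive action such as writing output. Here's a minimal sketch of that pattern (the function name write_if_not_empty and the Parquet destination are illustrative, not part of any PySpark API):

```python
def write_if_not_empty(df, path):
    """Write df to Parquet only when it contains at least one row.

    Returns True if a write happened, False if the DataFrame was empty.
    """
    if len(df.head(1)) == 0:
        # Nothing to write; skip the expensive output step.
        return False
    df.write.mode("overwrite").parquet(path)
    return True
```

Returning a boolean lets the caller log or branch on the outcome instead of silently producing an empty output file.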

Conclusion

Checking whether a PySpark DataFrame is empty is essential for building robust data pipelines that handle edge cases gracefully. Here's a quick summary of when to use each method:

  • head(1) or take(1): the fastest and safest options for a simple empty check. They evaluate only one row and return an empty list if no data exists.
  • rdd.isEmpty(): a readable alternative, though it involves RDD conversion overhead.
  • limit(1).count(): a clean middle ground that avoids full scans.
  • count() == 0: use only when you need the exact row count anyway, as it scans all partitions.
  • Avoid first() on potentially empty DataFrames unless wrapped in a try-except block.

For most scenarios, checking len(df.head(1)) == 0 gives you the best balance of performance, readability, and safety.