Python PySpark: How to Extract a Single Value from a PySpark DataFrame

When working with PySpark DataFrames, you often need to retrieve a single specific value - such as checking the first entry in a column, pulling a computed result from an aggregation, or accessing a configuration value stored in a one-row DataFrame. Unlike Pandas, PySpark DataFrames are distributed, so you cannot simply index into them with bracket notation directly on the DataFrame object.

In this guide, you will learn multiple ways to extract a single value from a PySpark DataFrame using first(), head(), collect(), and other techniques, along with their differences and best practices.

Setting Up the Example DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract_value").getOrCreate()

data = [
    ["1", "Sravan", "Vignan"],
    ["2", "Ojaswi", "VVIT"],
    ["3", "Rohith", "VVIT"],
    ["4", "Sridevi", "Vignan"],
    ["1", "Sravan", "Vignan"],
    ["5", "Gnanesh", "IIT"],
]

columns = ["Student_ID", "Student_Name", "College"]

df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------+------------+-------+
|Student_ID|Student_Name|College|
+----------+------------+-------+
|         1|      Sravan| Vignan|
|         2|      Ojaswi|   VVIT|
|         3|      Rohith|   VVIT|
|         4|     Sridevi| Vignan|
|         1|      Sravan| Vignan|
|         5|     Gnanesh|    IIT|
+----------+------------+-------+

Method 1: Using first()

The first() method returns the first row of the DataFrame as a Row object. You can then access a specific column's value by column name or index position.

Access by Column Name

value = df.first()["Student_ID"]
print(value)

Output:

1

Access by Index Position

value = df.first()[1]
print(value)

Output:

Sravan

Tip:

Accessing by column name (first()["Student_Name"]) is more readable and less error-prone than using index positions (first()[1]), especially when the DataFrame schema changes over time.

Method 2: Using head()

The head() method works similarly to first() when called without arguments - it returns the first row as a Row object.

Access by Column Name

value = df.head()["College"]
print(value)

Output:

Vignan

Access by Index Position

value = df.head()[0]
print(value)

Output:

1

Retrieving Multiple Rows with head(n)

When you pass an integer argument, head(n) returns a list of the first n rows. You can then access a specific row and column:

# Get the first 3 rows
rows = df.head(3)

# Access the Student_Name from the second row (index 1)
value = rows[1]["Student_Name"]
print(value)

Output:

Ojaswi

Method 3: Using collect()

The collect() method retrieves every row of the DataFrame to the driver as a list of Row objects. You can then index into any row and access any column.

all_rows = df.collect()

# Access value from the 4th row, "College" column
value = all_rows[3]["College"]
print(value)

Output:

Vignan

Caution:

collect() brings the entire DataFrame into the driver's memory. For large DataFrames, this can cause OutOfMemoryError. Only use collect() when you are certain the data is small enough to fit in memory, or when you specifically need to access rows beyond the first few.

Method 4: Using take(n)

The take(n) method is functionally equivalent to head(n): it returns the first n rows as a list of Row objects. The name comes from the older RDD API, so you may see either method in existing codebases.

rows = df.take(2)

# Access the Student_Name from the first row
value = rows[0]["Student_Name"]
print(value)

Output:

Sravan

Extracting a Value from a Filtered or Aggregated DataFrame

In practice, you often extract a value after filtering or aggregating data. Here are common patterns:

Extract After Filtering

# Get the college of the student named "Gnanesh"
value = df.filter(df["Student_Name"] == "Gnanesh").first()["College"]
print(value)

Output:

IIT

Extract an Aggregation Result

from pyspark.sql.functions import countDistinct

# Count distinct colleges and extract the value
result = df.select(countDistinct("College").alias("Unique_Colleges"))
value = result.first()["Unique_Colleges"]
print(f"Number of unique colleges: {value}")

Output:

Number of unique colleges: 3

Extract Using first() on a Specific Column

You can combine select() with first() to extract a value from a specific column directly:

value = df.select("Student_Name").first()[0]
print(value)

Output:

Sravan

Common Mistake: Calling first() on an Empty DataFrame

If the DataFrame is empty, both first() and head() return None when called without arguments, and head(n) returns an empty list. The error comes one step later: subscripting that None result raises a TypeError.

Problem:

empty_df = df.filter(df["Student_Name"] == "NonExistent")

# This raises a TypeError because first() returns None here
value = empty_df.first()["Student_Name"]

Error:

TypeError: 'NoneType' object is not subscriptable

Solution - check before accessing:

empty_df = df.filter(df["Student_Name"] == "NonExistent")

first_row = empty_df.head()
if first_row is not None:
    print(first_row["Student_Name"])
else:
    print("No matching records found.")

Output:

No matching records found.

Tip:

Another safe pattern uses take(1):

rows = empty_df.take(1)
if rows:
    print(rows[0]["Student_Name"])
else:
    print("No matching records found.")

This avoids exceptions entirely since take() always returns a list (which may be empty).

Comparison of Methods

| Method    | Returns                    | Fetches to Driver | Best For                                   |
|-----------|----------------------------|-------------------|--------------------------------------------|
| first()   | Single Row (None if empty) | 1 row             | Quick access to the first row              |
| head()    | Single Row (None if empty) | 1 row             | Same as first()                            |
| head(n)   | List of Row                | n rows            | Accessing the first few rows               |
| take(n)   | List of Row                | n rows            | Same as head(n)                            |
| collect() | List of all Rows           | All rows          | Accessing any row (small DataFrames only)  |

Summary

Extracting a single value from a PySpark DataFrame involves bringing data from the distributed cluster to the driver. Key takeaways:

  • Use first() or head() for quick access to a value in the first row - access by column name for clarity.
  • Use head(n) or take(n) when you need values from a specific row beyond the first.
  • Use collect() sparingly and only on small DataFrames to avoid memory issues.
  • Always check for empty DataFrames - first() and head() return None when no rows match, and subscripting None raises a TypeError.
  • Combine with filter(), select(), or aggregate functions to extract computed or conditional values.