Python PySpark: How to Extract a Single Value from a PySpark DataFrame
When working with PySpark DataFrames, you often need to retrieve a single specific value - such as checking the first entry in a column, pulling a computed result from an aggregation, or accessing a configuration value stored in a one-row DataFrame. Unlike Pandas, PySpark DataFrames are distributed, so you cannot simply index into them with bracket notation directly on the DataFrame object.
In this guide, you will learn multiple ways to extract a single value from a PySpark DataFrame using first(), head(), collect(), and other techniques, along with their differences and best practices.
Setting Up the Example DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("extract_value").getOrCreate()
data = [
["1", "Sravan", "Vignan"],
["2", "Ojaswi", "VVIT"],
["3", "Rohith", "VVIT"],
["4", "Sridevi", "Vignan"],
["1", "Sravan", "Vignan"],
["5", "Gnanesh", "IIT"],
]
columns = ["Student_ID", "Student_Name", "College"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----------+------------+-------+
|Student_ID|Student_Name|College|
+----------+------------+-------+
| 1| Sravan| Vignan|
| 2| Ojaswi| VVIT|
| 3| Rohith| VVIT|
| 4| Sridevi| Vignan|
| 1| Sravan| Vignan|
| 5| Gnanesh| IIT|
+----------+------------+-------+
Method 1: Using first()
The first() method returns the first row of the DataFrame as a Row object. You can then access a specific column's value by column name or index position.
Access by Column Name
value = df.first()["Student_ID"]
print(value)
Output:
1
Access by Index Position
value = df.first()[1]
print(value)
Output:
Sravan
Accessing by column name (first()["Student_Name"]) is more readable and less error-prone than using index positions (first()[1]), especially when the DataFrame schema changes over time.
Method 2: Using head()
The head() method works similarly to first() when called without arguments - it returns the first row as a Row object.
Access by Column Name
value = df.head()["College"]
print(value)
Output:
Vignan
Access by Index Position
value = df.head()[0]
print(value)
Output:
1
Retrieving Multiple Rows with head(n)
When you pass an integer argument, head(n) returns a list of the first n rows. You can then access a specific row and column:
# Get the first 3 rows
rows = df.head(3)
# Access the Student_Name from the second row (index 1)
value = rows[1]["Student_Name"]
print(value)
Output:
Ojaswi
Method 3: Using collect()
The collect() method retrieves every row of the DataFrame to the driver as a list of Row objects. You can then index into any row and column.
all_rows = df.collect()
# Access value from the 4th row, "College" column
value = all_rows[3]["College"]
print(value)
Output:
Vignan
collect() brings the entire DataFrame into the driver's memory. For large DataFrames, this can cause OutOfMemoryError. Only use collect() when you are certain the data is small enough to fit in memory, or when you specifically need to access rows beyond the first few.
Method 4: Using take(n)
The take(n) method returns the first n rows as a list of Row objects, just like head(n). In fact, head(n) delegates to take(n) internally, so the two are interchangeable; take(n) also mirrors the RDD method of the same name.
rows = df.take(2)
# Access the Student_Name from the first row
value = rows[0]["Student_Name"]
print(value)
Output:
Sravan
Extracting a Value from a Filtered or Aggregated DataFrame
In practice, you often extract a value after filtering or aggregating data. Here are common patterns:
Extract After Filtering
# Get the college of the student named "Gnanesh"
value = df.filter(df["Student_Name"] == "Gnanesh").first()["College"]
print(value)
Output:
IIT
Extract an Aggregation Result
from pyspark.sql.functions import countDistinct
# Count distinct colleges and extract the value
result = df.select(countDistinct("College").alias("Unique_Colleges"))
value = result.first()["Unique_Colleges"]
print(f"Number of unique colleges: {value}")
Output:
Number of unique colleges: 3
Extract Using first() on a Specific Column
You can combine select() with first() to extract a value from a specific column directly:
value = df.select("Student_Name").first()[0]
print(value)
Output:
Sravan
Common Mistake: Calling first() on an Empty DataFrame
If the DataFrame is empty, first() and head() both return None, while head(n) and take(n) return an empty list. Indexing into a None result raises a TypeError.
Problem:
empty_df = df.filter(df["Student_Name"] == "NonExistent")
# first() returns None here, so subscripting it fails
value = empty_df.first()["Student_Name"]
Error:
TypeError: 'NoneType' object is not subscriptable
Solution - check before accessing:
empty_df = df.filter(df["Student_Name"] == "NonExistent")
first_row = empty_df.head()
if first_row is not None:
    print(first_row["Student_Name"])
else:
    print("No matching records found.")
Output:
No matching records found.
Another safe pattern uses take(1):
rows = empty_df.take(1)
if rows:
    print(rows[0]["Student_Name"])
else:
    print("No matching records found.")
This avoids exceptions entirely since take() always returns a list (which may be empty).
Comparison of Methods
| Method | Returns | Fetches to Driver | Best For |
|---|---|---|---|
| first() | Row or None | 1 row | Quick access to the first row |
| head() | Row or None | 1 row | Same as first() |
| head(n) | List of Row | n rows | Accessing the first few rows |
| take(n) | List of Row | n rows | Same as head(n) |
| collect() | List of all Row | All rows | Accessing any row (small DataFrames only) |
Summary
Extracting a single value from a PySpark DataFrame involves bringing data from the distributed cluster to the driver. Key takeaways:
- Use first() or head() for quick access to a value in the first row - access by column name for clarity.
- Use head(n) or take(n) when you need values from a specific row beyond the first.
- Use collect() sparingly and only on small DataFrames to avoid memory issues.
- Always check for an empty DataFrame before indexing into the result of first() or head().
- Combine with filter(), select(), or aggregate functions to extract computed or conditional values.