PySpark: How to Select a Range of Rows from a DataFrame
When working with large datasets in PySpark, you'll often need to filter rows based on specific ranges - for example, selecting records where a score falls between two values, dates within a time period, or IDs within a certain range. PySpark provides several methods to accomplish this, including filter(), where(), and SQL expressions.
In this guide, you'll learn three approaches to select a range of rows from a PySpark DataFrame, understand the differences between them, and see practical examples for different data types.
Creating a Sample DataFrame
Let's start by creating a PySpark DataFrame to use throughout all examples:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName('row_selection').getOrCreate()
# Sample student data
data = [
["1", "Sravan", "Vignan", 67, 89],
["2", "Ojaswi", "VVIT", 78, 89],
["3", "Rohith", "VVIT", 100, 80],
["4", "Sridevi", "Vignan", 78, 80],
["5", "Sravan", "Vignan", 89, 98],
["6", "Gnanesh", "IIT", 94, 98]
]
columns = ['ID', 'Name', 'College', 'Subject1', 'Subject2']
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 1| Sravan| Vignan| 67| 89|
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
Method 1: Using filter()
The filter() method selects rows based on a given condition. Combined with the between() function, it provides a clean way to select rows within a range.
Selecting a Numeric Range
# Select rows where Subject1 score is between 70 and 90 (inclusive)
result = df.filter(df.Subject1.between(70, 90))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Using Comparison Operators
Instead of between(), you can use explicit comparison operators for more control:
# Equivalent to between(70, 90) but with explicit operators
result = df.filter((df.Subject1 >= 70) & (df.Subject1 <= 90))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Combining Multiple Conditions
# Select rows where Subject1 is between 70 and 90 AND Subject2 is above 85
result = df.filter(
(df.Subject1.between(70, 90)) & (df.Subject2 > 85)
)
result.show()
Output:
+---+------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+------+-------+--------+--------+
| 2|Ojaswi| VVIT| 78| 89|
| 5|Sravan| Vignan| 89| 98|
+---+------+-------+--------+--------+
When combining conditions in PySpark, always use & (AND), | (OR), and ~ (NOT) operators, and wrap each condition in parentheses. Python's and, or, and not keywords will not work with PySpark column expressions.
Method 2: Using where()
The where() method is functionally identical to filter() - they are aliases for the same operation. Use whichever reads better in your context:
# Select rows where Subject1 is between 85 and 100
result = df.where(df.Subject1.between(85, 100))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 3| Rohith| VVIT| 100| 80|
| 5| Sravan| Vignan| 89| 98|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
Filtering on String Columns
You can also use between() on string columns. It performs lexicographic (alphabetical) comparison:
# Select rows where College is alphabetically between "IIT" and "VVIT"
result = df.where(df.College.between("IIT", "VVIT"))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
filter() vs where()
In PySpark, filter() and where() are exact aliases - they produce the same execution plan and the same results. The distinction is purely stylistic:
filter() is more Pythonic and commonly used in PySpark code. where() is more familiar to SQL users.
Choose whichever makes your code more readable for your team.
Method 3: Using SQL Expressions
PySpark allows you to write SQL queries directly against DataFrames by registering them as temporary views. This is especially useful for teams familiar with SQL:
Setting Up a Temporary View
# Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("students")
Selecting a Range with SQL
# Select rows where Subject1 is between 70 and 90
result = spark.sql("""
SELECT * FROM students
WHERE Subject1 BETWEEN 70 AND 90
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Complex SQL Queries
# Multiple conditions with SQL
result = spark.sql("""
SELECT * FROM students
WHERE Subject1 BETWEEN 70 AND 95
AND College = 'Vignan'
ORDER BY Subject1 DESC
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 5| Sravan| Vignan| 89| 98|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
Selecting by ID Range
# Select rows where ID is between 2 and 4
result = spark.sql("""
SELECT * FROM students
WHERE ID BETWEEN 2 AND 4
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
Selecting Rows by Position (Row Number)
Unlike Pandas, PySpark DataFrames don't have a built-in positional index. To select rows by position (e.g., rows 2 through 4), you need to add a row number column:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Add a row number column
window = Window.orderBy("ID")
df_with_row_num = df.withColumn("row_num", F.row_number().over(window))
# Select rows 2 through 4 by position
result = df_with_row_num.filter(
F.col("row_num").between(2, 4)
).drop("row_num")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
row_number() requires a Window with an orderBy clause. The ordering determines which rows get which numbers. Without a deterministic order, row numbers may vary between executions.
Using limit() and head() for Top N Rows
If you simply need the first N rows without a specific condition:
# Get the first 3 rows as a DataFrame
top_3 = df.limit(3)
top_3.show()
Output:
+---+------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+------+-------+--------+--------+
| 1|Sravan| Vignan| 67| 89|
| 2|Ojaswi| VVIT| 78| 89|
| 3|Rohith| VVIT| 100| 80|
+---+------+-------+--------+--------+
Comparison of Methods
| Method | Syntax | Best For |
|---|---|---|
| filter() | df.filter(condition) | Pythonic PySpark code |
| where() | df.where(condition) | SQL-familiar users |
| SQL expression | spark.sql("SELECT ... WHERE ... BETWEEN") | Complex queries, team collaboration |
| row_number() | Window function + filter | Positional row selection |
| limit() | df.limit(n) | Selecting the first N rows |
Summary
To select a range of rows from a PySpark DataFrame:
- Use filter() or where() with the between() function for range-based filtering - they are functionally identical aliases.
- Use SQL expressions with BETWEEN when writing complex queries or when your team prefers SQL syntax.
- Use row_number() with a Window function when you need to select rows by their position rather than by column values.
- Always use &, |, ~ operators (not and, or, not) when combining multiple conditions, and wrap each condition in parentheses.