PySpark: How to Select a Range of Rows from a DataFrame
When working with large datasets in PySpark, you'll often need to filter rows based on specific ranges - for example, selecting records where a score falls between two values, dates within a time period, or IDs within a certain range. PySpark provides several methods to accomplish this, including filter(), where(), and SQL expressions.
In this guide, you'll learn three approaches to select a range of rows from a PySpark DataFrame, understand the differences between them, and see practical examples for different data types.
Creating a Sample DataFrame
Let's start by creating a PySpark DataFrame to use throughout all examples:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName('row_selection').getOrCreate()
# Sample student data
data = [
["1", "Sravan", "Vignan", 67, 89],
["2", "Ojaswi", "VVIT", 78, 89],
["3", "Rohith", "VVIT", 100, 80],
["4", "Sridevi", "Vignan", 78, 80],
["5", "Sravan", "Vignan", 89, 98],
["6", "Gnanesh", "IIT", 94, 98]
]
columns = ['ID', 'Name', 'College', 'Subject1', 'Subject2']
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 1| Sravan| Vignan| 67| 89|
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
Method 1: Using filter()
The filter() method selects rows based on a given condition. Combined with the between() function, it provides a clean way to select rows within a range.
Selecting a Numeric Range
# Select rows where Subject1 score is between 70 and 90 (inclusive)
result = df.filter(df.Subject1.between(70, 90))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Using Comparison Operators
Instead of between(), you can use explicit comparison operators for more control:
# Equivalent to between(70, 90) but with explicit operators
result = df.filter((df.Subject1 >= 70) & (df.Subject1 <= 90))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Combining Multiple Conditions
# Select rows where Subject1 is between 70 and 90 AND Subject2 is above 85
result = df.filter(
(df.Subject1.between(70, 90)) & (df.Subject2 > 85)
)
result.show()
Output:
+---+------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+------+-------+--------+--------+
| 2|Ojaswi| VVIT| 78| 89|
| 5|Sravan| Vignan| 89| 98|
+---+------+-------+--------+--------+
When combining conditions in PySpark, always use & (AND), | (OR), and ~ (NOT) operators, and wrap each condition in parentheses. Python's and, or, and not keywords will not work with PySpark column expressions.
Method 2: Using where()
The where() method is functionally identical to filter() - they are aliases for the same operation. Use whichever reads better in your context:
# Select rows where Subject1 is between 85 and 100
result = df.where(df.Subject1.between(85, 100))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 3| Rohith| VVIT| 100| 80|
| 5| Sravan| Vignan| 89| 98|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
Filtering on String Columns
You can also use between() on string columns. It performs lexicographic (alphabetical) comparison:
# Select rows where College is alphabetically between "IIT" and "VVIT"
result = df.where(df.College.between("IIT", "VVIT"))
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 6|Gnanesh| IIT| 94| 98|
+---+-------+-------+--------+--------+
filter() vs where()
In PySpark, filter() and where() are exact aliases - they produce the same execution plan and the same results. The distinction is purely stylistic:
filter() is more Pythonic and commonly used in PySpark code. where() is more familiar to SQL users.
Choose whichever makes your code more readable for your team.
Method 3: Using SQL Expressions
PySpark allows you to write SQL queries directly against DataFrames by registering them as temporary views. This is especially useful for teams familiar with SQL:
Setting Up a Temporary View
# Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("students")
Selecting a Range with SQL
# Select rows where Subject1 is between 70 and 90
result = spark.sql("""
SELECT * FROM students
WHERE Subject1 BETWEEN 70 AND 90
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 4|Sridevi| Vignan| 78| 80|
| 5| Sravan| Vignan| 89| 98|
+---+-------+-------+--------+--------+
Complex SQL Queries
# Multiple conditions with SQL
result = spark.sql("""
SELECT * FROM students
WHERE Subject1 BETWEEN 70 AND 95
AND College = 'Vignan'
ORDER BY Subject1 DESC
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 5| Sravan| Vignan| 89| 98|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
Selecting by ID Range
# Select rows where ID is between 2 and 4
result = spark.sql("""
SELECT * FROM students
WHERE ID BETWEEN 2 AND 4
""")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
Selecting Rows by Position (Row Number)
Unlike Pandas, PySpark DataFrames don't have a built-in positional index. To select rows by position (e.g., rows 2 through 4), you need to add a row number column:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Add a row number column
window = Window.orderBy("ID")
df_with_row_num = df.withColumn("row_num", F.row_number().over(window))
# Select rows 2 through 4 by position
result = df_with_row_num.filter(
F.col("row_num").between(2, 4)
).drop("row_num")
result.show()
Output:
+---+-------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+-------+-------+--------+--------+
| 2| Ojaswi| VVIT| 78| 89|
| 3| Rohith| VVIT| 100| 80|
| 4|Sridevi| Vignan| 78| 80|
+---+-------+-------+--------+--------+
row_number() requires a Window with an orderBy clause. The ordering determines which rows get which numbers. Without a deterministic order, row numbers may vary between executions.
Using limit() and head() for Top N Rows
If you simply need the first N rows without a specific condition:
# Get the first 3 rows as a DataFrame
top_3 = df.limit(3)
top_3.show()
Output:
+---+------+-------+--------+--------+
| ID| Name|College|Subject1|Subject2|
+---+------+-------+--------+--------+
| 1|Sravan| Vignan| 67| 89|
| 2|Ojaswi| VVIT| 78| 89|
| 3|Rohith| VVIT| 100| 80|
+---+------+-------+--------+--------+
Comparison of Methods
| Method | Syntax | Best For |
|---|---|---|
| filter() | df.filter(condition) | Pythonic PySpark code |
| where() | df.where(condition) | SQL-familiar users |
| SQL expression | spark.sql("SELECT ... WHERE ... BETWEEN") | Complex queries, team collaboration |
| row_number() | Window function + filter | Positional row selection |
| limit() | df.limit(n) | Selecting the first N rows |
Summary
To select a range of rows from a PySpark DataFrame:
- Use filter() or where() with the between() function for range-based filtering - they are functionally identical aliases.
- Use SQL expressions with BETWEEN when writing complex queries or when your team prefers SQL syntax.
- Use row_number() with a Window function when you need to select rows by their position rather than by column values.
- Always use &, |, ~ operators (not and, or, not) when combining multiple conditions, and wrap each condition in parentheses.