Python PySpark: How to Order by Multiple Columns
Sorting data is one of the most common operations when working with DataFrames. In PySpark, you frequently need to sort by multiple columns - for example, sorting employees first by department and then by name within each department. PySpark provides two primary methods for this: orderBy() and sort(), both of which support multi-column sorting with flexible ascending and descending options.
In this guide, you will learn how to sort PySpark DataFrames by multiple columns using both methods, including how to apply different sort directions per column - a common real-world requirement.
Setting Up the Example DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sorting_demo").getOrCreate()
data = [
["1", "James", "Company 1", 45000],
["2", "Emily", "Company 1", 52000],
["3", "Michael", "Company 2", 45000],
["4", "Sarah", "Company 1", 60000],
["5", "David", "Company 1", 48000],
["6", "Michael", "Company 1", 39000],
]
columns = ["ID", "Name", "Company", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 3|Michael|Company 2| 45000|
| 4| Sarah|Company 1| 60000|
| 5| David|Company 1| 48000|
| 6|Michael|Company 1| 39000|
+---+-------+---------+------+
Method 1: Using orderBy()
The orderBy() method sorts the DataFrame by one or more columns. You can control the sort direction using the ascending parameter.
Ascending Order (Default)
df.orderBy(["Name", "ID"], ascending=True).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 5| David|Company 1| 48000|
| 2| Emily|Company 1| 52000|
| 1| James|Company 1| 45000|
| 3|Michael|Company 2| 45000|
| 6|Michael|Company 1| 39000|
| 4| Sarah|Company 1| 60000|
+---+-------+---------+------+
The DataFrame is sorted by Name first (alphabetically), and where names are the same (Michael appears twice), it sorts by ID.
Descending Order
df.orderBy(["Name", "ID"], ascending=False).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
+---+-------+---------+------+
Method 2: Using sort()
The sort() method works identically to orderBy(). They are aliases of each other - you can use whichever reads more naturally in your code.
Ascending Order
df.sort(["Name", "ID"], ascending=True).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 5| David|Company 1| 48000|
| 2| Emily|Company 1| 52000|
| 1| James|Company 1| 45000|
| 3|Michael|Company 2| 45000|
| 6|Michael|Company 1| 39000|
| 4| Sarah|Company 1| 60000|
+---+-------+---------+------+
Descending Order
df.sort(["Name", "ID"], ascending=False).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
+---+-------+---------+------+
sort() and orderBy() are functionally identical in PySpark. Use whichever you prefer - orderBy() is more common in SQL-influenced codebases, while sort() is more concise.
Mixed Sort Directions per Column
A common real-world requirement is sorting by one column in ascending order and another in descending order - for example, sorting by Company ascending and Salary descending to see the highest earners at the top within each company.
Using a List of Booleans
Pass a list of boolean values to ascending, matching the order of columns:
# Company ascending, Salary descending
df.orderBy(["Company", "Salary"], ascending=[True, False]).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+
Within Company 1, employees are sorted from highest to lowest salary.
Using col() with .asc() and .desc()
For more explicit control, use col() with .asc() or .desc() methods:
from pyspark.sql.functions import col
df.orderBy(col("Company").asc(), col("Salary").desc()).show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+
The col().asc() / col().desc() syntax is the most readable approach, especially when mixing sort directions across many columns. It also supports null ordering:
from pyspark.sql.functions import col
# Nulls first in ascending, nulls last in descending
df.orderBy(
col("Company").asc_nulls_first(),
col("Salary").desc_nulls_last()
).show()
Handling Null Values in Sorting
By default, PySpark places null values first in ascending order and last in descending order. You can control this behavior explicitly:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
data = [
["1", "James", 45000],
["2", "Emily", None],
["3", "Michael", 52000],
["4", "David", None],
]
df_nulls = spark.createDataFrame(data, ["ID", "Name", "Salary"])
# Default behavior: nulls first in ascending
print("Default ascending (nulls first):")
df_nulls.orderBy(col("Salary").asc()).show()
# Nulls last in ascending
print("Ascending with nulls last:")
df_nulls.orderBy(col("Salary").asc_nulls_last()).show()
Output:
Default ascending (nulls first):
+---+-------+------+
| ID| Name|Salary|
+---+-------+------+
| 2| Emily| null|
| 4| David| null|
| 1| James| 45000|
| 3|Michael| 52000|
+---+-------+------+
Ascending with nulls last:
+---+-------+------+
| ID| Name|Salary|
+---+-------+------+
| 1| James| 45000|
| 3|Michael| 52000|
| 2| Emily| null|
| 4| David| null|
+---+-------+------+
Sorting with SQL Queries
You can also sort using SQL syntax after registering the DataFrame as a temporary view:
df.createOrReplaceTempView("employees")
spark.sql("""
SELECT * FROM employees
ORDER BY Company ASC, Salary DESC, Name ASC
""").show()
Output:
+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+
Common Mistake: Mismatched ascending List Length
When passing a list to the ascending parameter, it must have the same number of elements as the column list. A mismatch can produce unexpected results or errors.
Wrong: 3 columns but only 2 ascending values
# Depending on your Spark version, this may raise an error or silently ignore the last sort column
df.orderBy(["Name", "Company", "Salary"], ascending=[True, False]).show()
Correct: match the number of columns
df.orderBy(
["Name", "Company", "Salary"],
ascending=[True, False, True]
).show()
Always ensure the ascending list has the same number of elements as the columns list. For better safety and readability, prefer the col().asc() / col().desc() syntax, which makes the direction explicit for each column.
Comparison of Approaches
| Approach | Syntax | Mixed Directions | Null Control | Readability |
|---|---|---|---|---|
| orderBy() with ascending=True/False | orderBy(["A", "B"], ascending=True) | ❌ No (single direction) | ❌ No | Good |
| orderBy() with ascending list | orderBy(["A", "B"], ascending=[True, False]) | ✅ Yes | ❌ No | Moderate |
| col().asc() / col().desc() | orderBy(col("A").asc(), col("B").desc()) | ✅ Yes | ✅ Yes | Best |
| SQL ORDER BY | ORDER BY A ASC, B DESC | ✅ Yes | ✅ Yes (NULLS FIRST/LAST) | Best for SQL users |
Summary
Sorting PySpark DataFrames by multiple columns is straightforward with the right approach:
- orderBy() and sort() are identical - use whichever you prefer.
- Pass a list of column names with ascending=True or ascending=False for a uniform sort direction.
- Pass a list of booleans to ascending for mixed sort directions (e.g., ascending=[True, False]).
- Use col().asc() and col().desc() for the most explicit and readable sorting, with support for null ordering via asc_nulls_first(), asc_nulls_last(), desc_nulls_first(), and desc_nulls_last().
- Always ensure the ascending list length matches the number of columns to avoid unexpected behavior.