Python PySpark: How to Order by Multiple Columns in PySpark

Sorting data is one of the most common operations when working with DataFrames. In PySpark, you frequently need to sort by multiple columns - for example, sorting employees first by department and then by name within each department. PySpark provides two primary methods for this: orderBy() and sort(), both of which support multi-column sorting with flexible ascending and descending options.

In this guide, you will learn how to sort PySpark DataFrames by multiple columns using both methods, including how to apply different sort directions per column - a common real-world requirement.

Setting Up the Example DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorting_demo").getOrCreate()

data = [
    ["1", "James", "Company 1", 45000],
    ["2", "Emily", "Company 1", 52000],
    ["3", "Michael", "Company 2", 45000],
    ["4", "Sarah", "Company 1", 60000],
    ["5", "David", "Company 1", 48000],
    ["6", "Michael", "Company 1", 39000],
]

columns = ["ID", "Name", "Company", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 3|Michael|Company 2| 45000|
| 4| Sarah|Company 1| 60000|
| 5| David|Company 1| 48000|
| 6|Michael|Company 1| 39000|
+---+-------+---------+------+

Method 1: Using orderBy()

The orderBy() method sorts the DataFrame by one or more columns. You can control the sort direction using the ascending parameter.

Ascending Order (Default)

df.orderBy(["Name", "ID"], ascending=True).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 5| David|Company 1| 48000|
| 2| Emily|Company 1| 52000|
| 1| James|Company 1| 45000|
| 3|Michael|Company 2| 45000|
| 6|Michael|Company 1| 39000|
| 4| Sarah|Company 1| 60000|
+---+-------+---------+------+

The DataFrame is sorted by Name first (alphabetically), and where names are the same (Michael appears twice), it sorts by ID.

Descending Order

df.orderBy(["Name", "ID"], ascending=False).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
+---+-------+---------+------+

Method 2: Using sort()

The sort() method works identically to orderBy(). They are aliases of each other - you can use whichever reads more naturally in your code.

Ascending Order

df.sort(["Name", "ID"], ascending=True).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 5| David|Company 1| 48000|
| 2| Emily|Company 1| 52000|
| 1| James|Company 1| 45000|
| 3|Michael|Company 2| 45000|
| 6|Michael|Company 1| 39000|
| 4| Sarah|Company 1| 60000|
+---+-------+---------+------+

Descending Order

df.sort(["Name", "ID"], ascending=False).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
| 1| James|Company 1| 45000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
+---+-------+---------+------+
info

sort() and orderBy() are functionally identical in PySpark. Use whichever you prefer - orderBy() is more common in SQL-influenced codebases, while sort() is more concise.

Mixed Sort Directions per Column

A common real-world requirement is sorting by one column in ascending order and another in descending order - for example, sorting by Company ascending and Salary descending to see the highest earners at the top within each company.

Using a List of Booleans

Pass a list of boolean values to ascending, matching the order of columns:

# Company ascending, Salary descending
df.orderBy(["Company", "Salary"], ascending=[True, False]).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+

Within Company 1, employees are sorted from highest to lowest salary.

Using col() with .asc() and .desc()

For more explicit control, use col() with .asc() or .desc() methods:

from pyspark.sql.functions import col

df.orderBy(col("Company").asc(), col("Salary").desc()).show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+
tip

The col().asc() / col().desc() syntax is the most readable approach, especially when mixing sort directions across many columns. It also supports null ordering:

from pyspark.sql.functions import col

# Nulls first in ascending, nulls last in descending
df.orderBy(
    col("Company").asc_nulls_first(),
    col("Salary").desc_nulls_last()
).show()

Handling Null Values in Sorting

By default, PySpark places null values first in ascending order and last in descending order. You can control this behavior explicitly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [
    ["1", "James", 45000],
    ["2", "Emily", None],
    ["3", "Michael", 52000],
    ["4", "David", None],
]

df_nulls = spark.createDataFrame(data, ["ID", "Name", "Salary"])

# Default behavior: nulls first in ascending
print("Default ascending (nulls first):")
df_nulls.orderBy(col("Salary").asc()).show()

# Nulls last in ascending
print("Ascending with nulls last:")
df_nulls.orderBy(col("Salary").asc_nulls_last()).show()

Output:

Default ascending (nulls first):
+---+-------+------+
| ID| Name|Salary|
+---+-------+------+
| 2| Emily| null|
| 4| David| null|
| 1| James| 45000|
| 3|Michael| 52000|
+---+-------+------+

Ascending with nulls last:
+---+-------+------+
| ID| Name|Salary|
+---+-------+------+
| 1| James| 45000|
| 3|Michael| 52000|
| 2| Emily| null|
| 4| David| null|
+---+-------+------+

Sorting with SQL Queries

You can also sort using SQL syntax after registering the DataFrame as a temporary view:

df.createOrReplaceTempView("employees")

spark.sql("""
SELECT * FROM employees
ORDER BY Company ASC, Salary DESC, Name ASC
""").show()

Output:

+---+-------+---------+------+
| ID| Name| Company|Salary|
+---+-------+---------+------+
| 4| Sarah|Company 1| 60000|
| 2| Emily|Company 1| 52000|
| 5| David|Company 1| 48000|
| 1| James|Company 1| 45000|
| 6|Michael|Company 1| 39000|
| 3|Michael|Company 2| 45000|
+---+-------+---------+------+

Common Mistake: Mismatched ascending List Length

When passing a list to the ascending parameter, it must have the same number of elements as the column list. A mismatch can produce unexpected results or errors.

Wrong: 3 columns but only 2 ascending values

# This may raise an error or produce unexpected behavior
df.orderBy(["Name", "Company", "Salary"], ascending=[True, False]).show()

Correct: match the number of columns

df.orderBy(
    ["Name", "Company", "Salary"],
    ascending=[True, False, True]
).show()
caution

Always ensure the ascending list has the same number of elements as the columns list. For better safety and readability, prefer the col().asc() / col().desc() syntax, which makes the direction explicit for each column.

Comparison of Approaches

| Approach | Syntax | Mixed Directions | Null Control | Readability |
| --- | --- | --- | --- | --- |
| orderBy() with ascending=True/False | orderBy(["A", "B"], ascending=True) | ❌ No | ❌ No | Good |
| orderBy() with ascending list | orderBy(["A", "B"], ascending=[True, False]) | ✅ Yes | ❌ No | Moderate |
| col().asc() / col().desc() | orderBy(col("A").asc(), col("B").desc()) | ✅ Yes | ✅ Yes | Best |
| SQL ORDER BY | ORDER BY A ASC, B DESC | ✅ Yes | ✅ Yes (NULLS FIRST/LAST) | Best for SQL users |

Summary

Sorting PySpark DataFrames by multiple columns is straightforward with the right approach:

  • orderBy() and sort() are identical - use whichever you prefer.
  • Pass a list of column names with ascending=True or ascending=False for uniform sort direction.
  • Pass a list of booleans to ascending for mixed sort directions (e.g., ascending=[True, False]).
  • Use col().asc() and col().desc() for the most explicit and readable sorting, with support for null ordering via asc_nulls_first(), asc_nulls_last(), desc_nulls_first(), and desc_nulls_last().
  • Always ensure the ascending list length matches the number of columns to avoid unexpected behavior.