Python Pandas: How to Sort a Pandas DataFrame by Multiple Columns
Sorting a DataFrame by multiple columns is a fundamental operation in data analysis. It allows you to organize data hierarchically - for example, sorting employees first by department and then by salary within each department. Pandas offers several methods to accomplish this, from the versatile sort_values() to specialized functions like nlargest() and nsmallest().
This guide covers each approach with practical examples, outputs, and guidance on when to use which method.
Sample DataFrame
All examples use the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
print(df)
Output:
Name Age Rank
0 Ryan 20 1.0
1 Aiden 22 NaN
2 Noah 21 8.0
3 Caleb 19 9.0
4 Emma 17 4.0
5 Chloe 23 NaN
Using sort_values() - The Most Flexible Method
The sort_values() method is the standard way to sort a DataFrame by one or more columns. It supports independent ascending/descending order per column and configurable NaN placement.
Sorting by Multiple Columns
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Sort by 'Rank' ascending, then by 'Age' descending for ties
result = df.sort_values(
by=['Rank', 'Age'],
ascending=[True, False],
na_position='last'
)
print(result)
Output:
Name Age Rank
0 Ryan 20 1.0
4 Emma 17 4.0
2 Noah 21 8.0
3 Caleb 19 9.0
5 Chloe 23 NaN
1 Aiden 22 NaN
Key parameters:
| Parameter | Description |
|---|---|
by | List of column names to sort by, in priority order |
ascending | List of booleans - True for ascending, False for descending - one per column |
na_position | 'last' (default) or 'first' - where to place NaN values |
ignore_index | If True, resets the index to 0, 1, 2, ... after sorting |
Sorting with Reset Index
When you want a clean index after sorting:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
result = df.sort_values(
by=['Rank', 'Age'],
ascending=[True, False],
na_position='last',
ignore_index=True
)
print(result)
Output:
Name Age Rank
0 Ryan 20 1.0
1 Emma 17 4.0
2 Noah 21 8.0
3 Caleb 19 9.0
4 Chloe 23 NaN
5 Aiden 22 NaN
sort_values() is the go-to method for most sorting tasks. It handles multiple columns, mixed sort directions, and NaN values cleanly in a single call.
Using nlargest() - Top N Rows by Value
The nlargest() method is optimized to retrieve the top N rows with the highest values in specified columns. It is faster than sorting the entire DataFrame when you only need a few top results.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Get the 3 rows with the highest 'Rank' values
top_3 = df.nlargest(3, 'Rank')
print(top_3)
Output:
Name Age Rank
3 Caleb 19 9.0
2 Noah 21 8.0
4 Emma 17 4.0
NaN values are automatically ignored. You can also sort by multiple columns - nlargest uses the second column as a tiebreaker:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Top 3 by Rank, then by Age for ties
top_3 = df.nlargest(3, ['Rank', 'Age'])
print(top_3)
Output:
Name Age Rank
3 Caleb 19 9.0
2 Noah 21 8.0
4 Emma 17 4.0
Using nsmallest() - Bottom N Rows by Value
The counterpart to nlargest(), this method retrieves the lowest N values:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Get the 3 rows with the lowest 'Rank' values
bottom_3 = df.nsmallest(3, 'Rank')
print(bottom_3)
Output:
Name Age Rank
0 Ryan 20 1.0
4 Emma 17 4.0
2 Noah 21 8.0
nlargest() and nsmallest() are particularly efficient for large DataFrames when you only need a small subset of top/bottom rows. They use a partial sort algorithm that avoids sorting the entire dataset.
Using sort_index()
The sort_index() method sorts by the DataFrame's index rather than column values. This is useful after operations that disorder the index, such as filtering, grouping, or setting a custom index.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Sort by index in descending order
result = df.sort_index(ascending=False)
print(result)
Output:
Name Age Rank
5 Chloe 23 NaN
4 Emma 17 4.0
3 Caleb 19 9.0
2 Noah 21 8.0
1 Aiden 22 NaN
0 Ryan 20 1.0
Sorting by a MultiIndex
sort_index() becomes powerful with a MultiIndex:
import pandas as pd
df = pd.DataFrame({
'Score': [88, 92, 75, 81, 95, 70]
}, index=pd.MultiIndex.from_tuples([
('Math', 'Alice'), ('Math', 'Bob'),
('Science', 'Alice'), ('Science', 'Bob'),
('English', 'Alice'), ('English', 'Bob')
], names=['Subject', 'Student']))
result = df.sort_index(level=['Subject', 'Student'], ascending=[True, False])
print(result)
Output:
Score
Subject Student
English Bob 70
Alice 95
Math Bob 92
Alice 88
Science Bob 81
Alice 75
Using argsort() with NumPy
For performance-critical scenarios, np.argsort() computes sorted indices directly on the underlying NumPy array, which can be faster than Pandas sorting for very large datasets:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Ryan', 'Aiden', 'Noah', 'Caleb', 'Emma', 'Chloe'],
'Age': [20, 22, 21, 19, 17, 23],
'Rank': [1, np.nan, 8, 9, 4, np.nan]
})
# Get sorted indices for 'Rank' column
sorted_idx = np.argsort(df['Rank'].values)
result = df.iloc[sorted_idx]
print(result)
Output:
Name Age Rank
0 Ryan 20 1.0
4 Emma 17 4.0
2 Noah 21 8.0
3 Caleb 19 9.0
1 Aiden 22 NaN
5 Chloe 23 NaN
np.argsort() sorts by a single column only. For multi-column sorting, sort_values() is far more practical. Also, np.argsort() places NaN values at the end by default, but this behavior is not configurable - use sort_values(na_position=...) if you need control over NaN placement.
Common Mistake: Mismatched ascending List Length
When sorting by multiple columns, the ascending parameter must have the same length as the by parameter. Passing a single boolean applies it to all columns, which may not be intended:
import pandas as pd
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'Engineering', 'Engineering'],
'Salary': [50000, 60000, 70000, 55000]
})
# WRONG: single boolean applies to ALL columns
# This sorts both Department and Salary in descending order
result = df.sort_values(by=['Department', 'Salary'], ascending=False)
print(result)
Output:
Department Salary
1 Sales 60000
0 Sales 50000
2 Engineering 70000
3 Engineering 55000
If the intent was to sort by Department ascending and Salary descending, the result above is incorrect.
The correct approach:
import pandas as pd
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'Engineering', 'Engineering'],
'Salary': [50000, 60000, 70000, 55000]
})
# CORRECT: specify ascending/descending for each column individually
result = df.sort_values(by=['Department', 'Salary'], ascending=[True, False])
print(result)
Output:
Department Salary
2 Engineering 70000
3 Engineering 55000
1 Sales 60000
0 Sales 50000
Method Comparison
| Method | Sorts By | Multi-Column | NaN Handling | Best For |
|---|---|---|---|---|
sort_values() | Column values | Yes | Configurable (na_position) | General-purpose sorting - recommended |
nlargest() | Column values (descending) | Yes (tiebreaker) | Ignores NaN | Getting top N rows efficiently |
nsmallest() | Column values (ascending) | Yes (tiebreaker) | Ignores NaN | Getting bottom N rows efficiently |
sort_index() | Index values | Yes (MultiIndex) | N/A | Reordering by row index |
np.argsort() | Single column (NumPy) | No | Pushes to end | Performance-critical, single-column sorting |
For most multi-column sorting tasks, sort_values() is the recommended method. It offers the most flexibility with independent sort directions, configurable NaN placement, and clean syntax for any number of columns.