Python Pandas: How to Use Hierarchical Indexes with Pandas
Hierarchical indexing (also called multi-indexing) allows you to set multiple columns as the index of a DataFrame, creating a tree-like structure of labels. This is powerful for organizing complex datasets, such as data grouped by region and state, or by year and quarter, enabling efficient slicing, aggregation, and restructuring.
This guide explains how to create, navigate, and manipulate hierarchical indexes in Pandas.
What Is a Hierarchical Index?
A standard DataFrame has a single-level index. A hierarchical index has two or more levels, where each level represents a different category:
Standard Index: Hierarchical Index:
Level 0 Level 1
Index Value Region State Value
0 100 East New York 100
1 200 East New Jersey 200
2 300 West California 300
3 400 West Oregon 400
The outer level (Region) groups data, and the inner level (State) provides specifics within each group.
Creating a Hierarchical Index
Using set_index() with Multiple Columns
Pass a list of column names to set_index() to create a multi-level index:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'West', 'West', 'South', 'South'],
'State': ['New York', 'New Jersey', 'California', 'Oregon', 'Texas', 'Florida'],
'Population': [19500000, 9300000, 39500000, 4200000, 29100000, 21500000],
'GDP': [1900, 650, 3600, 270, 1900, 1100]
})
print("Original DataFrame:")
print(df)
# Set Region and State as hierarchical index
df_multi = df.set_index(['Region', 'State'])
print("\nWith Hierarchical Index:")
print(df_multi)
Output:
Original DataFrame:
Region State Population GDP
0 East New York 19500000 1900
1 East New Jersey 9300000 650
2 West California 39500000 3600
3 West Oregon 4200000 270
4 South Texas 29100000 1900
5 South Florida 21500000 1100
With Hierarchical Index:
Population GDP
Region State
East New York 19500000 1900
New Jersey 9300000 650
West California 39500000 3600
Oregon 4200000 270
South Texas 29100000 1900
Florida 21500000 1100
The Region column becomes level 0 and State becomes level 1 of the index.
Sorting the Hierarchical Index
For efficient slicing, always sort a multi-index after creating it:
df_multi = df_multi.sort_index()
print(df_multi)
Output:
Population GDP
Region State
East New Jersey 9300000 650
New York 19500000 1900
South Florida 21500000 1100
Texas 29100000 1900
West California 39500000 3600
Oregon 4200000 270
Always call sort_index() after setting a hierarchical index. Many loc[] operations require a sorted index to work correctly and perform efficiently. An unsorted multi-index may raise a KeyError or produce unexpected results.
Selecting Data with a Hierarchical Index
Selecting by the Outer Level (Level 0)
Use loc[] with the outer level value to get all rows in that group:
# Select all states in the 'East' region
east = df_multi.loc['East']
print(east)
Output:
Population GDP
State
New Jersey 9300000 650
New York 19500000 1900
The result drops the outer index level and uses the inner level (State) as the index.
Selecting Multiple Outer Level Values
Pass a list to select multiple groups:
# Select East and West regions
east_west = df_multi.loc[['East', 'West']]
print(east_west)
Output:
Population GDP
Region State
East New Jersey 9300000 650
New York 19500000 1900
West California 39500000 3600
Oregon 4200000 270
Selecting by Both Levels Using Tuples
To select specific combinations of outer and inner index values, use tuples:
# Select specific Region-State combinations
selected = df_multi.loc[('East', 'New York')]
print(selected)
Output:
Population 19500000
GDP 1900
Name: (East, New York), dtype: int64
For multiple specific combinations, pass a list of tuples:
selected = df_multi.loc[[('East', 'New York'), ('West', 'California')]]
print(selected)
Output:
Region State
East New York 19500000 1900
West California 39500000 3600
Common Mistake: Selecting by Inner Level Only
You cannot directly use loc[] with only the inner level value:
# WRONG: selecting by inner level directly raises KeyError
try:
result = df_multi.loc['New York']
except KeyError as e:
print(f"KeyError: {e}")
Output:
KeyError: 'New York'
The correct approaches:
# Option 1: Use xs() to select by an inner level
result = df_multi.xs('New York', level='State')
print("Using xs():")
print(result)
# Option 2: Use a tuple with the outer level
result = df_multi.loc[('East', 'New York')]
print("\nUsing tuple:")
print(result)
Output:
Using xs():
Population GDP
Region
East 19500000 1900
Using tuple:
Population 19500000
GDP 1900
Name: (East, New York), dtype: int64
You cannot use loc[] with only an inner-level value. Use xs(value, level='level_name') to select data from any specific level, or provide values for all outer levels using tuples.
Using xs() for Cross-Section Selection
The xs() method is designed for selecting data at a specific level of a multi-index:
import pandas as pd
df = pd.DataFrame({
'Year': [2023, 2023, 2024, 2024],
'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'Revenue': [100, 150, 120, 180],
'Profit': [20, 35, 25, 45]
})
df_multi = df.set_index(['Year', 'Quarter']).sort_index()
print(df_multi)
# Select all Q1 data across years
q1_data = df_multi.xs('Q1', level='Quarter')
print("\nAll Q1 data:")
print(q1_data)
Output:
Revenue Profit
Year Quarter
2023 Q1 100 20
Q2 150 35
2024 Q1 120 25
Q2 180 45
All Q1 data:
Revenue Profit
Year
2023 100 20
2024 120 25
Resetting the Index
To convert the hierarchical index back to regular columns, use reset_index():
# Reset all index levels to columns
df_flat = df_multi.reset_index()
print(df_flat)
Output:
Year Quarter Revenue Profit
0 2023 Q1 100 20
1 2023 Q2 150 35
2 2024 Q1 120 25
3 2024 Q2 180 45
To reset only a specific level:
# Reset only the 'Quarter' level
df_partial = df_multi.reset_index(level='Quarter')
print(df_partial)
Output:
Quarter Revenue Profit
Year
2023 Q1 100 20
2023 Q2 150 35
2024 Q1 120 25
2024 Q2 180 45
Aggregating with Hierarchical Indexes
Multi-indexes work naturally with groupby() for hierarchical aggregation:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
'State': ['NY', 'NJ', 'CA', 'OR', 'NY', 'CA'],
'Sales': [100, 80, 200, 50, 120, 180]
})
# Group by multiple levels and aggregate
summary = df.groupby(['Region', 'State'])['Sales'].agg(['sum', 'mean', 'count'])
print(summary)
Output:
sum mean count
Region State
East NJ 80 80.0 1
NY 220 110.0 2
West CA 380 190.0 2
OR 50 50.0 1
The result automatically has a hierarchical index.
Swapping Index Levels
Use swaplevel() to change the order of index levels:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'West', 'West'],
'State': ['NY', 'NJ', 'CA', 'OR'],
'Population': [19500000, 9300000, 39500000, 4200000]
})
df_multi = df.set_index(['Region', 'State']).sort_index()
print("Original order:")
print(df_multi)
# Swap levels: State becomes level 0, Region becomes level 1
swapped = df_multi.swaplevel().sort_index()
print("\nSwapped levels:")
print(swapped)
Output:
Original order:
Population
Region State
East NJ 9300000
NY 19500000
West CA 39500000
OR 4200000
Swapped levels:
Population
State Region
CA West 39500000
NJ East 9300000
NY East 19500000
OR West 4200000
Quick Reference
| Operation | Code |
|---|---|
| Create hierarchical index | df.set_index(['col1', 'col2']) |
| Sort the multi-index | df.sort_index() |
| Select by outer level | df.loc['value'] |
| Select by both levels (tuple) | df.loc[('outer', 'inner')] |
| Select by inner level | df.xs('value', level='level_name') |
| Reset all levels to columns | df.reset_index() |
| Reset one level | df.reset_index(level='level_name') |
| Swap index levels | df.swaplevel() |
| Get index level names | df.index.names |
| Get index level values | df.index.get_level_values('level_name') |
Hierarchical indexing transforms flat DataFrames into structured, multi-dimensional data that mirrors real-world hierarchies. By mastering set_index(), loc[] with tuples, and xs(), you can efficiently organize, slice, and aggregate complex datasets in Pandas.