Python Pandas: How to Select a Subset of a DataFrame in Pandas
Selecting subsets of data is one of the most fundamental operations in data analysis. Whether you need to extract specific columns for a report, filter rows based on conditions, or isolate a precise block of data, Pandas provides powerful and flexible tools to accomplish this.
In this guide, you'll learn how to select subsets by columns, rows, or both, using methods like bracket notation, loc[], iloc[], and conditional filtering.
Setting Up the Sample Data
All examples use a sample dataset. You can follow along with any CSV file or create one:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Team Salary Experience
0 Alice 28 Lakers 8500000 4
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7
4 Eve 26 Celtics 7000000 3
Selecting Columns
Selecting a Single Column
Use bracket notation with the column name to extract a single column as a Series:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
ages = df['Age']
print(ages)
print(f"\nType: {type(ages)}")
Output:
0 28
1 35
2 42
3 31
4 26
Name: Age, dtype: int64
Type: <class 'pandas.core.series.Series'>
Selecting Multiple Columns
Pass a list of column names inside brackets to select multiple columns. The result is a DataFrame:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
subset = df[['Name', 'Age', 'Salary']]
print(subset)
print(f"\nType: {type(subset)}")
Output:
Name Age Salary
0 Alice 28 8500000
1 Bob 35 12000000
2 Charlie 42 15000000
3 Diana 31 9500000
4 Eve 26 7000000
Type: <class 'pandas.core.frame.DataFrame'>
Using single brackets with a column name returns a Series, while double brackets return a DataFrame (even for one column):
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Series (1D)
series = df['Name']
print(type(series)) # <class 'pandas.core.series.Series'>
# DataFrame (2D)
dataframe = df[['Name']]
print(type(dataframe)) # <class 'pandas.core.frame.DataFrame'>
This distinction matters when chaining operations that expect a DataFrame input.
Selecting Rows by Condition
Single Condition
Use boolean indexing to filter rows where a condition is true:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select players older than 30
result = df[df['Age'] > 30]
print(result)
Output:
Name Age Team Salary Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7
Multiple Conditions
Combine conditions using & (AND) and | (OR). Each condition must be wrapped in parentheses:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Players older than 30 AND salary above 10 million
result = df[(df['Age'] > 30) & (df['Salary'] > 10000000)]
print(result)
Output:
Name Age Team Salary Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Players on the Lakers OR younger than 28
result = df[(df['Team'] == 'Lakers') | (df['Age'] < 28)]
print(result)
Output:
Name Age Team Salary Experience
0 Alice 28 Lakers 8500000 4
2 Charlie 42 Lakers 15000000 18
4 Eve 26 Celtics 7000000 3
Using isin() for Multiple Values
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select players on specific teams
teams = ['Lakers', 'Heat']
result = df[df['Team'].isin(teams)]
print(result)
Output:
Name Age Team Salary Experience
0 Alice 28 Lakers 8500000 4
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7
Selecting Rows and Columns Together
Using loc[] (Label-Based Selection)
loc[] selects data by labels - column names and row conditions or index labels:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select Name column for players older than 30
result = df.loc[df['Age'] > 30, 'Name']
print(result)
Output:
1 Bob
2 Charlie
3 Diana
Name: Name, dtype: object
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select Name and Salary for Lakers players
result = df.loc[df['Team'] == 'Lakers', ['Name', 'Salary']]
print(result)
Output:
Name Salary
0 Alice 8500000
2 Charlie 15000000
Using iloc[] (Position-Based Selection)
iloc[] selects data by integer positions - row and column indices:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select first 3 rows and first 2 columns
result = df.iloc[:3, :2]
print(result)
Output:
Name Age
0 Alice 28
1 Bob 35
2 Charlie 42
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Select specific rows and columns by position
result = df.iloc[[0, 2, 4], [0, 3]]
print(result)
Output:
Name Salary
0 Alice 8500000
2 Charlie 15000000
4 Eve 7000000
loc[] vs iloc[]| Feature | loc[] | iloc[] |
|---|---|---|
| Selection by | Labels (names) | Integer positions |
| Row selection | Condition or index label | Row index number |
| Column selection | Column name | Column index number |
| End of slice | Inclusive | Exclusive |
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# loc: end is INCLUSIVE
print(df.loc[0:2, 'Name':'Team']) # Rows 0,1,2 (Columns Name, Age, Team)
# iloc: end is EXCLUSIVE
print(df.iloc[0:2, 0:3]) # Rows 0,1 (Columns 0, 1, 2)
Output:
Name Age Team
0 Alice 28 Lakers
1 Bob 35 Celtics
2 Charlie 42 Lakers
Name Age Team
0 Alice 28 Lakers
1 Bob 35 Celtics
Quick Subset Methods
head() - First N Rows
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# First 3 rows (default is 5)
print(df.head(3))
Output:
Name Age Team Salary Experience
0 Alice 28 Lakers 8500000 4
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
tail() - Last N Rows
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Last 2 rows
print(df.tail(2))
Output:
Name Age Team Salary Experience
3 Diana 31 Heat 9500000 7
4 Eve 26 Celtics 7000000 3
sample() - Random Rows
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# 3 random rows (set random_state for reproducibility)
print(df.sample(3, random_state=42))
Output:
Name Age Team Salary Experience
1 Bob 35 Celtics 12000000 10
4 Eve 26 Celtics 7000000 3
2 Charlie 42 Lakers 15000000 18
nlargest() and nsmallest() - Top/Bottom by Value
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Top 3 highest salaries
print(df.nlargest(3, 'Salary'))
Output:
Name Age Team Salary Experience
2 Charlie 42 Lakers 15000000 18
1 Bob 35 Celtics 12000000 10
3 Diana 31 Heat 9500000 7
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# 2 youngest players
print(df.nsmallest(2, 'Age'))
Output:
Name Age Team Salary Experience
4 Eve 26 Celtics 7000000 3
0 Alice 28 Lakers 8500000 4
Using query() for Readable Filtering
The query() method lets you write filter conditions as strings, which can be more readable for complex selections:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
# Equivalent to: df[(df['Age'] > 30) & (df['Team'] == 'Lakers')]
result = df.query('Age > 30 and Team == "Lakers"')
print(result)
Output:
Name Age Team Salary Experience
2 Charlie 42 Lakers 15000000 18
You can reference variables with @:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}
df = pd.DataFrame(data)
min_salary = 9000000
result = df.query('Salary >= @min_salary')
print(result)
Output:
Name Age Team Salary Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7
Summary
| Task | Method | Example |
|---|---|---|
| Select one column | df['col'] | df['Age'] |
| Select multiple columns | df[['col1', 'col2']] | df[['Name', 'Age']] |
| Filter rows by condition | df[condition] | df[df['Age'] > 30] |
| Multiple conditions | df[(cond1) & (cond2)] | df[(df['Age'] > 30) & (df['Salary'] > 1e7)] |
| Rows + columns by label | df.loc[condition, cols] | df.loc[df['Age'] > 30, ['Name', 'Salary']] |
| Rows + columns by position | df.iloc[rows, cols] | df.iloc[:3, :2] |
| Match from a list | df[df['col'].isin(list)] | df[df['Team'].isin(['Lakers', 'Heat'])] |
| String-based query | df.query('expression') | df.query('Age > 30') |
| First N rows | df.head(n) | df.head(10) |
| Last N rows | df.tail(n) | df.tail(5) |
| Random rows | df.sample(n) | df.sample(3) |
| Top N by value | df.nlargest(n, 'col') | df.nlargest(3, 'Salary') |
Pandas offers multiple ways to select subsets of a DataFrame.
- Use bracket notation for quick column or row filtering.
- Use
loc[]for label-based selection combining rows and columns - Use
iloc[]for integer-position-based selection - Use
query()for readable string-based filtering.
Choose the method that makes your code clearest for the task at hand.