Skip to main content

Python Pandas: How to Select a Subset of a DataFrame in Pandas

Selecting subsets of data is one of the most fundamental operations in data analysis. Whether you need to extract specific columns for a report, filter rows based on conditions, or isolate a precise block of data, Pandas provides powerful and flexible tools to accomplish this.

In this guide, you'll learn how to select subsets by columns, rows, or both, using methods like bracket notation, loc[], iloc[], and conditional filtering.

Setting Up the Sample Data

All examples use a sample dataset. You can follow along with any CSV file or create one:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age     Team    Salary  Experience
0 Alice 28 Lakers 8500000 4
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7
4 Eve 26 Celtics 7000000 3

Selecting Columns

Selecting a Single Column

Use bracket notation with the column name to extract a single column as a Series:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

ages = df['Age']
print(ages)
print(f"\nType: {type(ages)}")

Output:

0    28
1 35
2 42
3 31
4 26
Name: Age, dtype: int64

Type: <class 'pandas.core.series.Series'>

Selecting Multiple Columns

Pass a list of column names inside brackets to select multiple columns. The result is a DataFrame:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

subset = df[['Name', 'Age', 'Salary']]
print(subset)
print(f"\nType: {type(subset)}")

Output:

      Name  Age    Salary
0 Alice 28 8500000
1 Bob 35 12000000
2 Charlie 42 15000000
3 Diana 31 9500000
4 Eve 26 7000000

Type: <class 'pandas.core.frame.DataFrame'>
Common Mistake: Single vs Double Brackets

Using single brackets with a column name returns a Series, while double brackets return a DataFrame (even for one column):

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Series (1D)
series = df['Name']
print(type(series)) # <class 'pandas.core.series.Series'>

# DataFrame (2D)
dataframe = df[['Name']]
print(type(dataframe)) # <class 'pandas.core.frame.DataFrame'>

This distinction matters when chaining operations that expect a DataFrame input.

Selecting Rows by Condition

Single Condition

Use boolean indexing to filter rows where a condition is true:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Select players older than 30
result = df[df['Age'] > 30]
print(result)

Output:

      Name  Age     Team    Salary  Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7

Multiple Conditions

Combine conditions using & (AND) and | (OR). Each condition must be wrapped in parentheses:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Players older than 30 AND salary above 10 million
result = df[(df['Age'] > 30) & (df['Salary'] > 10000000)]
print(result)

Output:

      Name  Age     Team    Salary  Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Players on the Lakers OR younger than 28
result = df[(df['Team'] == 'Lakers') | (df['Age'] < 28)]
print(result)

Output:

      Name  Age     Team    Salary  Experience
0 Alice 28 Lakers 8500000 4
2 Charlie 42 Lakers 15000000 18
4 Eve 26 Celtics 7000000 3

Using isin() for Multiple Values

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Select players on specific teams
teams = ['Lakers', 'Heat']
result = df[df['Team'].isin(teams)]
print(result)

Output:

      Name  Age    Team    Salary  Experience
0 Alice 28 Lakers 8500000 4
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7

Selecting Rows and Columns Together

Using loc[] (Label-Based Selection)

loc[] selects data by labels - column names and row conditions or index labels:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Select Name column for players older than 30
result = df.loc[df['Age'] > 30, 'Name']
print(result)

Output:

1        Bob
2 Charlie
3 Diana
Name: Name, dtype: object
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)


# Select Name and Salary for Lakers players
result = df.loc[df['Team'] == 'Lakers', ['Name', 'Salary']]
print(result)

Output:

      Name    Salary
0 Alice 8500000
2 Charlie 15000000

Using iloc[] (Position-Based Selection)

iloc[] selects data by integer positions - row and column indices:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Select first 3 rows and first 2 columns
result = df.iloc[:3, :2]
print(result)

Output:

      Name  Age
0 Alice 28
1 Bob 35
2 Charlie 42
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)


# Select specific rows and columns by position
result = df.iloc[[0, 2, 4], [0, 3]]
print(result)

Output:

      Name    Salary
0 Alice 8500000
2 Charlie 15000000
4 Eve 7000000
loc[] vs iloc[]
Featureloc[]iloc[]
Selection byLabels (names)Integer positions
Row selectionCondition or index labelRow index number
Column selectionColumn nameColumn index number
End of sliceInclusiveExclusive
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# loc: end is INCLUSIVE
print(df.loc[0:2, 'Name':'Team']) # Rows 0,1,2 (Columns Name, Age, Team)

# iloc: end is EXCLUSIVE
print(df.iloc[0:2, 0:3]) # Rows 0,1 (Columns 0, 1, 2)

Output:

      Name  Age     Team
0 Alice 28 Lakers
1 Bob 35 Celtics
2 Charlie 42 Lakers
Name Age Team
0 Alice 28 Lakers
1 Bob 35 Celtics

Quick Subset Methods

head() - First N Rows

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# First 3 rows (default is 5)
print(df.head(3))

Output:

      Name  Age     Team    Salary  Experience
0 Alice 28 Lakers 8500000 4
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18

tail() - Last N Rows

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Last 2 rows
print(df.tail(2))

Output:

    Name  Age     Team   Salary  Experience
3 Diana 31 Heat 9500000 7
4 Eve 26 Celtics 7000000 3

sample() - Random Rows

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# 3 random rows (set random_state for reproducibility)
print(df.sample(3, random_state=42))

Output:

      Name  Age     Team    Salary  Experience
1 Bob 35 Celtics 12000000 10
4 Eve 26 Celtics 7000000 3
2 Charlie 42 Lakers 15000000 18

nlargest() and nsmallest() - Top/Bottom by Value

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# Top 3 highest salaries
print(df.nlargest(3, 'Salary'))

Output:

      Name  Age     Team    Salary  Experience
2 Charlie 42 Lakers 15000000 18
1 Bob 35 Celtics 12000000 10
3 Diana 31 Heat 9500000 7
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

# 2 youngest players
print(df.nsmallest(2, 'Age'))

Output:

    Name  Age     Team   Salary  Experience
4 Eve 26 Celtics 7000000 3
0 Alice 28 Lakers 8500000 4

Using query() for Readable Filtering

The query() method lets you write filter conditions as strings, which can be more readable for complex selections:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)


# Equivalent to: df[(df['Age'] > 30) & (df['Team'] == 'Lakers')]
result = df.query('Age > 30 and Team == "Lakers"')
print(result)

Output:

      Name  Age    Team    Salary  Experience
2 Charlie 42 Lakers 15000000 18

You can reference variables with @:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [28, 35, 42, 31, 26],
'Team': ['Lakers', 'Celtics', 'Lakers', 'Heat', 'Celtics'],
'Salary': [8500000, 12000000, 15000000, 9500000, 7000000],
'Experience': [4, 10, 18, 7, 3]
}

df = pd.DataFrame(data)

min_salary = 9000000
result = df.query('Salary >= @min_salary')
print(result)

Output:

      Name  Age     Team    Salary  Experience
1 Bob 35 Celtics 12000000 10
2 Charlie 42 Lakers 15000000 18
3 Diana 31 Heat 9500000 7

Summary

TaskMethodExample
Select one columndf['col']df['Age']
Select multiple columnsdf[['col1', 'col2']]df[['Name', 'Age']]
Filter rows by conditiondf[condition]df[df['Age'] > 30]
Multiple conditionsdf[(cond1) & (cond2)]df[(df['Age'] > 30) & (df['Salary'] > 1e7)]
Rows + columns by labeldf.loc[condition, cols]df.loc[df['Age'] > 30, ['Name', 'Salary']]
Rows + columns by positiondf.iloc[rows, cols]df.iloc[:3, :2]
Match from a listdf[df['col'].isin(list)]df[df['Team'].isin(['Lakers', 'Heat'])]
String-based querydf.query('expression')df.query('Age > 30')
First N rowsdf.head(n)df.head(10)
Last N rowsdf.tail(n)df.tail(5)
Random rowsdf.sample(n)df.sample(3)
Top N by valuedf.nlargest(n, 'col')df.nlargest(3, 'Salary')

Pandas offers multiple ways to select subsets of a DataFrame.

  • Use bracket notation for quick column or row filtering.
  • Use loc[] for label-based selection combining rows and columns
  • Use iloc[] for integer-position-based selection
  • Use query() for readable string-based filtering.

Choose the method that makes your code clearest for the task at hand.