
Python Pandas: How to Classify and Grade Data with Pandas

Assigning categories to rows based on numerical ranges or complex conditions is a common data transformation. Pandas and NumPy provide efficient, vectorized methods that outperform row-by-row iteration.

Using pd.cut(): Numeric Binning

Best for converting continuous numbers into discrete categories:

import pandas as pd

df = pd.DataFrame({'score': [55, 92, 78, 40, 85, 100]})

# Define bin edges and labels
bins = [0, 60, 80, 100]
labels = ['Fail', 'Pass', 'Excellent']

df['grade'] = pd.cut(df['score'], bins=bins, labels=labels)
print(df)

Output:

   score      grade
0     55       Fail
1     92  Excellent
2     78       Pass
3     40       Fail
4     85  Excellent
5    100  Excellent
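
Values that fall outside the outermost bin edges receive NaN rather than a label. The following is a minimal sketch of that behavior, assuming a few hypothetical out-of-range scores (-5 and 120) that are not part of the example above; using infinite outer edges is one possible way to give every value a bin.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; -5 and 120 lie outside the 0-100 range
scores = pd.Series([-5, 50, 100, 120])
labels = ['Fail', 'Pass', 'Excellent']

# Values outside the outermost edges get NaN instead of a label
bounded = pd.cut(scores, bins=[0, 60, 80, 100], labels=labels)
print(bounded.isna().tolist())  # [True, False, False, True]

# Infinite outer edges give every value a bin
open_ended = pd.cut(scores, bins=[-np.inf, 60, 80, np.inf], labels=labels)
print(open_ended.tolist())  # ['Fail', 'Fail', 'Excellent', 'Excellent']
```

Whether out-of-range values should become NaN or be absorbed into the outer bins depends on the data; NaN can be useful for spotting invalid input.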

Including Edge Values

import pandas as pd

df = pd.DataFrame({'score': [0, 60, 80, 100]})

# By default each bin is (lower, upper]: the left edge is excluded, the right edge is included
# include_lowest=True also includes the leftmost edge, so a score of 0 gets a grade
df['grade'] = pd.cut(
    df['score'],
    bins=[0, 60, 80, 100],
    labels=['Fail', 'Pass', 'Excellent'],
    include_lowest=True
)
print(df)

Output:

   score      grade
0      0       Fail
1     60       Fail
2     80       Pass
3    100  Excellent
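
In the output above, 60 is graded Fail and 80 is graded Pass because the right edge of each bin is inclusive. If a boundary score should belong to the higher grade instead, pd.cut() accepts right=False, which makes each bin [lower, upper). The sketch below assumes a last edge of 101 (an illustrative choice, not from the example above) so that a score of 100 still lands in a bin.

```python
import pandas as pd

scores = pd.Series([0, 59, 60, 79, 80, 100])

# right=False makes each bin [lower, upper): the left edge is included
# and the right edge is excluded, so 60 is 'Pass' and 80 is 'Excellent'.
# The last edge is 101 (assumed) so that 100 falls inside [80, 101).
grades = pd.cut(
    scores,
    bins=[0, 60, 80, 101],
    labels=['Fail', 'Pass', 'Excellent'],
    right=False,
)
print(grades.tolist())
# ['Fail', 'Fail', 'Pass', 'Pass', 'Excellent', 'Excellent']
```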

Using np.select(): Multiple Conditions

Ideal for complex logic involving multiple columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [15, 22, 45, 70, 19],
    'is_student': [True, True, False, False, False]
})

# Define conditions (evaluated in order)
conditions = [
    df['age'] < 18,
    df['is_student'],  # already boolean, so no == True comparison is needed
    df['age'] >= 65
]

# Corresponding labels
choices = ['Minor', 'Student', 'Senior']

# Assign with default for unmatched rows
df['category'] = np.select(conditions, choices, default='Adult')
print(df)

Output:

   age  is_student category
0   15        True    Minor
1   22        True  Student
2   45       False    Adult
3   70       False   Senior
4   19       False    Adult

Tip: Conditions in np.select() are evaluated in order. The first matching condition wins, so place more specific conditions before general ones.
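
To illustrate why the order matters, here is a small sketch that applies the same three conditions in two different orders. The two-row DataFrame is hypothetical and chosen so that each row matches more than one condition.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that each satisfy two conditions at once
df = pd.DataFrame({'age': [15, 70], 'is_student': [True, True]})

minor = df['age'] < 18
student = df['is_student']
senior = df['age'] >= 65

# Same conditions, two different orderings
student_first = np.select(
    [student, minor, senior], ['Student', 'Minor', 'Senior'], default='Adult'
)
age_first = np.select(
    [minor, senior, student], ['Minor', 'Senior', 'Student'], default='Adult'
)

print(student_first.tolist())  # ['Student', 'Student']
print(age_first.tolist())      # ['Minor', 'Senior']
```

Because both rows are students, putting the student check first hides the age-based categories entirely.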

Using np.where(): Binary Classification

For simple either/or conditions:

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [55, 92, 78, 40, 85]})

df['passed'] = np.where(df['score'] >= 60, 'Yes', 'No')
print(df)

Output:

   score passed
0     55     No
1     92    Yes
2     78    Yes
3     40     No
4     85    Yes

Nested np.where() for Multiple Categories

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [55, 92, 78, 40]})

# The inner np.where() is only used for rows where the outer condition is False
df['grade'] = np.where(
    df['score'] >= 80, 'Excellent',
    np.where(df['score'] >= 60, 'Pass', 'Fail')
)
print(df)

Output:

   score      grade
0     55       Fail
1     92  Excellent
2     78       Pass
3     40       Fail

Using map(): Dictionary Lookup

For direct value mapping:

import pandas as pd

df = pd.DataFrame({'status_code': [1, 2, 3, 1, 2]})

status_map = {
    1: 'Pending',
    2: 'Approved',
    3: 'Rejected'
}

df['status'] = df['status_code'].map(status_map)
print(df)

Output:

   status_code    status
0            1   Pending
1            2  Approved
2            3  Rejected
3            1   Pending
4            2  Approved
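
One detail worth knowing: values with no entry in the dictionary become NaN after .map(). The sketch below assumes a hypothetical unmapped code (9) and an illustrative fallback label ('Unknown'); neither appears in the example above.

```python
import pandas as pd

codes = pd.Series([1, 2, 9])  # 9 is a hypothetical code with no mapping
status_map = {1: 'Pending', 2: 'Approved', 3: 'Rejected'}

# Codes missing from the dictionary are mapped to NaN
status = codes.map(status_map)
print(status.isna().tolist())  # [False, False, True]

# fillna() supplies a fallback label ('Unknown' is an illustrative choice)
status_filled = status.fillna('Unknown')
print(status_filled.tolist())  # ['Pending', 'Approved', 'Unknown']
```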

Using pd.qcut(): Quantile-Based Binning

Creates bins that each hold roughly the same number of records:

import pandas as pd

df = pd.DataFrame({'income': [20000, 35000, 50000, 75000, 150000]})

# Split into 3 groups of roughly equal size (5 rows cannot be divided evenly)
df['income_tier'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
print(df)

Output:

   income income_tier
0   20000         Low
1   35000         Low
2   50000      Medium
3   75000        High
4  150000        High
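
To see where pd.qcut() placed the boundaries, one option is to pass retbins=True, which returns the computed bin edges alongside the labels. This is a minimal sketch using the same income values; the variable names are illustrative.

```python
import pandas as pd

income = pd.Series([20000, 35000, 50000, 75000, 150000])

# retbins=True returns the computed quantile edges as a second value
tiers, edges = pd.qcut(
    income, q=3, labels=['Low', 'Medium', 'High'], retbins=True
)

print(edges)                 # the 4 edges that delimit the 3 bins
print(tiers.value_counts())  # how many rows landed in each tier
```

With only five rows the tiers cannot be perfectly even, which is why Medium holds a single record in the output above.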

Performance Comparison

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': range(100000)})

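# ❌ Slow: apply with lambda (runs Python code once per row)
df['grade'] = df['score'].apply(
    lambda x: 'Excellent' if x >= 80 else ('Pass' if x >= 60 else 'Fail')
)

# ✅ Fast: pd.cut (vectorized)
# right=False gives [lower, upper) bins and the infinite edges catch every
# score, so the result matches the >= thresholds used by the other two methods
df['grade'] = pd.cut(
    df['score'],
    bins=[-np.inf, 60, 80, np.inf],
    labels=['Fail', 'Pass', 'Excellent'],
    right=False
)

# ✅ Fast: np.select (vectorized)
conditions = [df['score'] >= 80, df['score'] >= 60]
df['grade'] = np.select(conditions, ['Excellent', 'Pass'], default='Fail')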
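
The difference can be measured rather than assumed. The following is a rough timing sketch using the standard-library timeit module; the sample data (scores cycling from 0 to 100) and the function names are illustrative assumptions, and the actual numbers will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data: 100,000 scores cycling through 0-100
df = pd.DataFrame({'score': np.arange(100_000) % 101})

def with_apply():
    # Row-by-row Python function call
    return df['score'].apply(
        lambda x: 'Excellent' if x >= 80 else ('Pass' if x >= 60 else 'Fail')
    )

def with_select():
    # Vectorized comparison over the whole column
    conditions = [df['score'] >= 80, df['score'] >= 60]
    return np.select(conditions, ['Excellent', 'Pass'], default='Fail')

# Sanity check: both approaches assign identical grades
assert list(with_apply()) == with_select().tolist()

for name, fn in [('apply', with_apply), ('np.select', with_select)]:
    seconds = timeit.timeit(fn, number=5) / 5
    print(f'{name}: {seconds * 1000:.1f} ms per run')
```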

Quick Reference

Method       Best For                     Performance
pd.cut()     Numeric ranges/bins          ⚡ Fast
pd.qcut()    Equal-frequency bins         ⚡ Fast
np.select()  Multiple complex conditions  ⚡ Fast
np.where()   Binary conditions            ⚡ Fast
.map()       Direct value lookup          ⚡ Fast
.apply()     Complex custom logic         🐢 Slow

Summary

  • Use pd.cut() for straightforward numeric binning like grades or age groups.
  • Use np.select() for multi-condition classification involving multiple columns.
  • Use np.where() for simple binary categories.

Reserve .apply() for logic that is too complex to vectorize; it is typically 10-100x slower than vectorized alternatives.