Python Pandas: How to Classify and Grade Data with Pandas
Assigning categories to rows based on numerical ranges or complex conditions is a common data transformation. Pandas and NumPy provide efficient, vectorized methods that outperform row-by-row iteration.
Using pd.cut(): Numeric Binning
Best for converting continuous numbers into discrete categories:
import pandas as pd
df = pd.DataFrame({'score': [55, 92, 78, 40, 85, 100]})
# Define bin edges and labels
bins = [0, 60, 80, 100]
labels = ['Fail', 'Pass', 'Excellent']
df['grade'] = pd.cut(df['score'], bins=bins, labels=labels)
print(df)
Output:
score grade
0 55 Fail
1 92 Excellent
2 78 Pass
3 40 Fail
4 85 Excellent
5 100 Excellent
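One behavior worth checking before relying on pd.cut(): values outside the bin edges receive no label. The sketch below uses a made-up Series (the out-of-range scores are illustrative, not from the example above) to show which values end up as NaN and what dtype the result has.

```python
import pandas as pd

# Hypothetical scores, including values outside the 0-100 range
scores = pd.Series([-5, 0, 55, 100, 120])

grades = pd.cut(scores, bins=[0, 60, 80, 100], labels=['Fail', 'Pass', 'Excellent'])

# -5, 0 and 120 fall outside (0, 100], so they become NaN
out_of_range = grades.isna().tolist()

# The result is a categorical Series whose categories follow the bin order
is_categorical = isinstance(grades.dtype, pd.CategoricalDtype)

print(grades)
print(out_of_range, is_categorical)
```

If missing grades are unexpected, widening the outer edges or validating the input range first avoids silent NaN values.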
Including Edge Values
import pandas as pd
df = pd.DataFrame({'score': [0, 60, 80, 100]})
# Default intervals are (lower, upper]: left edge excluded, right edge included
# include_lowest=True also includes the leftmost edge (0), which would otherwise be NaN
df['grade'] = pd.cut(
    df['score'],
    bins=[0, 60, 80, 100],
    labels=['Fail', 'Pass', 'Excellent'],
    include_lowest=True
)
print(df)
Output:
score grade
0 0 Fail
1 60 Fail
2 80 Pass
3 100 Excellent
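With right-inclusive bins, a score of exactly 60 is graded Fail and 80 is graded Pass. If the intended rule is "60 and above passes", one option is the right=False parameter of pd.cut(), which makes each bin left-closed. The following is a minimal sketch of that variant; the infinite outer edges are an assumption added here so that no score falls outside the bins.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [0, 59, 60, 79, 80, 100]})

# right=False makes each bin left-closed: [lower, upper)
# Infinite outer edges keep every value inside some bin
df['grade'] = pd.cut(
    df['score'],
    bins=[-np.inf, 60, 80, np.inf],
    labels=['Fail', 'Pass', 'Excellent'],
    right=False,
)
print(df)
```

Here 60 and 79 land in Pass, while 80 and 100 land in Excellent.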
Using np.select(): Multiple Conditions
Ideal for complex logic involving multiple columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [15, 22, 45, 70, 19],
    'is_student': [True, True, False, False, False]
})
# Define conditions (evaluated in order)
conditions = [
    df['age'] < 18,
    df['is_student'],
    df['age'] >= 65
]
# Corresponding labels
choices = ['Minor', 'Student', 'Senior']
# Assign with default for unmatched rows
df['category'] = np.select(conditions, choices, default='Adult')
print(df)
Output:
age is_student category
0 15 True Minor
1 22 True Student
2 45 False Adult
3 70 False Senior
4 19 False Adult
Tip: Conditions in np.select() are evaluated in order. The first matching condition wins, so place more specific conditions before general ones.
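To make the ordering rule concrete, the sketch below runs np.select() twice on the same made-up scores, once with the narrower condition first and once with the broader one first. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95, 70, 30]})

# Specific condition first: 95 matches '>= 80' before '>= 60' is checked
good_order = np.select(
    [df['score'] >= 80, df['score'] >= 60],
    ['Excellent', 'Pass'],
    default='Fail',
)

# General condition first: '>= 60' captures 95, so 'Excellent' is never assigned
bad_order = np.select(
    [df['score'] >= 60, df['score'] >= 80],
    ['Pass', 'Excellent'],
    default='Fail',
)

print(good_order.tolist())
print(bad_order.tolist())
```

The second call raises no error; it simply mislabels the top score as Pass, which is why the ordering deserves a quick check.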
Using np.where(): Binary Classification
For simple either/or conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'score': [55, 92, 78, 40, 85]})
df['passed'] = np.where(df['score'] >= 60, 'Yes', 'No')
print(df)
Output:
score passed
0 55 No
1 92 Yes
2 78 Yes
3 40 No
4 85 Yes
Nested np.where() for Multiple Categories
import pandas as pd
import numpy as np
df = pd.DataFrame({'score': [55, 92, 78, 40]})
df['grade'] = np.where(
    df['score'] >= 80, 'Excellent',
    np.where(df['score'] >= 60, 'Pass', 'Fail')
)
print(df)
Output:
score grade
0 55 Fail
1 92 Excellent
2 78 Pass
3 40 Fail
Using map(): Dictionary Lookup
For direct value mapping:
import pandas as pd
df = pd.DataFrame({'status_code': [1, 2, 3, 1, 2]})
status_map = {
    1: 'Pending',
    2: 'Approved',
    3: 'Rejected'
}
df['status'] = df['status_code'].map(status_map)
print(df)
Output:
status_code status
0 1 Pending
1 2 Approved
2 3 Rejected
3 1 Pending
4 2 Approved
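A dictionary lookup only covers the keys it contains. The sketch below shows what happens to a value with no entry in the dictionary and one way to supply a fallback; the code 9 and the label 'Unknown' are hypothetical and chosen only for illustration.

```python
import pandas as pd

codes = pd.Series([1, 2, 9])  # 9 is a hypothetical code with no entry in the map
status_map = {1: 'Pending', 2: 'Approved', 3: 'Rejected'}

# Values missing from the dictionary map to NaN
mapped = codes.map(status_map)

# Replace NaN with a fallback label
labeled = mapped.fillna('Unknown')

print(labeled)
```

Leaving the NaN in place can also be useful, since it flags unexpected codes for later inspection.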
Using pd.qcut(): Quantile-Based Binning
Creates bins that each hold roughly the same number of records:
import pandas as pd
df = pd.DataFrame({'income': [20000, 35000, 50000, 75000, 150000]})
# Split into 3 groups of roughly equal size
df['income_tier'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
print(df)
Output:
income income_tier
0 20000 Low
1 35000 Low
2 50000 Medium
3 75000 High
4 150000 High
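Because pd.qcut() derives its bin edges from the data, it can help to inspect them. The sketch below reuses the income values from the example and passes retbins=True, which makes pd.qcut() return the computed edges alongside the labels.

```python
import pandas as pd

income = pd.Series([20000, 35000, 50000, 75000, 150000])

# retbins=True returns a (labels, edges) tuple
tiers, edges = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'], retbins=True)

print(edges)                 # bin edges computed from the data's quantiles
print(tiers.value_counts())  # rows per tier
```

With five rows and three tiers the groups cannot be exactly equal, which is why the tiers hold two, one, and two rows.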
Performance Comparison
import pandas as pd
import numpy as np
df = pd.DataFrame({'score': range(100000)})
# ❌ Slow: apply with lambda
df['grade'] = df['score'].apply(
    lambda x: 'Excellent' if x >= 80 else ('Pass' if x >= 60 else 'Fail')
)
# ✅ Fast: pd.cut (vectorized); right=False and open-ended edges match the >= logic above
df['grade'] = pd.cut(df['score'], bins=[-np.inf, 60, 80, np.inf], labels=['Fail', 'Pass', 'Excellent'], right=False)
# ✅ Fast: np.select (vectorized)
conditions = [df['score'] >= 80, df['score'] >= 60]
df['grade'] = np.select(conditions, ['Excellent', 'Pass'], default='Fail')
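To measure the difference rather than assume it, a rough timing sketch with the standard-library timeit module might look like the following. The random scores, the helper names with_apply and with_select, and the run count are all assumptions made for illustration; actual timings depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random scores between 0 and 100
df = pd.DataFrame({'score': np.random.default_rng(0).integers(0, 101, size=100_000)})

def with_apply():
    # Row-by-row Python function call
    return df['score'].apply(
        lambda x: 'Excellent' if x >= 80 else ('Pass' if x >= 60 else 'Fail')
    )

def with_select():
    # Vectorized: conditions are evaluated on whole columns
    conditions = [df['score'] >= 80, df['score'] >= 60]
    return np.select(conditions, ['Excellent', 'Pass'], default='Fail')

# Both approaches should produce identical labels
same = with_apply().tolist() == with_select().tolist()

# Total wall-clock time for 5 runs of each approach
t_apply = timeit.timeit(with_apply, number=5)
t_select = timeit.timeit(with_select, number=5)
print(f'apply: {t_apply:.3f}s  np.select: {t_select:.3f}s  identical: {same}')
```

Checking that the outputs match before comparing times guards against benchmarking two approaches that do not compute the same thing.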
Quick Reference
| Method | Best For | Performance |
|---|---|---|
| pd.cut() | Numeric ranges/bins | ⚡ Fast |
| pd.qcut() | Equal-frequency bins | ⚡ Fast |
| np.select() | Multiple complex conditions | ⚡ Fast |
| np.where() | Binary conditions | ⚡ Fast |
| .map() | Direct value lookup | ⚡ Fast |
| .apply() | Complex custom logic | 🐢 Slow |
Summary
- Use pd.cut() for straightforward numeric binning like grades or age groups.
- Use np.select() for multi-condition classification involving multiple columns.
- Use np.where() for simple binary categories.
Reserve .apply() for logic too complex to vectorize; it is typically 10-100x slower than vectorized alternatives.