How to Calculate Absolute Frequency and Relative Frequency in Python Pandas
Understanding value distributions is the first step in data exploration. How many users are "Active"? What percentage of transactions failed? Pandas provides elegant solutions for both absolute counts and relative percentages.
Core Concepts
- Absolute Frequency: Raw counts (e.g., "50 users")
- Relative Frequency: Proportions or percentages (e.g., "25% of users")
Pandas handles both with the versatile .value_counts() method.
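To make the distinction concrete before diving in, here is a minimal sketch using an illustrative, made-up series of colors:

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'red'])

# Absolute frequency: raw counts per value
print(colors.value_counts())
# red     3
# blue    1

# Relative frequency: proportions that sum to 1.0
print(colors.value_counts(normalize=True))
# red     0.75
# blue    0.25
```

The same method produces both views; normalize=True just divides each count by the total.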
Absolute Frequency
Count distinct values by calling .value_counts() on a Series:
import pandas as pd
data = ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass']
df = pd.DataFrame(data, columns=['Status'])
counts = df['Status'].value_counts()
print(counts)
Output:
Status
Pass 4
Fail 2
Name: count, dtype: int64
Sorting Options
import pandas as pd
df = pd.DataFrame({'Grade': ['B', 'A', 'C', 'A', 'B', 'A', 'B', 'C']})
# Default: sorted by count (descending)
print(df['Grade'].value_counts())
# B 3
# A 3
# C 2
# Sort alphabetically by index
print(df['Grade'].value_counts().sort_index())
# A 3
# B 3
# C 2
# Ascending count order
print(df['Grade'].value_counts(ascending=True))
# C 2
# B 3
# A 3
Output:
Grade
B 3
A 3
C 2
Name: count, dtype: int64
Grade
A 3
B 3
C 2
Name: count, dtype: int64
Grade
C 2
B 3
A 3
Name: count, dtype: int64
Relative Frequency (Percentages)
Use normalize=True to get proportions instead of raw counts:
import pandas as pd
data = ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass']
df = pd.DataFrame(data, columns=['Status'])
percentages = df['Status'].value_counts(normalize=True)
print(percentages)
Output:
Status
Pass 0.666667
Fail 0.333333
Name: proportion, dtype: float64
Formatting as Readable Percentages
import pandas as pd
df = pd.DataFrame({'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass']})
# Multiply by 100
pct = df['Status'].value_counts(normalize=True) * 100
print(pct)
# Format with % symbol
formatted = df['Status'].value_counts(normalize=True).apply(lambda x: f"{x:.1%}")
print(formatted)
Output:
Status
Pass 66.666667
Fail 33.333333
Name: proportion, dtype: float64
Status
Pass 66.7%
Fail 33.3%
Name: proportion, dtype: object
Use normalize=True to quickly identify dominant categories. A value appearing in 95% of rows behaves differently than one appearing in 5%.
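One way to put this into practice is to filter the proportions against a threshold. This is a sketch with illustrative data and an arbitrary 90% cutoff:

```python
import pandas as pd

# Illustrative series: one near-constant value plus a rare one
s = pd.Series(['ok'] * 95 + ['error'] * 5)

props = s.value_counts(normalize=True)

# Flag dominant categories that exceed the threshold
dominant = props[props > 0.9]
print(dominant)
# ok    0.95
```

A column dominated by a single value like this often carries little discriminative information and may deserve a closer look before modeling.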
Handling Missing Values
By default, value_counts() ignores NaN values. Include them with dropna=False:
import pandas as pd
data = ['A', 'B', None, 'A', None, 'B', 'A']
s = pd.Series(data)
# Default: excludes NaN
print(s.value_counts())
print()
# Include NaN counts
print(s.value_counts(dropna=False))
Output:
A 3
B 2
Name: count, dtype: int64
A 3
B 2
None 2
Name: count, dtype: int64
Percentage Including Missing
import pandas as pd
import numpy as np
df = pd.DataFrame({'Category': ['X', 'Y', np.nan, 'X', np.nan]})
# What percentage of data is missing?
pct_with_na = df['Category'].value_counts(normalize=True, dropna=False)
print(pct_with_na)
Output:
Category
X 0.4
NaN 0.4
Y 0.2
Name: proportion, dtype: float64
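If you only need the missing share itself rather than the full breakdown, the mean of a boolean mask gives the same number more directly (shown on the same Category data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': ['X', 'Y', np.nan, 'X', np.nan]})

# isna() yields True/False; the mean of that mask is the proportion of True
missing_share = df['Category'].isna().mean()
print(missing_share)
# 0.4
```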
Binning Numeric Data
For continuous data, use bins to create frequency distributions:
import pandas as pd
ages = pd.Series([22, 35, 45, 23, 38, 52, 28, 41, 55, 33])
# Create age groups
age_counts = ages.value_counts(bins=3, sort=False)
print(age_counts)
Output:
(21.965999999999998, 33.0] 4
(33.0, 44.0] 3
(44.0, 55.0] 3
Name: count, dtype: int64
Custom Bins
import pandas as pd
scores = pd.Series([45, 67, 82, 91, 73, 55, 88, 76, 62, 95])
# Define grade boundaries
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels)
print(grades.value_counts().sort_index())
Output:
F 2
D 2
C 2
B 2
A 2
Name: count, dtype: int64
Both Counts and Percentages Together
Create a summary DataFrame with both metrics:
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C', 'B', 'A']
})
counts = df['Product'].value_counts()
percentages = df['Product'].value_counts(normalize=True) * 100
summary = pd.DataFrame({
'Count': counts,
'Percentage': percentages.round(1)
})
print(summary)
Output:
Count Percentage
Product
A 5 50.0
B 3 30.0
C 2 20.0
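The same summary can also be assembled in a single expression with pd.concat, a minor stylistic alternative to building the DataFrame by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C', 'B', 'A']
})

# Concatenate counts and percentages side by side as named columns
summary = pd.concat(
    [df['Product'].value_counts(),
     (df['Product'].value_counts(normalize=True) * 100).round(1)],
    axis=1, keys=['Count', 'Percentage']
)
print(summary)
```

Both value_counts results share the same index, so the columns align automatically.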
Cross-Tabulation for Multiple Columns
Compare frequencies across two categorical variables:
import pandas as pd
df = pd.DataFrame({
'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'M'],
'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Pass', 'Fail', 'Pass', 'Fail']
})
# Count combinations
print(pd.crosstab(df['Gender'], df['Status']))
print()
# With percentages (normalize by row)
print(pd.crosstab(df['Gender'], df['Status'], normalize='index').round(2))
Output:
Status Fail Pass
Gender
F 1 2
M 2 3
Status Fail Pass
Gender
F 0.33 0.67
M 0.40 0.60
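crosstab can also append row and column totals with margins=True, which is handy for sanity-checking the counts (shown here on the same Gender/Status data):

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'M'],
    'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Pass', 'Fail', 'Pass', 'Fail']
})

# margins=True adds an 'All' row and column holding the totals
table = pd.crosstab(df['Gender'], df['Status'], margins=True)
print(table)
# F: 1 Fail, 2 Pass; M: 2 Fail, 3 Pass; grand total 8
```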
Practical Example: Survey Analysis
import pandas as pd
# Survey responses
survey = pd.DataFrame({
'Satisfaction': ['Very Happy', 'Happy', 'Neutral', 'Happy',
'Unhappy', 'Very Happy', 'Happy', 'Neutral',
'Happy', 'Very Happy']
})
# Order matters for satisfaction levels
order = ['Very Happy', 'Happy', 'Neutral', 'Unhappy', 'Very Unhappy']
counts = survey['Satisfaction'].value_counts()
# Reindex to enforce order
counts = counts.reindex(order).fillna(0).astype(int)
print("Survey Results:")
print(counts)
# Cumulative percentage
cumulative = (counts.cumsum() / counts.sum() * 100).round(1)
print(f"\n{cumulative['Happy']}% are Happy or better")
Output:
Survey Results:
Satisfaction
Very Happy 3
Happy 4
Neutral 2
Unhappy 1
Very Unhappy 0
Name: count, dtype: int64
70.0% are Happy or better
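An alternative to reindexing is an ordered Categorical dtype: declared once, it makes sort_index() respect the logical order and keeps zero-count levels automatically. A sketch on the same survey data:

```python
import pandas as pd

order = ['Very Happy', 'Happy', 'Neutral', 'Unhappy', 'Very Unhappy']
responses = ['Very Happy', 'Happy', 'Neutral', 'Happy',
             'Unhappy', 'Very Happy', 'Happy', 'Neutral',
             'Happy', 'Very Happy']

# Declare the ordering once; unobserved categories still appear with count 0
s = pd.Series(pd.Categorical(responses, categories=order, ordered=True))
counts = s.value_counts().sort_index()
print(counts)
```

This is useful when the same ordering applies across many analyses, since it travels with the data rather than being reapplied after every count.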
Quick Reference
| Goal | Code |
|---|---|
| Count values | df['col'].value_counts() |
| Get percentages | df['col'].value_counts(normalize=True) |
| Include missing | df['col'].value_counts(dropna=False) |
| Sort by index | df['col'].value_counts().sort_index() |
| Ascending order | df['col'].value_counts(ascending=True) |
| Bin numeric data | df['col'].value_counts(bins=5) |
| Cross-tabulation | pd.crosstab(df['col1'], df['col2']) |
Summary
- Use value_counts() for absolute frequencies and add normalize=True for relative frequencies.
- Remember to set dropna=False when missing values are analytically important.
- For numeric data, the bins parameter creates histogram-style frequency distributions without needing to pre-categorize your data.