Python Pandas: How to Calculate Cumulative Percentage
Cumulative percentages show the running total of a value expressed as a proportion of the overall total. They are essential for Pareto analysis (the 80/20 rule), understanding distribution patterns, identifying top contributors, and tracking progressive totals across sorted data. In fields ranging from sales analytics to inventory management, cumulative percentages help answer questions like "which products account for 80% of our revenue?"
In this guide, you will learn how to calculate cumulative percentages in Pandas, perform Pareto analysis to identify top contributors, and compute grouped cumulative percentages within categories.
The Standard Pattern
The formula for cumulative percentage is straightforward: (Cumulative Sum / Total Sum) × 100. Pandas makes this easy with the .cumsum() and .sum() methods:
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [500, 300, 100, 50, 50]
})
df['Cum_Sum'] = df['Sales'].cumsum()
df['Cum_Percent'] = (df['Sales'].cumsum() / df['Sales'].sum()) * 100
print(df)
Output:
Product Sales Cum_Sum Cum_Percent
0 A 500 500 50.0
1 B 300 800 80.0
2 C 100 900 90.0
3 D 50 950 95.0
4 E 50 1000 100.0
Each row shows what proportion of the total has been accumulated up to and including that row. By the last row, the cumulative percentage always reaches 100%.
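The same pattern extends to frequency distributions. As a minimal sketch (with made-up survey data), value_counts(normalize=True) already sorts by frequency in descending order, so chaining .cumsum() gives a cumulative percentage directly:

```python
import pandas as pd

# Hypothetical survey responses (illustrative data only)
s = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# value_counts(normalize=True) sorts by frequency descending,
# so its cumsum() is already a descending-order cumulative percentage
cum_pct = s.value_counts(normalize=True).cumsum() * 100
print(cum_pct.round(1))
```

Here 'red' covers 50% of responses on its own, and 'red' plus 'blue' together cover about 83.3%.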
Pareto Analysis (80/20 Rule)
Pareto analysis identifies the small number of items that contribute the most to a total. The key requirement is to sort the data in descending order before calculating cumulative percentages. Without sorting, the cumulative values will not reveal which items are the top contributors.
import pandas as pd
df = pd.DataFrame({
'Product': ['C', 'A', 'E', 'B', 'D'],
'Sales': [100, 500, 50, 300, 50]
})
# Step 1: Sort by sales descending
df = df.sort_values('Sales', ascending=False).reset_index(drop=True)
# Step 2: Calculate cumulative percentage
df['Cum_Percent'] = (df['Sales'].cumsum() / df['Sales'].sum()) * 100
print(df)
Output:
Product Sales Cum_Percent
0 A 500 50.0
1 B 300 80.0
2 C 100 90.0
3 E 50 95.0
4 D 50 100.0
Products A and B together account for 80% of total sales, illustrating the classic 80/20 pattern where a small subset of items drives the majority of results.
Forgetting to sort before calculating cumulative percentages is a common mistake. Without descending sort, the cumulative values will not correctly identify which items are the biggest contributors, making Pareto analysis meaningless.
Finding the 80% Threshold Programmatically
Rather than reading the table manually, you can filter the DataFrame to extract exactly which items fall within the 80% threshold:
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [500, 300, 100, 50, 50]
})
df = df.sort_values('Sales', ascending=False).reset_index(drop=True)
df['Cum_Percent'] = (df['Sales'].cumsum() / df['Sales'].sum()) * 100
# Products contributing to 80% of sales
top_products = df[df['Cum_Percent'] <= 80]
print(f"Top products (80% of sales): {top_products['Product'].tolist()}")
print(f"Number of products: {len(top_products)} out of {len(df)}")
Output:
Top products (80% of sales): ['A', 'B']
Number of products: 2 out of 5
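The `<= 80` filter works cleanly above because one row lands exactly on 80%. With real data that rarely happens, and the filter then excludes the row that crosses the threshold. One way to include the crossing row is to test the previous row's cumulative percentage instead, as in this sketch (with illustrative numbers chosen so no row lands exactly on 80%):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Sales': [500, 310, 100, 50, 40]  # illustrative data; no row hits exactly 80%
})
df = df.sort_values('Sales', ascending=False).reset_index(drop=True)
df['Cum_Percent'] = (df['Sales'].cumsum() / df['Sales'].sum()) * 100

# Keep every row whose *previous* cumulative percentage was still below 80,
# so the row that crosses the threshold is included
mask = df['Cum_Percent'].shift(fill_value=0) < 80
top_products = df[mask]
print(top_products['Product'].tolist())
```

Here the cumulative percentages are 50, 81, 91, 96, 100: a plain `<= 80` filter would return only product A, while the shifted test correctly returns A and B, the set needed to reach at least 80% of sales.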
Building a Complete Pareto Table
A full Pareto analysis table includes both individual percentages and cumulative percentages, giving a complete picture of each item's contribution:
import pandas as pd
df = pd.DataFrame({
'Category': ['Electronics', 'Clothing', 'Food', 'Books', 'Other'],
'Revenue': [50000, 30000, 15000, 3000, 2000]
})
# Sort and calculate all metrics
df = df.sort_values('Revenue', ascending=False).reset_index(drop=True)
total = df['Revenue'].sum()
df['Percent'] = (df['Revenue'] / total) * 100
df['Cum_Revenue'] = df['Revenue'].cumsum()
df['Cum_Percent'] = (df['Cum_Revenue'] / total) * 100
print(df.round(1))
Output:
Category Revenue Percent Cum_Revenue Cum_Percent
0 Electronics 50000 50.0 50000 50.0
1 Clothing 30000 30.0 80000 80.0
2 Food 15000 15.0 95000 95.0
3 Books 3000 3.0 98000 98.0
4 Other 2000 2.0 100000 100.0
This format makes it easy to see that Electronics alone accounts for 50% of revenue, and Electronics plus Clothing together reach the 80% mark.
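A Pareto table like this also supports ABC classification, which buckets items by cumulative contribution. As a sketch, pd.cut can assign the classes; the 80%/95% cut-offs used here are a common convention, not a fixed rule:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Food', 'Books', 'Other'],
    'Revenue': [50000, 30000, 15000, 3000, 2000]
})
df = df.sort_values('Revenue', ascending=False).reset_index(drop=True)
df['Cum_Percent'] = (df['Revenue'].cumsum() / df['Revenue'].sum()) * 100

# Bucket by cumulative contribution: A = first 80%, B = next 15%, C = the rest
# (the 80/95 thresholds are a common ABC-analysis convention, adjust as needed)
df['Class'] = pd.cut(df['Cum_Percent'], bins=[0, 80, 95, 100], labels=['A', 'B', 'C'])
print(df[['Category', 'Cum_Percent', 'Class']])
```

Electronics and Clothing land in class A, Food in class B, and the long tail of Books and Other in class C.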
Grouped Cumulative Percentage
When your data spans multiple categories or regions, you often need cumulative percentages calculated within each group rather than across the entire dataset. Use groupby() combined with a cumulative sum:
import pandas as pd
df = pd.DataFrame({
'Region': ['East', 'East', 'East', 'West', 'West'],
'Product': ['A', 'B', 'C', 'A', 'B'],
'Sales': [100, 200, 50, 150, 100]
})
# Sort within each group by sales descending
df = df.sort_values(['Region', 'Sales'], ascending=[True, False])
# Calculate cumulative percentage within each region
df['Cum_Pct'] = df.groupby('Region')['Sales'].transform(
lambda x: (x.cumsum() / x.sum()) * 100
)
print(df.round(1))
Output:
Region Product Sales Cum_Pct
1 East B 200 57.1
0 East A 100 85.7
2 East C 50 100.0
3 West A 150 60.0
4 West B 100 100.0
Each region's cumulative percentages independently reach 100%, giving you a Pareto breakdown within each group.
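The lambda can also be avoided entirely. As an alternative sketch of the same calculation, groupby().cumsum() gives the per-group running total and transform('sum') broadcasts each group's total back to every row, both aligned to the DataFrame's index:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'C', 'A', 'B'],
    'Sales': [100, 200, 50, 150, 100]
})
df = df.sort_values(['Region', 'Sales'], ascending=[True, False])

# Per-group running total divided by per-group total, both index-aligned
group_cumsum = df.groupby('Region')['Sales'].cumsum()
group_total = df.groupby('Region')['Sales'].transform('sum')
df['Cum_Pct'] = group_cumsum / group_total * 100
print(df.round(1))
```

Because both intermediate Series keep the original index, this version is robust to row order and avoids the per-group Python lambda, which can matter on large DataFrames.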
Common Mistake: Calculating Without Sorting
To see why sorting matters, compare the results with and without descending sort:
import pandas as pd
df = pd.DataFrame({
'Product': ['C', 'A', 'E', 'B', 'D'],
'Sales': [100, 500, 50, 300, 50]
})
# Without sorting (misleading)
df['Cum_Pct_Unsorted'] = (df['Sales'].cumsum() / df['Sales'].sum()) * 100
print("Without sorting:")
print(df[['Product', 'Sales', 'Cum_Pct_Unsorted']])
Output:
Without sorting:
Product Sales Cum_Pct_Unsorted
0 C 100 10.0
1 A 500 60.0
2 E 50 65.0
3 B 300 95.0
4 D 50 100.0
The unsorted version makes it appear that product C is a top contributor simply because it happens to be first in the data. After sorting, the true picture emerges with product A clearly leading.
Always sort your data in descending order by the value column before computing cumulative percentages for Pareto analysis. The cumulative percentage is only meaningful for identifying top contributors when the largest values come first.
Quick Reference
| Step | Method | Purpose |
|---|---|---|
| Sort | .sort_values(ascending=False) | Required for meaningful Pareto analysis |
| Running total | .cumsum() | Accumulate values progressively |
| Total | .sum() | Denominator for percentage |
| Normalize | / total * 100 | Convert to percentage |
| Group | .groupby().transform() | Cumulative percentage within categories |
- Calculate cumulative percentages with (df['col'].cumsum() / df['col'].sum()) * 100.
- For Pareto analysis, always sort in descending order first so the largest contributors appear at the top.
- Use groupby() with a per-group cumulative sum when you need independent cumulative percentages within separate categories or regions.