How to Extract Percentages from a String in Python
Extracting percentage values from text is a common task in data mining, report parsing, web scraping, and natural language processing. Whether you're processing financial reports, survey results, or log files, you often need to pull out values like "85%", "99.9%", or "100%" from unstructured text.
In this guide, you will learn several ways to extract percentages from strings in Python, mainly with regular expressions, using patterns that handle integers, decimals, negative values, and spaces before the % sign.
Understanding the Problem
Given a string containing text mixed with percentage values:
text = "The success rate is 85% with 99.9% accuracy and 100% coverage"
Extract all percentages:
["85%", "99.9%", "100%"]
Method 1: Using re.findall() with Regex (Recommended)
The most reliable approach uses re.findall() with a regex pattern that matches numbers followed by the % symbol:
import re
text = "tutorialreference is 100% way to get 200% success"
percentages = re.findall(r"\d+%", text)
print(percentages)
Output:
['100%', '200%']
How it works:
- \d+ matches one or more digits.
- % matches the literal percent sign.
- findall() returns all non-overlapping matches as a list.
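When the position of each match matters (for example, to highlight it in the source text), re.finditer() can stand in for re.findall(). The following is a minimal sketch rather than part of the method above; the variable name hits is made up for the illustration.

```python
import re

text = "tutorialreference is 100% way to get 200% success"

# finditer() yields match objects, so each hit carries its start index
# as well as the matched text.
hits = [(m.group(), m.start()) for m in re.finditer(r"\d+%", text)]

for value, start in hits:
    print(f"{value} found at index {start}")
```

findall() remains the shorter choice when only the matched strings are needed.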
Handling Decimal Percentages
The basic \d+% pattern mishandles decimal percentages: for 99.9% it returns only 9%, because the match cannot cross the decimal point. Use an extended pattern:
import re
text = "Accuracy is 99.9% and precision is 87.5%, with 100% recall"
percentages = re.findall(r"\d+\.?\d*%", text)
print(percentages)
Output:
['99.9%', '87.5%', '100%']
Pattern breakdown:
- \d+ - one or more digits (integer part)
- \.? - optional decimal point
- \d* - zero or more digits (decimal part)
- % - literal percent sign
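To see what each piece of the pattern contributes, the sketch below runs the integer-only pattern and the extended pattern on the same text. It also tries a stricter variant, \d+(?:\.\d+)?%, which is an assumption added here rather than a pattern from this guide: it requires digits after a decimal point, so a stray "5.%" is not treated as a percentage.

```python
import re

text = "Accuracy is 99.9% and precision is 87.5%, with 100% recall"

basic = re.findall(r"\d+%", text)            # integer-only pattern
extended = re.findall(r"\d+\.?\d*%", text)   # pattern from this section
# Assumed stricter variant: a decimal point must be followed by digits.
strict = re.findall(r"\d+(?:\.\d+)?%", text)

print(basic)     # only the digits directly before each % are kept
print(extended)
print(strict)

# The two decimal-aware patterns differ on a dangling decimal point.
stray = "5.%"
print(re.findall(r"\d+\.?\d*%", stray))
print(re.findall(r"\d+(?:\.\d+)?%", stray))
```

Whether "5.%" should count as a percentage depends on the data, so treat the stricter variant as an option rather than a replacement.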
Handling Spaces Before %
Some text formats include a space before the percent sign (e.g., "100 %"):
import re
text = "Success rate is 85 % with 99.9% accuracy"
# Match numbers with optional space before %
percentages = re.findall(r"\d+\.?\d*\s*%", text)
print(percentages)
Output:
['85 %', '99.9%']
Use \s* (zero or more whitespace characters) before % to handle both "85%" and "85 %" formats. If you want to normalize the output by removing spaces:
# Normalize: remove any whitespace before % (\s* can match tabs as well as spaces)
cleaned = ["".join(p.split()) for p in percentages]
print(cleaned) # ['85%', '99.9%']
Method 2: Using split() and Filtering
A non-regex approach splits the string into words and filters for those containing %:
text = "tutorialreference is 100% way to get 200% success"
words = text.split()
percentages = [word for word in words if "%" in word]
print(percentages)
Output:
['100%', '200%']
This works for simple cases but fails when the % is separated from the number by a space (the token is then just "%") or when the word contains other characters, such as trailing punctuation in "100%,".
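The failure modes are easier to see on a concrete string. The sample sentence below is invented for illustration; the regex shown for contrast is the space-tolerant pattern from Method 1.

```python
import re

text = "Rates: 100%, then 85 % (up from 50%)."

# split() keeps whole whitespace-separated tokens, punctuation included.
tokens = [word for word in text.split() if "%" in word]
print(tokens)

# The regex from Method 1 isolates just the number and the percent sign.
matches = re.findall(r"\d+\.?\d*\s*%", text)
print(matches)
```

The split() result carries the comma and the closing parenthesis along, and the spaced value is reduced to a bare "%", which is why the regex approach is usually the safer default.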
Method 3: Extracting Numeric Values Without the % Sign
Sometimes you need the numeric values as actual numbers rather than strings with %:
import re
text = "The rates are 15.5%, 22%, and 8.75% respectively"
# Extract just the numbers (without %)
values = [float(x) for x in re.findall(r"(\d+\.?\d*)\s*%", text)]
print(values)
Output:
[15.5, 22.0, 8.75]
How it works:
- The parentheses (\d+\.?\d*) create a capturing group that captures only the numeric part.
- findall() returns only the captured group content (without %).
- float() converts each match to a number.
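If the captured numbers feed into calculations, it can help to convert them to fractions between 0 and 1. The sketch below assumes exact decimal arithmetic is wanted and therefore uses decimal.Decimal instead of float; that choice, and the variable names, are illustrative rather than part of the method above.

```python
import re
from decimal import Decimal

text = "The rates are 15.5%, 22%, and 8.75% respectively"

# The capturing group returns only the numeric part of each match.
numbers = re.findall(r"(\d+\.?\d*)\s*%", text)

# Decimal keeps the digits exactly as written; dividing by 100 turns
# each percentage into a fraction.
fractions = [Decimal(n) / 100 for n in numbers]
print(fractions)
```

float() is simpler and is fine for most analysis; Decimal is mainly worth the extra import when rounding artifacts would matter, as in financial figures.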
Comprehensive Regex Pattern
Here's a more robust pattern that handles integers, decimals (including a leading-dot form such as .5%), a leading minus sign, and whitespace before the % sign:
import re
def extract_percentages(text):
    """Extract all percentages from text, handling various formats."""
    # Matches: 100%, 99.9%, 0.5%, .5%, 85 %, negative like -3.5%
    pattern = r"-?\d*\.?\d+\s*%"
    matches = re.findall(pattern, text)
    return [m.replace(" ", "") for m in matches]  # Normalize spaces
# Test with various formats
tests = [
    "Growth is 15.5% and decline is -3.2%",
    "Rates: 100%, 0.5%, .75%",
    "Score is 85 % out of 100 %",
    "No percentages here",
]
for text in tests:
    print(f"Input: {text}")
    print(f"Output: {extract_percentages(text)}\n")
Output:
Input: Growth is 15.5% and decline is -3.2%
Output: ['15.5%', '-3.2%']
Input: Rates: 100%, 0.5%, .75%
Output: ['100%', '0.5%', '.75%']
Input: Score is 85 % out of 100 %
Output: ['85%', '100%']
Input: No percentages here
Output: []
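A dense pattern like this one is easier to maintain when each part is annotated. One option is to compile it once with re.VERBOSE, which ignores whitespace inside the pattern and allows inline comments. The names PERCENT_RE and extract_percentages_verbose below are hypothetical, and the sketch strips all whitespace (not only spaces) during normalization.

```python
import re

# Same pattern as above, spelled out so each part carries a comment.
PERCENT_RE = re.compile(
    r"""
    -?      # optional minus sign
    \d*     # optional integer digits
    \.?     # optional decimal point
    \d+     # at least one digit
    \s*     # optional whitespace before the sign
    %       # literal percent sign
    """,
    re.VERBOSE,
)


def extract_percentages_verbose(text):
    """Return every percentage in text with inner whitespace removed."""
    return [re.sub(r"\s+", "", m) for m in PERCENT_RE.findall(text)]


print(extract_percentages_verbose("Score is 85 % out of 100 %"))
print(extract_percentages_verbose("Rates: 100%, 0.5%, .75%, and -3.2%"))
```

Compiling the pattern at module level also avoids rebuilding it on every call, which can matter when the function runs over many lines.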
Common Mistake: Matching Digits Embedded in Other Text
A pattern with no boundary check also matches digits that are attached to surrounding text, such as an identifier:
Problem: pattern too broad
import re
text = "ID12345% discount"
# This matches "12345%" which isn't a meaningful percentage
result = re.findall(r"\d+%", text)
print(result) # ['12345%']
Fix: add word boundaries to match standalone numbers
import re
text = "ID12345% discount is 25% off"
# \b ensures the number is a standalone word
result = re.findall(r"\b\d+\.?\d*\s*%", text)
print(result)
Output:
['25%']
Use word boundaries (\b) when you need to match only standalone percentage values and avoid capturing numbers embedded in identifiers or codes.
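One caveat: \b only checks the character immediately before the first digit, and a period counts as a non-word character. In a string such as "v1.5%", the "5" after the period therefore still matches. The sketch below shows this and tries a negative lookbehind, (?<![\w.]), as a possible stricter check. The lookbehind pattern and the sample text are assumptions added for illustration, not patterns from this guide.

```python
import re

text = "ID12345% discount, v1.5% build, 25% off"

# \b requires a non-word character (or the start of the string) before
# the first digit, so the "5" after the period in "v1.5%" qualifies.
with_boundary = re.findall(r"\b\d+\.?\d*\s*%", text)

# Assumed alternative: reject a match when the preceding character is a
# word character or a period.
with_lookbehind = re.findall(r"(?<![\w.])\d+(?:\.\d+)?\s*%", text)

print(with_boundary)
print(with_lookbehind)
```

The lookbehind is stricter than \b, so it would also skip a percentage that directly follows a sentence-ending period with no space; whether that trade-off is acceptable depends on the input.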
Extracting Percentages from Multi-Line Text
re.findall() scans the entire string, line breaks included, so the same patterns work unchanged on documents, reports, or log files:
import re
report = """
Q1 Revenue: increased by 12.5%
Q2 Revenue: decreased by 3.2%
Q3 Revenue: stable at 0%
Q4 Revenue: increased by 8.75%
Overall growth: 18.05%
"""
percentages = re.findall(r"-?\d+\.?\d*%", report)
print("All percentages found:", percentages)
print(f"Count: {len(percentages)}")
# Convert to numbers for analysis
values = [float(p.rstrip("%")) for p in percentages]
print(f"Average: {sum(values) / len(values):.2f}%")
print(f"Max: {max(values)}%")
print(f"Min: {min(values)}%")
Output:
All percentages found: ['12.5%', '3.2%', '0%', '8.75%', '18.05%']
Count: 5
Average: 8.50%
Max: 18.05%
Min: 0.0%
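A flat list loses track of which line each value came from. When every line follows a "label: ... N%" layout, as in the report above, the values can be keyed by label instead. The sketch below assumes that layout; the by_label name and the pattern are illustrative.

```python
import re

report = """
Q1 Revenue: increased by 12.5%
Q2 Revenue: decreased by 3.2%
Q3 Revenue: stable at 0%
Q4 Revenue: increased by 8.75%
Overall growth: 18.05%
"""

by_label = {}
for line in report.splitlines():
    # Group 1: text before the first colon. Group 2: the first number
    # that is followed by a percent sign.
    match = re.search(r"^(.*?):.*?(-?\d+\.?\d*)\s*%", line)
    if match:
        label, number = match.groups()
        by_label[label.strip()] = float(number)

for label, value in by_label.items():
    print(f"{label}: {value}%")
```

Note that the regex reads only the number: the 3.2% on the "decreased" line is stored as a positive value, so any sign implied by the wording has to be handled separately.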
Comparison of Methods
| Method | Handles Decimals | Handles Spaces | Handles Negatives | Robustness |
|---|---|---|---|---|
| re.findall(r"\d+%") | ❌ No | ❌ No | ❌ No | Basic |
| re.findall(r"\d+\.?\d*%") | ✅ Yes | ❌ No | ❌ No | Good |
| re.findall(r"-?\d*\.?\d+\s*%") | ✅ Yes | ✅ Yes | ✅ Yes | Robust |
| split() + filtering | ✅ Yes | ❌ No | ✅ Yes | Fragile (keeps attached punctuation) |
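One way to check a comparison like this against your own data is to run every approach on a single sample and look at the raw results. The sample string and the dictionary labels below are invented for the illustration.

```python
import re

sample = "Up 12%, down -3.5 %, flat at 0.5%"

patterns = {
    "integers only": r"\d+%",
    "with decimals": r"\d+\.?\d*%",
    "comprehensive": r"-?\d*\.?\d+\s*%",
}

# Run each regex on the same sample.
results = {name: re.findall(p, sample) for name, p in patterns.items()}
# The non-regex approach keeps whole whitespace-separated tokens.
results["split + filter"] = [w for w in sample.split() if "%" in w]

for name, found in results.items():
    print(f"{name:15} {found}")
```

Swapping in a few lines from the real input usually shows quickly which pattern is the minimum needed.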
Summary
Extracting percentages from strings in Python is best accomplished with regular expressions. Key takeaways:
- Use re.findall(r"\d+\.?\d*%", text) for the most common use case: matching integer and decimal percentages.
- Add \s* before % to handle formats with spaces between the number and percent sign.
- Add -? at the start to capture negative percentages.
- Use a capturing group such as (\d+\.?\d*) when you need only the numeric values without the % sign.
- Use word boundaries (\b) to avoid matching numbers embedded in identifiers.
- Convert results to float with float(p.rstrip("%")) for numerical analysis.