
How to Get Word Frequency as Percentage in Python

Calculating the percentage share of each word across a collection of text is a common task in natural language processing (NLP), text analytics, content analysis, and search engine optimization. Instead of just counting how many times a word appears, you express each word's frequency as a proportion of the total word count.

The formula is straightforward: (Occurrences of word / Total words) Ɨ 100.

In this guide, you'll learn efficient methods to compute word frequency percentages in Python, with clear examples and best practices for handling real-world text.

Understanding the Problem

Given a list of strings, calculate each unique word's frequency as a fraction (or percentage) of the total number of words across all strings.

Example: For the text ["Python is great", "Python is fun"]:

Word      Count   Total Words   Frequency
Python    2       6             33.33%
is        2       6             33.33%
great     1       6             16.67%
fun       1       6             16.67%
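
The numbers in this table can be reproduced in a few lines (a minimal sketch of the calculation, expanded on below):

```python
from collections import Counter

texts = ["Python is great", "Python is fun"]
words = " ".join(texts).split()
counts = Counter(words)
total = len(words)  # 6 words in total

# Each word's share of the total word count
for word, count in counts.items():
    print(f"{word}: {count}/{total} = {count / total * 100:.2f}%")
```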

Using collections.Counter (Recommended)

The collections.Counter class is purpose-built for counting occurrences. Combined with join() and split(), it provides the cleanest solution:

from collections import Counter

sentences = [
    "Python is great for data science",
    "Data science is fun",
    "Python is the best for learning",
]

# Join all strings and split into individual words
all_words = " ".join(sentences).split()

# Count word frequencies
word_counts = Counter(all_words)
total_words = sum(word_counts.values())

# Calculate percentage for each word
frequency_pct = {word: (count / total_words) * 100 for word, count in word_counts.items()}

# Display results sorted by frequency (descending)
for word, pct in sorted(frequency_pct.items(), key=lambda x: x[1], reverse=True):
    print(f" {word:<12} {pct:>6.2f}%")

Output:

 is            18.75%
 Python        12.50%
 for           12.50%
 science       12.50%
 great          6.25%
 data           6.25%
 Data           6.25%
 fun            6.25%
 the            6.25%
 best           6.25%
 learning       6.25%

How it works:

  1. " ".join(sentences) concatenates all strings into a single string separated by spaces.
  2. .split() breaks it into a list of individual words.
  3. Counter() counts how many times each word appears.
  4. Dividing each count by the total gives the frequency as a decimal; multiplying by 100 converts to a percentage.
Why Counter is the best choice

In CPython, Counter's inner counting loop is implemented in C, making it significantly faster than counting manually with a Python-level loop. It also provides useful methods like .most_common(n) for retrieving the top N words.
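
For instance, .most_common(n) returns the n most frequent words as (word, count) pairs, which pairs naturally with the percentage calculation:

```python
from collections import Counter

word_counts = Counter("the cat sat on the mat the end".split())
total = sum(word_counts.values())  # 8 words

# most_common(2) yields the two most frequent words, highest count first
for word, count in word_counts.most_common(2):
    print(f"{word}: {count / total * 100:.1f}%")
```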

Handling Case Sensitivity

Notice in the output above that "data" and "Data" are treated as separate words. In most text analysis scenarios, you want case-insensitive counting:

from collections import Counter

sentences = [
    "Python is great for data science",
    "Data science is fun",
    "Python is the best for learning",
]

# Normalize to lowercase before splitting
all_words = " ".join(sentences).lower().split()

word_counts = Counter(all_words)
total_words = sum(word_counts.values())

frequency_pct = {word: (count / total_words) * 100 for word, count in word_counts.items()}

for word, pct in sorted(frequency_pct.items(), key=lambda x: x[1], reverse=True):
    print(f" {word:<12} {pct:>6.2f}%")

Output:

 is            18.75%
 python        12.50%
 for           12.50%
 data          12.50%
 science       12.50%
 great          6.25%
 fun            6.25%
 the            6.25%
 best           6.25%
 learning       6.25%

Now "Data" and "data" are correctly merged into a single entry with a combined frequency of 12.50%.
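
If your text may contain non-ASCII characters, str.casefold() is a slightly more aggressive normalization than lower() and is Python's recommended tool for caseless matching:

```python
# casefold() handles cases that lower() misses, e.g. German "ß" -> "ss"
words = ["Straße", "STRASSE"]

print([w.lower() for w in words])     # still two distinct words
print([w.casefold() for w in words])  # merged into one form
```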

Common Mistake: Using list.count() Inside a Loop

A frequent performance pitfall is calling .count() inside a loop, which rescans the entire list for every unique word:

Wrong approach: O(n²) time complexity

sentences = ["Python is great", "Python is fun"]
all_words = " ".join(sentences).split()

frequency = {}
for word in all_words:
    if word not in frequency:
        # .count() scans the ENTIRE list each time: O(n) per call
        frequency[word] = all_words.count(word) / len(all_words)

print(frequency)

Output:

{'Python': 0.3333333333333333, 'is': 0.3333333333333333, 'great': 0.16666666666666666, 'fun': 0.16666666666666666}
Note: This works but is quadratic in time complexity. For a list with 10,000 words, .count() is called once per unique word, and each call scans the entire list.
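
You can measure the gap yourself with the standard timeit module (absolute numbers vary by machine; the ratio is what matters):

```python
import timeit
from collections import Counter

# 10,000 words, 1,000 of them unique
words = [f"word{i % 1000}" for i in range(10_000)]

def with_list_count():
    # Quadratic: one full scan of `words` per unique word
    return {w: words.count(w) / len(words) for w in set(words)}

def with_counter():
    # Linear: a single pass over `words`
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Both produce identical frequencies; Counter is dramatically faster
print("list.count():", timeit.timeit(with_list_count, number=3))
print("Counter:     ", timeit.timeit(with_counter, number=3))
```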

Correct approach: O(n) with Counter

from collections import Counter

sentences = ["Python is great", "Python is fun"]
all_words = " ".join(sentences).split()

word_counts = Counter(all_words)
total = sum(word_counts.values())
frequency = {word: count / total for word, count in word_counts.items()}

print(frequency)

Output:

{'Python': 0.3333333333333333, 'is': 0.3333333333333333, 'great': 0.16666666666666666, 'fun': 0.16666666666666666}
Performance matters at scale

For small datasets, both approaches work fine. But with thousands of sentences or millions of words, the Counter approach is orders of magnitude faster. Always prefer Counter over manual .count() loops.

Creating a Reusable Function

A production-ready function with options for case sensitivity, sorting, and percentage formatting:

from collections import Counter

def word_frequency_percentage(
    texts: list[str],
    case_sensitive: bool = False,
    top_n: int | None = None,
    as_percentage: bool = True,
) -> dict[str, float]:
    """Calculate the frequency percentage of each word across a list of strings.

    Args:
        texts: List of strings to analyze.
        case_sensitive: Whether to treat 'Word' and 'word' as different (default: False).
        top_n: Return only the top N most frequent words (default: all).
        as_percentage: If True, values are percentages (0-100). If False, fractions (0-1).

    Returns:
        Dictionary mapping words to their frequency percentages, sorted descending.
    """
    combined = " ".join(texts)
    if not case_sensitive:
        combined = combined.lower()

    word_counts = Counter(combined.split())
    total_words = sum(word_counts.values())

    if total_words == 0:
        return {}

    multiplier = 100 if as_percentage else 1
    items = word_counts.most_common(top_n)

    return {word: (count / total_words) * multiplier for word, count in items}


# Usage
sentences = [
    "Python is great for data science",
    "Data science is fun",
    "Python is the best for learning",
]

# Top 5 words as percentages
result = word_frequency_percentage(sentences, top_n=5)

print("Top 5 words by frequency:")
for word, pct in result.items():
    print(f" {word:<12} {pct:.2f}%")

Output:

Top 5 words by frequency:
 is           18.75%
 python       12.50%
 for          12.50%
 data         12.50%
 science      12.50%

Working with Files

You can easily adapt this approach to analyze text files:

from collections import Counter

def analyze_file(filepath: str, top_n: int = 10) -> dict[str, float]:
    """Analyze word frequency percentages in a text file."""
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read().lower()

    # Remove basic punctuation
    for char in ".,!?;:\"'()-":
        text = text.replace(char, "")

    word_counts = Counter(text.split())
    total = sum(word_counts.values())

    return {
        word: (count / total) * 100
        for word, count in word_counts.most_common(top_n)
    }


# Usage (assuming a file exists)
result = analyze_file("sample.txt", top_n=10)
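
To try it without an existing file, the same steps can be run against a throwaway temporary file (tempfile here just stands in for a real sample.txt):

```python
import os
import tempfile
from collections import Counter

# Create a throwaway file to analyze
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("The cat sat. The cat ran!")
    path = f.name

# Same steps as analyze_file, inlined
with open(path, encoding="utf-8") as f:
    text = f.read().lower()
for char in ".,!?;:\"'()-":
    text = text.replace(char, "")

word_counts = Counter(text.split())
total = sum(word_counts.values())
result = {w: (c / total) * 100 for w, c in word_counts.most_common(3)}

print(result)  # "the" and "cat" each account for a third of the 6 words
os.remove(path)
```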

Visualizing Word Frequencies

For a quick visual representation, you can create a simple bar chart using the results:

from collections import Counter

sentences = [
    "Python is great for data science",
    "Data science is fun with Python",
    "Python is the best language for data analysis",
]

all_words = " ".join(sentences).lower().split()
word_counts = Counter(all_words)
total = sum(word_counts.values())

print("Word Frequency Chart")
print("=" * 50)

for word, count in word_counts.most_common(8):
    pct = (count / total) * 100
    bar = "ā–ˆ" * int(pct * 2)
    print(f" {word:<12} {bar} {pct:.1f}%")

Output:

Word Frequency Chart
==================================================
 python       ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 15.0%
 is           ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 15.0%
 data         ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 15.0%
 for          ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 10.0%
 science      ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 10.0%
 great        ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 5.0%
 fun          ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 5.0%
 with         ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ 5.0%

Quick Comparison of Methods

Method                         Time Complexity   Readability     Best For
Counter + dict comprehension   O(n)              ⭐⭐⭐ High      Most use cases (recommended)
Manual loop with Counter       O(n)              ⭐⭐⭐ High      Extra processing during counting
list.count() in loop           O(n²)             ⭐⭐ Medium     Avoid: inefficient
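
The "manual loop with Counter" row refers to updating a Counter yourself, which is still O(n) and lets you do extra per-word processing along the way. A sketch, where the extra processing (dropping very short tokens) is just an illustration:

```python
from collections import Counter

sentences = ["Python is great", "Python is fun"]

counts = Counter()
for sentence in sentences:
    for word in sentence.split():
        if len(word) > 2:  # extra processing: drop very short tokens
            counts[word.lower()] += 1

total = sum(counts.values())
print({w: c / total for w, c in counts.items()})
```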

Conclusion

Calculating word frequency percentages in Python is simple and efficient with the right tools:

  • Counter from collections is the recommended approach: it's fast, clean, and provides useful methods like .most_common().
  • Normalize case with .lower() to avoid treating the same word as different entries.
  • Avoid list.count() inside loops: it creates O(n²) complexity that becomes a bottleneck with large texts.
  • Use the top_n pattern with .most_common(n) to focus on the most significant words.
  • For real-world text analysis, consider removing punctuation and stop words (common words like "the", "is", "and") before computing frequencies.
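
The stop-word filtering mentioned above can be as simple as the sketch below. The stop-word set here is a minimal illustration; libraries such as NLTK ship much fuller lists:

```python
from collections import Counter

# Minimal illustrative stop-word set -- not exhaustive
STOP_WORDS = {"the", "is", "and", "for", "a", "an", "of", "to", "in"}

sentences = ["Python is great for data science", "Data science is fun"]
words = [w for w in " ".join(sentences).lower().split() if w not in STOP_WORDS]

counts = Counter(words)
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"{word:<10} {count / total * 100:.2f}%")
```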