How to Build a Word Frequency Dictionary in Python
Word frequency dictionaries are foundational for text analysis, sentiment detection, keyword extraction, and search engines. Python's collections.Counter provides a high-performance, readable solution.
The Industry Standard: collections.Counter
The Counter class is specifically designed for counting hashable objects:
from collections import Counter
text = "Python is great and Python is fast"
# Split text into words and count
words = text.split()
freq_map = Counter(words)
print(dict(freq_map))
Output:
{'Python': 2, 'is': 2, 'great': 1, 'and': 1, 'fast': 1}
In CPython, Counter's counting loop is accelerated by a C helper function, making it noticeably faster than an equivalent pure-Python counting loop on large text volumes.
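As a rough check of that claim, here is a minimal timing sketch. The corpus size and repetition count are illustrative choices, and absolute timings will vary by machine:

```python
from collections import Counter
from timeit import timeit

# Build a modest synthetic corpus (illustrative size)
words = ("python is great and python is fast " * 1000).split()

def manual_count(ws):
    """Pure-Python counting loop for comparison."""
    freq = {}
    for w in ws:
        freq[w] = freq.get(w, 0) + 1
    return freq

# Both approaches produce identical counts
assert Counter(words) == manual_count(words)

counter_time = timeit(lambda: Counter(words), number=100)
manual_time = timeit(lambda: manual_count(words), number=100)
print(f"Counter: {counter_time:.3f}s  manual loop: {manual_time:.3f}s")
```

On a typical CPython build the Counter version comes out ahead, though the gap depends on corpus size and Python version.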
Real-World Cleaning: Punctuation and Case
In practice, "Python," and "python" should count as the same word:
import re
from collections import Counter
messy_text = "Data, data everywhere; DATA is key!"
# Lowercase and extract only alphanumeric words
clean_words = re.findall(r'\w+', messy_text.lower())
word_counts = Counter(clean_words)
print(dict(word_counts))
Output:
{'data': 3, 'everywhere': 1, 'is': 1, 'key': 1}
Getting Top N Words
Counter provides built-in ranking:
from collections import Counter
text = "the quick brown fox jumps over the lazy dog the fox"
words = text.lower().split()
word_counts = Counter(words)
# Get 3 most common words
top_three = word_counts.most_common(3)
print(top_three)
Output:
[('the', 3), ('fox', 2), ('quick', 1)]
Use .most_common(n) to get the top N words without manual sorting logic.
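Called with no argument, most_common() returns every entry sorted by descending count, which is handy when you want the full ranking rather than a fixed top N:

```python
from collections import Counter

word_counts = Counter("the quick brown fox jumps over the lazy dog the fox".split())

# No argument: all words, highest frequency first
ranked = word_counts.most_common()
print(ranked[0])  # ('the', 3)
```

Words with equal counts keep their first-seen order, so ties are resolved deterministically.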
Alternative: defaultdict
For custom counting logic within loops:
from collections import defaultdict
text = "Python is great and Python is fast"
words = text.split()
freq = defaultdict(int)
for word in words:
    freq[word] += 1
print(dict(freq))
Output:
{'Python': 2, 'is': 2, 'great': 1, 'and': 1, 'fast': 1}
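Where defaultdict really earns its place is when each key accumulates something richer than a simple tally. As one illustrative sketch (recording word positions is just an example of custom logic), you can track every index at which a word appears:

```python
from collections import defaultdict

text = "Python is great and Python is fast"

# Each key maps to a list of positions instead of a bare count
positions = defaultdict(list)
for i, word in enumerate(text.split()):
    positions[word].append(i)

print(dict(positions))
# {'Python': [0, 4], 'is': [1, 5], 'great': [2], 'and': [3], 'fast': [6]}
```

The word frequency is then just len(positions[word]), but you also keep where each occurrence happened, which Counter alone does not give you.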
Method Comparison
| Method | Speed | Complexity | Use Case |
|---|---|---|---|
| Counter | Fast | O(n) | Standard word counting |
| defaultdict(int) | Fast | O(n) | Custom counting logic |
| .count() in loop | Slow | O(n²) | Avoid |
Never use text.count(word) inside a loop:
# Bad: O(n²) - scans entire text for each unique word
for word in set(words):
    count = words.count(word)  # Full scan each time!

# Good: O(n) - single pass
freq = Counter(words)
Complete Example
import re
from collections import Counter
def word_frequency(text, top_n=None):
    """Build word frequency dictionary from text."""
    # Clean: lowercase, extract words only
    words = re.findall(r'\b[a-z]+\b', text.lower())
    counts = Counter(words)
    if top_n:
        return dict(counts.most_common(top_n))
    return dict(counts)
# Usage
article = """
Python is amazing. Python handles data well.
Data science loves Python!
"""
print(word_frequency(article))
# {'python': 3, 'is': 1, 'amazing': 1, 'handles': 1, ...}
print(word_frequency(article, top_n=3))
# {'python': 3, 'data': 2, 'is': 1}
Output:
{'python': 3, 'is': 1, 'amazing': 1, 'handles': 1, 'data': 2, 'well': 1, 'science': 1, 'loves': 1}
{'python': 3, 'data': 2, 'is': 1}
Quick Reference
| Goal | Code |
|---|---|
| Basic count | Counter(words) |
| Top N words | counter.most_common(n) |
| Clean text | re.findall(r'\w+', text.lower()) |
| To dictionary | dict(counter) |
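One more Counter feature worth knowing for frequency work: Counters support arithmetic, so merging counts from several documents is a one-liner. A small sketch with two illustrative snippets of text:

```python
from collections import Counter

doc1 = Counter("python is great".split())
doc2 = Counter("python is fast".split())

# Adding Counters sums the counts for each word
combined = doc1 + doc2
print(dict(combined))
# {'python': 2, 'is': 2, 'great': 1, 'fast': 1}
```

Subtraction (doc1 - doc2) and in-place merging with update() work the same way, which makes Counter a natural fit for incremental corpus processing.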
Summary
Use collections.Counter for efficient single-pass word counting. Clean text with re.findall() to normalize case and remove punctuation. Use .most_common(n) to get top words without manual sorting. Avoid the O(n²) trap of calling .count() inside loops.