How to Compare Word and Character Frequencies in Python
Analyzing text data often requires determining how frequently specific characters or words appear. Whether you are building a search engine, analyzing sentiment, or simply cleaning data, comparing string frequencies is a fundamental task.
This guide explains how to count occurrences using basic methods, streamline the process with the collections module, and compare the frequency distributions of two different text sources.
Understanding String Frequency Analysis
At its core, frequency analysis involves mapping unique items (characters or words) to their counts. While you can use standard loops and dictionaries, Python's standard library provides tools that are more concise and usually faster.
Basic Counting with count()
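For simple, single-target queries, the string method .count() is sufficient. Keep in mind that it counts non-overlapping substring matches, not whole words, so "hello" would also be found inside "othello".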
text = "hello world hello python"
# ✅ Simple check for a specific substring
count_hello = text.count("hello")
print(f"Count of 'hello': {count_hello}")
Output:
Count of 'hello': 2
However, if you need to count all words or compare distributions, .count() is inefficient because each call rescans the entire text, so counting every unique word takes one full pass per word.
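For comparison, here is a minimal sketch of the single-pass alternative using a plain dictionary. It reuses the sample text from above; the variable names are illustrative only.

```python
text = "hello world hello python"

# One pass over the words: each word is looked up and incremented once,
# instead of calling text.count(word) separately for every unique word.
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'hello': 2, 'world': 1, 'python': 1}

# Splitting first also avoids substring matches, which .count() includes:
substring_hits = "othello says hello".count("hello")
print(substring_hits)  # 2, because "othello" contains "hello"
```

This is the pattern that the Counter class in the next section packages up for you.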
Method 1: Using collections.Counter (Recommended)
The collections module provides the Counter class, a dict subclass designed specifically for counting hashable objects. It is the standard-library tool most Python code reaches for when doing frequency analysis.
Counting Characters vs. Words
from collections import Counter
text = "banana"
sentence = "python is fun and python is powerful"
# ✅ Counting Characters
char_freq = Counter(text)
print(f"Character Frequencies: {char_freq}")
# ✅ Counting Words (requires splitting)
word_freq = Counter(sentence.split())
print(f"Word Frequencies: {word_freq}")
# ✅ Getting the most common items
print(f"Top 2 words: {word_freq.most_common(2)}")
Output:
Character Frequencies: Counter({'a': 3, 'n': 2, 'b': 1})
Word Frequencies: Counter({'python': 2, 'is': 2, 'fun': 1, 'and': 1, 'powerful': 1})
Top 2 words: [('python', 2), ('is', 2)]
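Because Counter is a dictionary subclass, it also behaves sensibly for lookups and incremental updates. The short sketch below illustrates two behaviors that differ from a plain dict; the extra sample strings are made up for illustration.

```python
from collections import Counter

word_freq = Counter("python is fun and python is powerful".split())

# Missing keys return 0 instead of raising KeyError
print(word_freq["java"])  # 0

# update() adds to the existing counts rather than replacing them
word_freq.update("python is popular".split())
print(word_freq["python"])        # 3
print(word_freq.most_common(1))   # [('python', 3)]
```

This makes it practical to build up counts from a text that arrives in chunks, such as a file read line by line.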
Method 2: Comparing Two Texts
A powerful feature of Counter objects is that they support mathematical operations. You can find the intersection (common words) or difference (unique words) between two texts simply by using operators like & and -.
Finding Similarities and Differences
from collections import Counter
text_a = "apple banana orange apple"
text_b = "banana grape apple banana"
# Create frequency maps
freq_a = Counter(text_a.split())
freq_b = Counter(text_b.split())
# ✅ Intersection: Minimum counts of common elements
common = freq_a & freq_b
print(f"Shared content: {common}")
# ✅ Difference: Elements in A that are not in B (or excess counts)
diff_a_b = freq_a - freq_b
print(f"Unique to A (or excess): {diff_a_b}")
Output:
Shared content: Counter({'apple': 1, 'banana': 1})
Unique to A (or excess): Counter({'apple': 1, 'orange': 1})
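In the intersection example, "apple" appears twice in A and once in B, so the shared count is 1 (the minimum of the two). In the difference, "banana" disappears entirely because A has fewer occurrences than B, and Counter subtraction keeps only positive counts.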
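Raw counts can mislead when the two texts differ in length, so one option is to compare relative frequencies instead. The following is a rough sketch under that assumption, not a formal statistical test; relative_freq is a hypothetical helper name, and the + operator is used only to collect the combined vocabulary.

```python
from collections import Counter

def relative_freq(counter):
    """Convert raw counts to fractions of the total (hypothetical helper)."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counter.items()}

freq_a = Counter("apple banana orange apple".split())
freq_b = Counter("banana grape apple banana".split())

# Addition sums the counts from both texts, giving the full vocabulary
combined = freq_a + freq_b

rel_a = relative_freq(freq_a)
rel_b = relative_freq(freq_b)

# Report each word's share in A and B, plus the signed difference
for word in sorted(combined):
    share_a = rel_a.get(word, 0.0)
    share_b = rel_b.get(word, 0.0)
    print(f"{word:<8} A={share_a:.2f} B={share_b:.2f} diff={share_a - share_b:+.2f}")
```

A positive difference means the word makes up a larger share of text A than of text B, regardless of how long each text is.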
Common Pitfall: Case Sensitivity and Punctuation
A frequent error in frequency comparison is failing to normalize the input. To a computer, "Python" and "python" are completely different strings, and "end." is different from "end".
Handling Noise in Text
from collections import Counter
import re
raw_text = "Python is great. python is easy!"
# ⛔️ Error: Case sensitivity causes duplication
# 'Python' and 'python' are counted as separate entries
bad_counter = Counter(raw_text.split())
print(f"Bad Count: {bad_counter}")
# ✅ Solution: Normalize case and remove punctuation
# 1. Convert to lowercase
# 2. Use regex to extract runs of word characters (\w matches letters, digits, and underscores)
clean_words = re.findall(r'\w+', raw_text.lower())
good_counter = Counter(clean_words)
print(f"Good Count: {good_counter}")
Output:
Bad Count: Counter({'is': 2, 'Python': 1, 'great.': 1, 'python': 1, 'easy!': 1})
Good Count: Counter({'python': 2, 'is': 2, 'great': 1, 'easy': 1})
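One limitation of the \w+ pattern is that it splits contractions at the apostrophe. If that matters for your data, a slightly stricter pattern can help. The sketch below assumes plain English text in ASCII; WORD_RE and word_counts are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumed pattern: runs of ASCII letters, optionally joined by apostrophes
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def word_counts(text):
    """Lowercase the text and count apostrophe-aware word tokens."""
    return Counter(WORD_RE.findall(text.lower()))

# \w+ breaks the contraction into two tokens
print(re.findall(r"\w+", "don't"))  # ['don', 't']

# The stricter pattern keeps it whole
print(word_counts("Don't stop. don't!"))
```

Wrapping normalization in a single function also ensures both texts in a comparison are cleaned the same way.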
Conclusion
Comparing string frequencies allows you to uncover patterns and similarities in text data.
- Use collections.Counter for virtually all frequency tasks; it is faster and cleaner than manual loops.
- Use Math Operators (&, -, +) on Counter objects to compare two different texts efficiently.
- Normalize Data by lowercasing text and stripping punctuation to ensure accurate counts.