How to Count Word Frequency in Text in Python
Word frequency analysis is a foundational technique in text processing, natural language processing, and data analytics. From analyzing customer feedback to building search engines, understanding which words appear most often reveals valuable patterns in your data. However, accurate word counting requires more than simply splitting text by spaces. You must handle punctuation, case sensitivity, and special characters to get reliable results.
In this guide, you will learn how to count word frequencies using Python's built-in tools, clean messy real-world text, filter out common stop words, and build a reusable function that handles a variety of analysis scenarios.
Quick Solution with Counter and String Cleaning
For most text analysis tasks, combining Python's collections.Counter with punctuation removal provides an efficient and readable solution:
import string
from collections import Counter
text = "Mango, apple, Orange, apple! ORANGE."
# Remove punctuation and normalize to lowercase
translator = str.maketrans("", "", string.punctuation)
clean_text = text.translate(translator).lower()
# Split into words and count frequencies
word_counts = Counter(clean_text.split())
print(word_counts)
Output:
Counter({'apple': 2, 'orange': 2, 'mango': 1})
The str.maketrans() and .translate() combination strips all punctuation characters in a single pass, and .lower() ensures that "Orange" and "ORANGE" are counted as the same word.
Use the .most_common() method to retrieve the most frequent words in descending order:
top_three = word_counts.most_common(3)
print(top_three)
Output:
[('apple', 2), ('orange', 2), ('mango', 1)]
Why Simple .split() Is Not Enough
A common beginner approach is to split the text and count directly without any cleaning:
from collections import Counter
text = "Mango, apple, Orange, apple! ORANGE."
word_counts = Counter(text.split())
print(word_counts)
Output:
Counter({'Mango,': 1, 'apple,': 1, 'Orange,': 1, 'apple!': 1, 'ORANGE.': 1})
Every word is treated as unique because punctuation is still attached and casing differs. "apple,", "apple!", and "ORANGE." are all counted as separate entries. Always clean your text before counting.
Handling Complex Text with Regular Expressions
When processing messy real-world data containing numbers, timestamps, special symbols, or mixed formatting, regular expressions provide precise control over what qualifies as a "word":
import re
from collections import Counter
raw_data = "Error: Code 404... Error found at 10:00! [ERROR] logged."
# Extract only alphabetic words, normalized to lowercase
words = re.findall(r"\b[a-zA-Z]+\b", raw_data.lower())
frequency = Counter(words)
print(frequency)
Output:
Counter({'error': 3, 'code': 1, 'found': 1, 'at': 1, 'logged': 1})
The pattern r"\b[a-zA-Z]+\b" matches only sequences of letters bounded by word boundaries. This automatically excludes numbers like 404 and timestamps like 10:00 from your word count, giving you cleaner analytical results without writing additional filtering logic.
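Note that the letters-only pattern drops apostrophes and hyphens, so "don't" is split into "don" and "t". If contractions and hyphenated words matter for your data, a slightly extended pattern (shown here as one possible variant, not the only correct one) keeps them intact:

```python
import re
from collections import Counter

text = "Don't split don't or state-of-the-art into fragments."

# Letters plus optional apostrophe- or hyphen-joined parts; this pattern
# is an assumption to illustrate the idea, so tune it to your own data
pattern = r"\b[a-zA-Z]+(?:['-][a-zA-Z]+)*\b"
words = re.findall(pattern, text.lower())
frequency = Counter(words)
print(frequency)
# Counter({"don't": 2, 'split': 1, 'or': 1, 'state-of-the-art': 1,
#          'into': 1, 'fragments': 1})
```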
Processing Text Files
For analyzing entire files or documents, read the content and apply the same cleaning techniques:
import re
from collections import Counter
def analyze_file(filepath):
    """Count word frequencies in a text file."""
    with open(filepath, "r", encoding="utf-8") as file:
        content = file.read().lower()
    words = re.findall(r"\b[a-zA-Z]+\b", content)
    return Counter(words)
# Usage
word_freq = analyze_file("article.txt")
print(word_freq.most_common(10))
For very large files that might not fit in memory, process the file line by line instead:
import re
from collections import Counter
def analyze_large_file(filepath):
    """Count word frequencies in a large text file line by line."""
    word_counts = Counter()
    with open(filepath, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b[a-zA-Z]+\b", line.lower())
            word_counts.update(words)
    return word_counts
The .update() method on a Counter object adds counts from the new words to the existing totals, so you get the same result as processing the entire file at once but with constant memory usage.
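You can verify the accumulation behavior on a small standalone example:

```python
from collections import Counter

word_counts = Counter()
word_counts.update(["error", "found"])   # first chunk of words
word_counts.update(["error", "logged"])  # second chunk adds to existing totals
print(word_counts)  # Counter({'error': 2, 'found': 1, 'logged': 1})
```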
Always specify encoding="utf-8" when opening text files. On Windows, the default encoding may differ from UTF-8, which can cause inconsistent results or UnicodeDecodeError exceptions.
Excluding Common Stop Words
For meaningful analysis, you often want to filter out common words like "the", "and", or "is" that carry little analytical value:
import re
from collections import Counter
text = "The quick brown fox jumps over the lazy dog and the cat"
# Common English stop words
stop_words = {
    "the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
    "in", "on", "at", "to", "for", "of", "with", "over"
}
# Extract and filter words
words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
filtered_words = [word for word in words if word not in stop_words]
frequency = Counter(filtered_words)
print(frequency)
Output:
Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1, 'cat': 1})
With stop words removed, the remaining words reveal the actual content of the text rather than being dominated by common grammatical words.
For production-level NLP work, consider using a comprehensive stop word list from a library like NLTK (nltk.corpus.stopwords) or spaCy instead of maintaining your own set. These libraries provide curated stop word lists for multiple languages.
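As a sketch of that approach (assuming NLTK is installed and its stopwords corpus has been fetched once with nltk.download("stopwords"); the small fallback set is only for illustration):

```python
import re
from collections import Counter

# Prefer NLTK's curated English stop word list when available;
# fall back to a minimal manual set otherwise (illustration only)
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    stop_words = {"the", "a", "an", "and", "is", "over"}

text = "The quick brown fox jumps over the lazy dog"
words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
frequency = Counter(w for w in words if w not in stop_words)
print(frequency)
```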
Complete Reusable Function
Here is a comprehensive function that combines all the techniques covered above into a flexible, reusable tool:
import re
from collections import Counter
def count_words(text, min_length=1, exclude_words=None, top_n=None):
    """
    Count word frequencies with customizable filtering.

    Args:
        text: Input string to analyze.
        min_length: Minimum word length to include.
        exclude_words: Set of words to exclude (stop words).
        top_n: Return only the top N most common words.

    Returns:
        Counter object, or list of tuples if top_n is specified.
    """
    exclude_words = exclude_words or set()
    # Extract alphabetic words
    words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
    # Apply filters
    filtered = [
        w for w in words
        if len(w) >= min_length and w not in exclude_words
    ]
    counts = Counter(filtered)
    return counts.most_common(top_n) if top_n else counts
# Usage examples
sample = "Python is amazing. Python is powerful. Python is everywhere!"
print(count_words(sample))
print(count_words(sample, exclude_words={"is"}, top_n=3))
print(count_words(sample, min_length=5))
Output:
Counter({'python': 3, 'is': 3, 'amazing': 1, 'powerful': 1, 'everywhere': 1})
[('python', 3), ('amazing', 1), ('powerful', 1)]
Counter({'python': 3, 'amazing': 1, 'powerful': 1, 'everywhere': 1})
The min_length parameter filters out short words, exclude_words removes stop words, and top_n limits the results to the most frequent entries.
Method Comparison
| Approach | Best For | Handles Punctuation | Performance |
|---|---|---|---|
| split() only | Quick debugging, already clean text | No | Fastest |
| translate() + Counter | Clean prose, articles, simple text | Yes | Fast |
| re.findall() + Counter | Messy data, logs, web scraping | Yes | Moderate |
Conclusion
For most word frequency tasks, the combination of re.findall() for word extraction and collections.Counter for counting provides the best balance of accuracy and flexibility.
- Use the simpler str.translate() + split() approach when your text is clean prose without numbers or special formatting.
- Always normalize case with .lower() or .casefold() before counting, and remove or filter stop words when you need meaningful analytical results rather than raw frequency data.
- By separating text cleaning from frequency counting, you create flexible and maintainable code that produces accurate word statistics regardless of the input source.
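The difference between .lower() and .casefold() only appears with certain non-English characters; the German sharp s is the classic example:

```python
# The German sharp s lowercases to itself but casefolds to "ss",
# so "Straße" and "STRASSE" only match after casefold()
print("Straße".lower())     # straße
print("STRASSE".lower())    # strasse
print("Straße".casefold())  # strasse
print("Straße".casefold() == "STRASSE".casefold())  # True
```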