How to Count Word Frequency in Text in Python
Word frequency analysis is a foundational technique in text processing, natural language processing, and data analytics. From analyzing customer feedback to building search engines, understanding which words appear most often reveals valuable patterns in your data. However, accurate word counting requires more than simply splitting text by spaces. You must handle punctuation, case sensitivity, and special characters to get reliable results.
In this guide, you will learn how to count word frequencies using Python's built-in tools, clean messy real-world text, filter out common stop words, and build a reusable function that handles a variety of analysis scenarios.
Quick Solution with Counter and String Cleaning
For most text analysis tasks, combining Python's collections.Counter with punctuation removal provides an efficient and readable solution:
import string
from collections import Counter
text = "Mango, apple, Orange, apple! ORANGE."
# Remove punctuation and normalize to lowercase
translator = str.maketrans("", "", string.punctuation)
clean_text = text.translate(translator).lower()
# Split into words and count frequencies
word_counts = Counter(clean_text.split())
print(word_counts)
Output:
Counter({'apple': 2, 'orange': 2, 'mango': 1})
The str.maketrans() and .translate() combination strips all punctuation characters in a single pass, and .lower() ensures that "Orange" and "ORANGE" are counted as the same word.
Use the .most_common() method to retrieve the most frequent words in descending order:
top_three = word_counts.most_common(3)
print(top_three)
Output:
[('apple', 2), ('orange', 2), ('mango', 1)]
Why Simple .split() Is Not Enough
A common beginner approach is to split the text and count directly without any cleaning:
from collections import Counter
text = "Mango, apple, Orange, apple! ORANGE."
word_counts = Counter(text.split())
print(word_counts)
Output:
Counter({'Mango,': 1, 'apple,': 1, 'Orange,': 1, 'apple!': 1, 'ORANGE.': 1})
Every word is treated as unique because punctuation is still attached and casing differs. "apple,", "apple!", and "ORANGE." are all counted as separate entries. Always clean your text before counting.
Handling Complex Text with Regular Expressions
When processing messy real-world data containing numbers, timestamps, special symbols, or mixed formatting, regular expressions provide precise control over what qualifies as a "word":
import re
from collections import Counter
raw_data = "Error: Code 404... Error found at 10:00! [ERROR] logged."
# Extract only alphabetic words, normalized to lowercase
words = re.findall(r"\b[a-zA-Z]+\b", raw_data.lower())
frequency = Counter(words)
print(frequency)
Output:
Counter({'error': 3, 'code': 1, 'found': 1, 'at': 1, 'logged': 1})
The pattern r"\b[a-zA-Z]+\b" matches only sequences of letters bounded by word boundaries. This automatically excludes numbers like 404 and timestamps like 10:00 from your word count, giving you cleaner analytical results without writing additional filtering logic.
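Note that the letters-only pattern drops apostrophes and hyphens, so "don't" is split into "don" and "t". If contractions and hyphenated words matter for your data, a slightly extended pattern (shown here as one possible variant, not the only correct one) keeps them intact:

```python
import re
from collections import Counter

text = "Don't split don't or state-of-the-art into fragments."

# Letters plus optional apostrophe- or hyphen-joined parts; this pattern
# is an assumption to illustrate the idea, so tune it to your own data
pattern = r"\b[a-zA-Z]+(?:['-][a-zA-Z]+)*\b"
words = re.findall(pattern, text.lower())
frequency = Counter(words)
print(frequency)
# Counter({"don't": 2, 'split': 1, 'or': 1, 'state-of-the-art': 1,
#          'into': 1, 'fragments': 1})
```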
Processing Text Files
For analyzing entire files or documents, read the content and apply the same cleaning techniques:
import re
from collections import Counter
def analyze_file(filepath):
    """Count word frequencies in a text file."""
    with open(filepath, "r", encoding="utf-8") as file:
        content = file.read().lower()
    words = re.findall(r"\b[a-zA-Z]+\b", content)
    return Counter(words)
# Usage
word_freq = analyze_file("article.txt")
print(word_freq.most_common(10))
For very large files that might not fit in memory, process the file line by line instead:
import re
from collections import Counter
def analyze_large_file(filepath):
    """Count word frequencies in a large text file line by line."""
    word_counts = Counter()
    with open(filepath, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b[a-zA-Z]+\b", line.lower())
            word_counts.update(words)
    return word_counts
The .update() method on a Counter object adds counts from the new words to the existing totals, so you get the same result as processing the entire file at once but with constant memory usage.
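You can verify the accumulation behavior on a small standalone example:

```python
from collections import Counter

word_counts = Counter()
word_counts.update(["error", "found"])   # first chunk of words
word_counts.update(["error", "logged"])  # second chunk adds to existing totals
print(word_counts)  # Counter({'error': 2, 'found': 1, 'logged': 1})
```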
Always specify encoding="utf-8" when opening text files. On Windows, the default encoding may differ from UTF-8, which can cause inconsistent results or UnicodeDecodeError exceptions.
Excluding Common Stop Words
For meaningful analysis, you often want to filter out common words like "the", "and", or "is" that carry little analytical value:
import re
from collections import Counter
text = "The quick brown fox jumps over the lazy dog and the cat"
# Common English stop words
stop_words = {
    "the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
    "in", "on", "at", "to", "for", "of", "with", "over"
}
# Extract and filter words
words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
filtered_words = [word for word in words if word not in stop_words]
frequency = Counter(filtered_words)
print(frequency)
Output:
Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1, 'cat': 1})
With stop words removed, the remaining words reveal the actual content of the text rather than being dominated by common grammatical words.
For production-level NLP work, consider using a comprehensive stop word list from a library like NLTK (nltk.corpus.stopwords) or spaCy instead of maintaining your own set. These libraries provide curated stop word lists for multiple languages.
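As a sketch of that approach (assuming NLTK is installed and its stopwords corpus has been fetched once with nltk.download("stopwords"); the small fallback set is only for illustration):

```python
import re
from collections import Counter

# Prefer NLTK's curated English stop word list when available;
# fall back to a minimal manual set otherwise (illustration only)
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    stop_words = {"the", "a", "an", "and", "is", "over"}

text = "The quick brown fox jumps over the lazy dog"
words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
frequency = Counter(w for w in words if w not in stop_words)
print(frequency)
```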
Complete Reusable Function
Here is a comprehensive function that combines all the techniques covered above into a flexible, reusable tool:
import re
from collections import Counter
def count_words(text, min_length=1, exclude_words=None, top_n=None):
    """
    Count word frequencies with customizable filtering.

    Args:
        text: Input string to analyze.
        min_length: Minimum word length to include.
        exclude_words: Set of words to exclude (stop words).
        top_n: Return only the top N most common words.

    Returns:
        Counter object, or list of tuples if top_n is specified.
    """
    exclude_words = exclude_words or set()
    # Extract alphabetic words
    words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
    # Apply filters
    filtered = [
        w for w in words
        if len(w) >= min_length and w not in exclude_words
    ]
    counts = Counter(filtered)
    return counts.most_common(top_n) if top_n else counts
# Usage examples
sample = "Python is amazing. Python is powerful. Python is everywhere!"
print(count_words(sample))
print(count_words(sample, exclude_words={"is"}, top_n=3))
print(count_words(sample, min_length=5))
Output:
Counter({'python': 3, 'is': 3, 'amazing': 1, 'powerful': 1, 'everywhere': 1})
[('python', 3), ('amazing', 1), ('powerful', 1)]
Counter({'python': 3, 'amazing': 1, 'powerful': 1, 'everywhere': 1})
The min_length parameter filters out short words, exclude_words removes stop words, and top_n limits the results to the most frequent entries.
Method Comparison
| Approach | Best For | Handles Punctuation | Performance |
|---|---|---|---|
| split() only | Quick debugging, already clean text | No | Fastest |
| translate() + Counter | Clean prose, articles, simple text | Yes | Fast |
| re.findall() + Counter | Messy data, logs, web scraping | Yes | Moderate |
Conclusion
For most word frequency tasks, the combination of re.findall() for word extraction and collections.Counter for counting provides the best balance of accuracy and flexibility.
- Use the simpler str.translate() + split() approach when your text is clean prose without numbers or special formatting.
- Always normalize case with .lower() or .casefold() before counting, and remove or filter stop words when you need meaningful analytical results rather than raw frequency data.
- By separating text cleaning from frequency counting, you create flexible and maintainable code that produces accurate word statistics regardless of the input source.
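The difference between .lower() and .casefold() only appears with certain non-English characters; the German sharp s is the classic example:

```python
# The German sharp s lowercases to itself but casefolds to "ss",
# so "Straße" and "STRASSE" only match after casefold()
print("Straße".lower())     # straße
print("STRASSE".lower())    # strasse
print("Straße".casefold())  # strasse
print("Straße".casefold() == "STRASSE".casefold())  # True
```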