How to Count Words in a Text File in Python
Counting words in a text file is one of the most fundamental text processing tasks. Whether you are analyzing a document, processing log files, or building a content management tool, you need a reliable way to determine how many words a file contains. The right approach depends on the size of the file and how precisely you need to define what counts as a "word."
In this guide, you will learn multiple methods for counting words in a text file, from simple whitespace splitting to regex-based extraction that handles contractions and hyphenated words. Each approach includes clear examples, explanations of trade-offs, and guidance on when to use it.
Simple Approach for Small Files
For files that fit comfortably in memory, the simplest method is to read the entire content and use .split():
with open("document.txt", "r", encoding="utf-8") as file:
    content = file.read()

words = content.split()
print(f"Word count: {len(words)}")
Example output:
Word count: 342
The .split() method without arguments splits on any whitespace, including spaces, tabs, and newlines, and automatically handles multiple consecutive whitespace characters. This makes it robust for most plain text files.
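To see this behavior in isolation, here is a quick check on a throwaway string (no file needed): tabs, newlines, and runs of spaces all collapse into single separators.

```python
# .split() with no arguments treats any run of whitespace as one separator
text = "one\ttwo   three\nfour"
print(text.split())       # ['one', 'two', 'three', 'four']
print(len(text.split()))  # 4
```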
However, this approach loads the entire file into memory at once, so it is only suitable for files that are reasonably small (typically under 100 MB).
Line-by-Line Processing for Large Files
For files that may be hundreds of megabytes or larger, processing one line at a time keeps memory usage constant regardless of file size:
word_count = 0

with open("large_file.txt", "r", encoding="utf-8") as file:
    for line in file:
        word_count += len(line.split())

print(f"Word count: {word_count}")
Example output:
Word count: 1847293
Python's file iterator reads one line at a time from disk, so only one line is held in memory at any given moment. This approach can handle files of any size without risk of memory exhaustion.
Understanding the Limitation of .split()
The basic .split() method treats punctuation as part of the word it is attached to. This means "hello," and "hello" are considered different tokens:
text = 'Hello, world! Hello world.'
words = text.split()
print(words)
print(f"Word count: {len(words)}")
Output:
['Hello,', 'world!', 'Hello', 'world.']
Word count: 4
While the word count of 4 is correct in this case, the individual tokens still carry punctuation ("Hello," vs "Hello"). This matters when you also need to track unique words or word frequencies, since the same word with and without punctuation would be counted as two different entries.
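A quick sketch makes the frequency problem concrete: feeding the raw .split() tokens into collections.Counter records "Hello," and "Hello" as separate entries, even though a human reader would say "Hello" appears twice.

```python
from collections import Counter

text = "Hello, world! Hello world."
freq = Counter(text.split())
print(freq["Hello"])   # 1 -- the comma-bearing token is a separate key
print(freq["Hello,"])  # 1
```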
Accurate Counting with Regular Expressions
To extract only actual words without attached punctuation, use a regular expression:
import re

word_count = 0
pattern = re.compile(r"\w+")

with open("book.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line)
        word_count += len(words)

print(f"Word count: {word_count}")
Example output:
Word count: 52481
The \w+ pattern matches sequences of word characters (letters, digits, and underscores), effectively stripping punctuation from every match. Compiling the pattern once with re.compile() avoids recompiling it on every iteration, which improves performance when processing many lines.
The \w+ pattern treats "don't" as two separate words: "don" and "t". If you want to count contractions as single words, use a pattern that includes apostrophes:
pattern = re.compile(r"\b[\w']+\b")
This counts "don't" and "it's" as one word each.
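For example, applied to a throwaway sentence, the apostrophe-aware pattern keeps contractions intact:

```python
import re

# \b[\w']+\b allows apostrophes inside a word, so contractions stay whole
pattern = re.compile(r"\b[\w']+\b")
print(pattern.findall("don't stop, it's fine"))
# ["don't", 'stop', "it's", 'fine']
```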
Handling Contractions and Hyphenated Words
For the most natural word counting that preserves both contractions and hyphenated compounds, use a more specific pattern:
import re

word_count = 0
pattern = re.compile(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?(?:-[a-zA-Z]+)*\b")

with open("text.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line)
        word_count += len(words)

print(f"Word count: {word_count}")
This pattern counts "don't" as one word and "self-aware" as one word, which aligns more closely with how humans count words in natural text.
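A quick check on a made-up sentence confirms both behaviors:

```python
import re

# Contractions and hyphenated compounds each match as a single token
pattern = re.compile(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?(?:-[a-zA-Z]+)*\b")
print(pattern.findall("She isn't self-aware yet"))
# ['She', "isn't", 'self-aware', 'yet']
```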
Counting Total and Unique Words
To go beyond a simple total count and also analyze word frequencies, combine line-by-line reading with Counter:
import re
from collections import Counter

pattern = re.compile(r"\w+")
word_freq = Counter()

with open("text.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line.lower())
        word_freq.update(words)

print(f"Total words: {sum(word_freq.values())}")
print(f"Unique words: {len(word_freq)}")
print(f"Most common: {word_freq.most_common(5)}")
Example output:
Total words: 8432
Unique words: 1247
Most common: [('the', 489), ('and', 312), ('of', 287), ('to', 241), ('a', 198)]
The .update() method on a Counter object adds counts from new words to existing totals, so you get an accurate frequency map without ever loading the entire file into memory.
Complete Word Statistics Function
The following reusable function returns comprehensive text statistics in a single pass over the file: it validates the file path, handles encoding errors, and computes all metrics in one iteration.
import re
from collections import Counter
from pathlib import Path

def analyze_text_file(filepath):
    """Analyze a text file and return word statistics."""
    pattern = re.compile(r"\w+")
    word_freq = Counter()
    line_count = 0
    char_count = 0

    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {filepath}")

    try:
        with open(path, "r", encoding="utf-8") as file:
            for line in file:
                line_count += 1
                char_count += len(line)
                words = pattern.findall(line.lower())
                word_freq.update(words)
    except UnicodeDecodeError:
        raise ValueError(f"File contains invalid UTF-8 characters: {filepath}")

    return {
        "total_words": sum(word_freq.values()),
        "unique_words": len(word_freq),
        "lines": line_count,
        "characters": char_count,
        "most_common": word_freq.most_common(10),
    }

stats = analyze_text_file("sample.txt")
for key, value in stats.items():
    print(f"{key}: {value}")
Example output:
total_words: 1587
unique_words: 643
lines: 42
characters: 9841
most_common: [('the', 87), ('and', 54), ('to', 41), ...]
A Common Mistake: Forgetting to Specify Encoding
A frequent source of cross-platform bugs is opening a file without specifying the encoding:
# Risky: encoding depends on the operating system
with open("data.txt", "r") as f:
    content = f.read()
On Windows, the default encoding is often cp1252, while on Linux and macOS it is typically utf-8. The same file can produce different word counts or crash with a UnicodeDecodeError depending on the platform. Always specify the encoding explicitly:
# Safe: consistent behavior across all platforms
with open("data.txt", "r", encoding="utf-8") as f:
    content = f.read()
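You can see the failure mode without two machines by decoding the same bytes under both encodings: the UTF-8 bytes for "café" turn into mojibake when interpreted as cp1252, and other byte sequences would raise UnicodeDecodeError outright.

```python
# The two-byte UTF-8 sequence for é becomes two unrelated characters
# when the same bytes are decoded as cp1252.
data = "café".encode("utf-8")
print(data.decode("utf-8"))   # café
print(data.decode("cp1252"))  # cafÃ©
```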
Comparing with the Unix wc Command
To validate your Python word count against the standard wc utility on Unix-like systems:
import subprocess

def compare_with_wc(filepath):
    # Python count
    with open(filepath, "r", encoding="utf-8") as f:
        python_count = sum(len(line.split()) for line in f)

    # Unix wc count
    result = subprocess.run(["wc", "-w", filepath], capture_output=True, text=True)
    wc_count = int(result.stdout.split()[0])

    print(f"Python count: {python_count}")
    print(f"wc count: {wc_count}")
    print(f"Match: {python_count == wc_count}")
The wc command and Python's .split() both use whitespace-based word detection, so their results typically match for plain text files. Differences can occur with files containing unusual whitespace characters or binary data.
Method Comparison
| Method | Accuracy | Memory Usage | Best For |
|---|---|---|---|
| read().split() | Basic | High (entire file) | Small files under 100 MB |
| Line-by-line .split() | Basic | Low (one line) | Large files, simple counting |
| Regex \w+ | High | Low (one line) | Clean word extraction without punctuation |
| Regex with contraction/hyphen support | Highest | Low (one line) | Natural language text |
| Counter with regex | High | Moderate (frequency map) | Word frequency analysis |
Conclusion
- For quick word counts on small files, file.read().split() is the simplest and fastest approach.
- For large files, switch to line-by-line processing to keep memory usage constant.
- When accuracy matters and you need to strip punctuation or handle contractions, use regular expressions with re.findall() to define precisely what counts as a word.
- For comprehensive text analysis that includes frequencies and unique word counts, combine Counter.update() with line-by-line regex extraction to process files of any size in a single pass.
Regardless of which method you choose, always open files with an explicit encoding="utf-8" to ensure consistent behavior across different operating systems.