
How to Count Words in a Text File in Python

Counting words in a text file is one of the most fundamental text processing tasks. Whether you are analyzing a document, processing log files, or building a content management tool, you need a reliable way to determine how many words a file contains. The right approach depends on the size of the file and how precisely you need to define what counts as a "word."

In this guide, you will learn multiple methods for counting words in a text file, from simple whitespace splitting to regex-based extraction that handles contractions and hyphenated words. Each approach includes clear examples, explanations of trade-offs, and guidance on when to use it.

Simple Approach for Small Files

For files that fit comfortably in memory, the simplest method is to read the entire content and use .split():

with open("document.txt", "r", encoding="utf-8") as file:
    content = file.read()

words = content.split()
print(f"Word count: {len(words)}")

Example output:

Word count: 342
Caution: The .split() method without arguments splits on any whitespace, including spaces, tabs, and newlines, and automatically collapses runs of consecutive whitespace. This makes it robust for most plain text files.

However, this approach loads the entire file into memory at once, so it is only suitable for files that are reasonably small (typically under 100 MB).
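If you are unsure whether a file is small enough for a whole-file read, you can check its size up front. A minimal sketch, where the 100 MB cutoff is just the illustrative threshold from above, not a hard rule:

```python
import os

# Illustrative threshold: only read the whole file if it is under ~100 MB
SIZE_LIMIT = 100 * 1024 * 1024

def count_words_small(filepath):
    """Whole-file word count, guarded by a size check."""
    if os.path.getsize(filepath) > SIZE_LIMIT:
        raise ValueError(f"File too large for whole-file read: {filepath}")
    with open(filepath, "r", encoding="utf-8") as file:
        return len(file.read().split())
```

For larger files, fall back to the line-by-line approach shown in the next section.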

Line-by-Line Processing for Large Files

For files that may be hundreds of megabytes or larger, processing one line at a time keeps memory usage constant regardless of file size:

word_count = 0

with open("large_file.txt", "r", encoding="utf-8") as file:
    for line in file:
        word_count += len(line.split())

print(f"Word count: {word_count}")

Example output:

Word count: 1847293
Note: Python's file iterator reads one line at a time from disk, so only one line is held in memory at any given moment. This approach can handle files of any size without risk of memory exhaustion.
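The same line-by-line count can also be written as a single generator expression, which keeps the constant-memory behavior in a compact, reusable form:

```python
def count_words(filepath):
    """Constant-memory word count using a generator expression over the file."""
    with open(filepath, "r", encoding="utf-8") as file:
        # sum() consumes the generator lazily, one line at a time
        return sum(len(line.split()) for line in file)
```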

Understanding the Limitation of .split()

The basic .split() method treats punctuation as part of the word it is attached to. This means "hello," and "hello" are considered different tokens:

text = 'Hello, world! Hello world.'
words = text.split()

print(words)
print(f"Word count: {len(words)}")

Output:

['Hello,', 'world!', 'Hello', 'world.']
Word count: 4

While the word count of 4 is correct in this case, the individual tokens still carry punctuation ("Hello," vs "Hello"). This matters when you also need to track unique words or word frequencies, since the same word with and without punctuation would be counted as two different entries.
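To see the distortion directly, look at the distinct tokens rather than the total count:

```python
text = "Hello, world! Hello world."
tokens = text.split()

print(sorted(set(tokens)))
# ['Hello', 'Hello,', 'world!', 'world.']
print(f"Distinct tokens: {len(set(tokens))}")
# Distinct tokens: 4  (even though only 2 distinct words appear)
```

A frequency analysis built on these tokens would report four unique "words" instead of two.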

Accurate Counting with Regular Expressions

To extract only actual words without attached punctuation, use a regular expression:

import re

word_count = 0
pattern = re.compile(r"\w+")

with open("book.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line)
        word_count += len(words)

print(f"Word count: {word_count}")

Example output:

Word count: 52481

The \w+ pattern matches sequences of word characters (letters, digits, and underscores), effectively stripping punctuation from every match. Compiling the pattern once with re.compile() avoids recompiling it on every iteration, which improves performance when processing many lines.

Handling Contractions

The \w+ pattern treats "don't" as two separate words: "don" and "t". If you want to count contractions as single words, use a pattern that includes apostrophes:

pattern = re.compile(r"\b[\w']+\b")

This counts "don't" and "it's" as one word each.
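A quick check of this pattern on a sample sentence:

```python
import re

pattern = re.compile(r"\b[\w']+\b")
print(pattern.findall("Don't stop, it's fine"))
# ["Don't", 'stop', "it's", 'fine']
```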

Handling Contractions and Hyphenated Words

For the most natural word counting that preserves both contractions and hyphenated compounds, use a more specific pattern:

import re

word_count = 0
pattern = re.compile(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?(?:-[a-zA-Z]+)*\b")

with open("text.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line)
        word_count += len(words)

print(f"Word count: {word_count}")

This pattern counts "don't" as one word and "self-aware" as one word, which aligns more closely with how humans count words in natural text.
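For example, applying the pattern to a sentence that contains both a contraction and hyphenated compounds:

```python
import re

pattern = re.compile(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?(?:-[a-zA-Z]+)*\b")
print(pattern.findall("She's a well-known self-taught engineer"))
# ["She's", 'a', 'well-known', 'self-taught', 'engineer']
```

Note that this pattern only matches ASCII letters, so digits and accented characters are excluded by design; adjust the character classes if your text needs them.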

Counting Total and Unique Words

To go beyond a simple total count and also analyze word frequencies, combine line-by-line reading with Counter:

import re
from collections import Counter

pattern = re.compile(r"\w+")
word_freq = Counter()

with open("text.txt", "r", encoding="utf-8") as file:
    for line in file:
        words = pattern.findall(line.lower())
        word_freq.update(words)

print(f"Total words: {sum(word_freq.values())}")
print(f"Unique words: {len(word_freq)}")
print(f"Most common: {word_freq.most_common(5)}")

Example output:

Total words: 8432
Unique words: 1247
Most common: [('the', 489), ('and', 312), ('of', 287), ('to', 241), ('a', 198)]
Note: The .update() method on a Counter object adds counts from new words to existing totals, so you get an accurate frequency map without ever loading the entire file into memory.

Complete Word Statistics Function

The following reusable function returns comprehensive text statistics in a single pass: it validates the file path, handles encoding errors, and computes all metrics in one iteration over the file.

import re
from collections import Counter
from pathlib import Path

def analyze_text_file(filepath):
    """Analyze a text file and return word statistics."""
    pattern = re.compile(r"\w+")
    word_freq = Counter()
    line_count = 0
    char_count = 0

    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {filepath}")

    try:
        with open(path, "r", encoding="utf-8") as file:
            for line in file:
                line_count += 1
                char_count += len(line)
                words = pattern.findall(line.lower())
                word_freq.update(words)
    except UnicodeDecodeError:
        raise ValueError(f"File contains invalid UTF-8 characters: {filepath}")

    return {
        "total_words": sum(word_freq.values()),
        "unique_words": len(word_freq),
        "lines": line_count,
        "characters": char_count,
        "most_common": word_freq.most_common(10),
    }

stats = analyze_text_file("sample.txt")
for key, value in stats.items():
    print(f"{key}: {value}")

Example output:

total_words: 1587
unique_words: 643
lines: 42
characters: 9841
most_common: [('the', 87), ('and', 54), ('to', 41), ...]

A Common Mistake: Forgetting to Specify Encoding

A frequent source of cross-platform bugs is opening a file without specifying the encoding:

# Risky: encoding depends on the operating system
with open("data.txt", "r") as f:
    content = f.read()

On Windows, the default encoding is often cp1252, while on Linux and macOS it is typically utf-8. The same file can produce different word counts or crash with a UnicodeDecodeError depending on the platform. Always specify the encoding explicitly:

# Safe: consistent behavior across all platforms
with open("data.txt", "r", encoding="utf-8") as f:
    content = f.read()

Comparing with the Unix wc Command

To validate your Python word count against the standard wc utility on Unix-like systems:

import subprocess

def compare_with_wc(filepath):
    # Python count
    with open(filepath, "r", encoding="utf-8") as f:
        python_count = sum(len(line.split()) for line in f)

    # Unix wc count
    result = subprocess.run(["wc", "-w", filepath], capture_output=True, text=True)
    wc_count = int(result.stdout.split()[0])

    print(f"Python count: {python_count}")
    print(f"wc count: {wc_count}")
    print(f"Match: {python_count == wc_count}")
Note: The wc command and Python's .split() both use whitespace-based word detection, so their results typically match for plain text files. Differences can occur with files containing unusual whitespace characters or binary data.

Method Comparison

Method                                | Accuracy | Memory Usage             | Best For
--------------------------------------|----------|--------------------------|-------------------------------------------
read().split()                        | Basic    | High (entire file)       | Small files under 100 MB
Line-by-line .split()                 | Basic    | Low (one line)           | Large files, simple counting
Regex \w+                             | High     | Low (one line)           | Clean word extraction without punctuation
Regex with contraction/hyphen support | Highest  | Low (one line)           | Natural language text
Counter with regex                    | High     | Moderate (frequency map) | Word frequency analysis

Conclusion

  • For quick word counts on small files, file.read().split() is the simplest and fastest approach.

  • For large files, switch to line-by-line processing to keep memory usage constant.
  • When accuracy matters and you need to strip punctuation or handle contractions, use regular expressions with re.findall() to define precisely what counts as a word.
  • For comprehensive text analysis that includes frequencies and unique word counts, combine Counter.update() with line-by-line regex extraction to process files of any size in a single pass.

Regardless of which method you choose, always open files with an explicit encoding="utf-8" to ensure consistent behavior across different operating systems.