
How to Implement Text Tokenization using Regex in Python

Text tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. It is a fundamental step in building compilers, interpreters, and Natural Language Processing (NLP) systems.

While libraries like NLTK exist, building a tokenizer from scratch using Python's re (Regular Expression) module gives you complete control over defining your own grammar and rules.

This guide demonstrates how to build a lexer that categorizes parts of an equation (variables, numbers, operators) into structured tokens.

Defining the Token Structure

A token typically consists of a Type (what is it?) and a Value (what is the actual text?). Python's namedtuple is an excellent choice for this data structure because it is lightweight and immutable.

from collections import namedtuple

# Define a class-like structure for Tokens
Token = namedtuple("Token", ["type", "value"])

# Example instantiation
t = Token("NUM", "42")
print(f"Type: {t.type}, Value: {t.value}")

Output:

Type: NUM, Value: 42

Designing Regex Patterns

To tokenize text, we need to define patterns for every valid element in our language. We map specific names (like NUM or ADD) to regex strings.

import re

# Dictionary mapping Token Names to Regex Patterns
token_specification = {
    "NAME": r"[a-zA-Z_][a-zA-Z_0-9]*",  # Variable names
    "NUM": r"\d+",                      # Integers
    "ADD": r"\+",                       # Plus sign
    "SUB": r"-",                        # Minus sign
    "MUL": r"\*",                       # Multiply
    "DIV": r"/",                        # Divide
    "EQ": r"=",                         # Assignment
    "WS": r"\s+",                       # Whitespace
}
note

Order matters in the master regex! If your grammar had patterns for both == and =, you would have to list == first; otherwise = would match only the first character of == prematurely. (Since Python 3.7, dicts preserve insertion order, so the order of token_specification is the order the patterns appear in the master regex.)
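
To see why, here is a minimal sketch (EQEQ is a hypothetical name for an equality token, not part of the specification above): listing == first keeps it as one token, while the reversed order splits it in two.

import re

# Hypothetical master regexes combining an equality token ('==') and an assignment token ('=')
correct = re.compile(r"(?P<EQEQ>==)|(?P<EQ>=)")  # longer pattern listed first
wrong = re.compile(r"(?P<EQ>=)|(?P<EQEQ>==)")    # shorter pattern listed first

print([m.lastgroup for m in correct.finditer("a == b")])  # ['EQEQ']
print([m.lastgroup for m in wrong.finditer("a == b")])    # ['EQ', 'EQ'] -- '==' gets split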

Implementing the Tokenizer Logic

We need a function that iterates through the text and matches these patterns.

The Naive vs. Efficient Approach

⛔️ The Incorrect/Inefficient Way: Iterating through the text character by character or relying on a simple .split() fails to separate operators from names and numbers when there are no spaces between them.

text = "val=1+2"
# Splitting by space fails here because there are no spaces
tokens = text.split()
# Result: ['val=1+2'] -> This is just one chunk, not tokenized!

✅ The Correct Way: We compile all patterns into a single Master Regex using Named Groups ((?P<NAME>pattern)). Python's re.finditer scans the text and returns match objects, allowing us to identify which group matched.

import re
from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])

def generate_tokens(text):
    # 1. Define patterns
    token_specification = {
        "NAME": r"[a-zA-Z_][a-zA-Z_0-9]*",
        "NUM": r"\d+",
        "ADD": r"\+",
        "MUL": r"\*",
        "EQ": r"=",
        "WS": r"\s+",
    }

    # 2. Join patterns into one master regex: (?P<NAME>...)|(?P<NUM>...)|...
    regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification.items())

    # 3. Iterate over matches
    scanner = re.finditer(regex, text)

    for m in scanner:
        token_type = m.lastgroup  # Gets the name of the matching group (e.g., 'NUM')
        token_value = m.group()   # Gets the actual text matched (e.g., '10')

        # We might want to skip whitespace depending on the use case
        # if token_type == 'WS': continue

        yield Token(token_type, token_value)
tip

The (?P<name>...) syntax is a regex extension that assigns a name to a capture group. m.lastgroup retrieves this name, allowing us to dynamically determine the token type.
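
For example, a minimal standalone snippet (using the same group names as above) shows lastgroup and group() in action:

import re

m = re.match(r"(?P<NUM>\d+)|(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)", "42")
print(m.lastgroup)  # 'NUM' -> the name of the group that matched
print(m.group())    # '42'  -> the text it matched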

Testing and Output

Let's run the tokenizer on a sample string containing variables, numbers, and operators.

if __name__ == "__main__":
    sample_text = "total = 1 + 2 * 3"

    # Generate and convert generator to list for printing
    tokens = list(generate_tokens(sample_text))

    # Pretty print results
    for t in tokens:
        print(t)

Output:

Token(type='NAME', value='total')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='1')
Token(type='WS', value=' ')
Token(type='ADD', value='+')
Token(type='WS', value=' ')
Token(type='NUM', value='2')
Token(type='WS', value=' ')
Token(type='MUL', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='3')
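
The whitespace tokens appear because the WS skip is commented out inside generate_tokens. If you do not need them, un-comment that check, or filter at the call site, for example:

# Drop whitespace tokens when consuming the generator
tokens = [t for t in generate_tokens(sample_text) if t.type != "WS"]
for t in tokens:
    print(t)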

Conclusion

Building a tokenizer in Python comes down to using the re module effectively.

  1. Define Rules: Map your token names to specific regex patterns.
  2. Named Groups: Combine patterns using (?P<NAME>pattern) syntax.
  3. Scan with finditer: Use re.finditer to scan the text linearly and extract tokens based on the matched group name.

This approach creates a robust foundation for building parsers or interpreters.