How to Find the Longest Consecutive Substring from a String in Python

Extracting the longest sequence of letters or digits from a mixed string is a common text processing task. Whether you're parsing log files, validating input, or cleaning data, Python offers several approaches ranging from elegant regex patterns to efficient manual algorithms.

Problem Overview

Given a string containing mixed characters, the goal is to identify the longest uninterrupted sequence of alphabetic characters and the longest sequence of numeric digits.

input_text = "abc123XYZdef45gh6789"

# Expected output:
# Longest letters: "XYZ" or "def" or "abc" (all length 3)
# Longest digits: "6789" (length 4)

Using Regular Expressions

The re module provides the most readable solution for pattern matching tasks.

import re

def find_longest_sequences(text):
    """Find longest consecutive letter and digit sequences."""
    
    # Find all groups of consecutive letters
    letter_groups = re.findall(r'[a-zA-Z]+', text)
    
    # Find all groups of consecutive digits
    digit_groups = re.findall(r'\d+', text)
    
    # Get the longest from each category
    longest_letters = max(letter_groups, key=len, default='')
    longest_digits = max(digit_groups, key=len, default='')
    
    return longest_letters, longest_digits


# Example usage
text = "abc123XYZdef45gh6789"
letters, digits = find_longest_sequences(text)

print(f"Longest letters: '{letters}' (length {len(letters)})")
print(f"Longest digits: '{digits}' (length {len(digits)})")

Output:

Longest letters: 'XYZdef' (length 6)
Longest digits: '6789' (length 4)

tip

The default='' parameter in max() prevents a ValueError when the input string contains no matches. Always include it when processing user input.

Finding All Matches of Maximum Length

When multiple sequences share the maximum length, you might want all of them:

import re

def find_all_longest(text, pattern):
    """Return all sequences matching the maximum length."""
    matches = re.findall(pattern, text)
    
    if not matches:
        return []
    
    max_length = len(max(matches, key=len))
    return [m for m in matches if len(m) == max_length]


text = "abc123XYZdef45gh6789"

longest_letters = find_all_longest(text, r'[a-zA-Z]+')
longest_digits = find_all_longest(text, r'\d+')

print(f"All longest letter sequences: {longest_letters}")
print(f"All longest digit sequences: {longest_digits}")

Output:

All longest letter sequences: ['XYZdef']
All longest digit sequences: ['6789']

Manual Single-Pass Algorithm

When working in environments where imports should be minimized, or when processing extremely large strings, a manual approach offers more control.

def longest_consecutive(s):
    """
    Find longest letter and digit sequences in a single pass.
    
    Time Complexity: O(n)
    Space Complexity: O(k) where k is the length of longest sequence
    """
    max_letters = ""
    max_digits = ""
    current_letters = ""
    current_digits = ""
    
    for char in s:
        if char.isalpha():
            current_letters += char
            current_digits = ""  # Reset digit tracking
            
            if len(current_letters) > len(max_letters):
                max_letters = current_letters
                
        elif char.isdigit():
            current_digits += char
            current_letters = ""  # Reset letter tracking
            
            if len(current_digits) > len(max_digits):
                max_digits = current_digits
                
        else:
            # Non-alphanumeric character breaks both sequences
            current_letters = ""
            current_digits = ""
    
    return max_letters, max_digits


# Example usage
text = "hello_world123test456789end"
letters, digits = longest_consecutive(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'hello'
Longest digits: '456789'

info

The manual approach processes the string exactly once, making it ideal for streaming data or memory-constrained environments where you cannot store all matches simultaneously.

Optimized Version Using List Building

String concatenation in loops can be inefficient. This version uses list building for better performance with very long sequences:

def longest_consecutive_optimized(s):
    """Memory-efficient version using character lists."""
    max_letters = []
    max_digits = []
    current_letters = []
    current_digits = []
    
    for char in s:
        if char.isalpha():
            current_letters.append(char)
            if current_digits:
                current_digits = []
            
            if len(current_letters) > len(max_letters):
                max_letters = current_letters.copy()
                
        elif char.isdigit():
            current_digits.append(char)
            if current_letters:
                current_letters = []
            
            if len(current_digits) > len(max_digits):
                max_digits = current_digits.copy()
                
        else:
            current_letters = []
            current_digits = []
    
    return ''.join(max_letters), ''.join(max_digits)

Using itertools.groupby

For a more functional approach, groupby elegantly segments the string:

from itertools import groupby

def longest_with_groupby(text):
    """Use groupby to segment string by character type."""
    
    def char_type(c):
        if c.isalpha():
            return 'letter'
        elif c.isdigit():
            return 'digit'
        return 'other'
    
    longest_letters = ""
    longest_digits = ""
    
    for key, group in groupby(text, key=char_type):
        segment = ''.join(group)
        
        if key == 'letter' and len(segment) > len(longest_letters):
            longest_letters = segment
        elif key == 'digit' and len(segment) > len(longest_digits):
            longest_digits = segment
    
    return longest_letters, longest_digits


text = "test123_example4567_data89"
letters, digits = longest_with_groupby(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'example'
Longest digits: '4567'

Handling Unicode Characters

For international text, consider Unicode letter categories:

import re

def longest_unicode_sequences(text):
    """Handle Unicode letters and digits."""
    
    # \w matches Unicode word characters, subtract digits and underscore
    # Using Unicode categories for broader support
    letter_pattern = r'[^\W\d_]+'  # Letters only
    digit_pattern = r'\d+'
    
    letters = re.findall(letter_pattern, text, re.UNICODE)
    digits = re.findall(digit_pattern, text)
    
    return (
        max(letters, key=len, default=''),
        max(digits, key=len, default='')
    )


text = "héllo123wörld456日本語789"
letters, digits = longest_unicode_sequences(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'héllo'
Longest digits: '123'

warning

Unicode handling varies between Python versions and regex engines. Test thoroughly with your expected input character sets.

Method Comparison

Method	Time Complexity	Space Complexity	Best For
Regex	O(n)	O(n)	Readability, most use cases
Manual	O(n)	O(k)	Memory constraints, streaming
groupby	O(n)	O(n)	Functional programming style
Unicode regex	O(n)	O(n)	International text

Choose regex for typical applications where code clarity matters. Switch to the manual approach when processing very large files or when you need precise memory control.

Problem Overview​

Using Regular Expressions​

Finding All Matches of Maximum Length​

Manual Single-Pass Algorithm​

Optimized Version Using List Building​

Using itertools.groupby​

Handling Unicode Characters​

Method Comparison​

Table of Contents

Problem Overview

Using Regular Expressions

Finding All Matches of Maximum Length

Manual Single-Pass Algorithm

Optimized Version Using List Building

Using itertools.groupby

Handling Unicode Characters

Method Comparison