Skip to main content

How to Find the Longest Consecutive Substring from a String in Python

Extracting the longest sequence of letters or digits from a mixed string is a common text processing task. Whether you're parsing log files, validating input, or cleaning data, Python offers several approaches ranging from elegant regex patterns to efficient manual algorithms.

Problem Overview

Given a string containing mixed characters, the goal is to identify the longest uninterrupted sequence of alphabetic characters and the longest sequence of numeric digits.

input_text = "abc123XYZdef45gh6789"

# Expected output:
# Longest letters: "XYZ" or "def" or "abc" (all length 3)
# Longest digits: "6789" (length 4)

Using Regular Expressions

The re module provides the most readable solution for pattern matching tasks.

import re

def find_longest_sequences(text):
"""Find longest consecutive letter and digit sequences."""

# Find all groups of consecutive letters
letter_groups = re.findall(r'[a-zA-Z]+', text)

# Find all groups of consecutive digits
digit_groups = re.findall(r'\d+', text)

# Get the longest from each category
longest_letters = max(letter_groups, key=len, default='')
longest_digits = max(digit_groups, key=len, default='')

return longest_letters, longest_digits


# Example usage
text = "abc123XYZdef45gh6789"
letters, digits = find_longest_sequences(text)

print(f"Longest letters: '{letters}' (length {len(letters)})")
print(f"Longest digits: '{digits}' (length {len(digits)})")

Output:

Longest letters: 'XYZdef' (length 6)
Longest digits: '6789' (length 4)
tip

The default='' parameter in max() prevents a ValueError when the input string contains no matches. Always include it when processing user input.

Finding All Matches of Maximum Length

When multiple sequences share the maximum length, you might want all of them:

import re

def find_all_longest(text, pattern):
"""Return all sequences matching the maximum length."""
matches = re.findall(pattern, text)

if not matches:
return []

max_length = len(max(matches, key=len))
return [m for m in matches if len(m) == max_length]


text = "abc123XYZdef45gh6789"

longest_letters = find_all_longest(text, r'[a-zA-Z]+')
longest_digits = find_all_longest(text, r'\d+')

print(f"All longest letter sequences: {longest_letters}")
print(f"All longest digit sequences: {longest_digits}")

Output:

All longest letter sequences: ['XYZdef']
All longest digit sequences: ['6789']

Manual Single-Pass Algorithm

When working in environments where imports should be minimized, or when processing extremely large strings, a manual approach offers more control.

def longest_consecutive(s):
"""
Find longest letter and digit sequences in a single pass.

Time Complexity: O(n)
Space Complexity: O(k) where k is the length of longest sequence
"""
max_letters = ""
max_digits = ""
current_letters = ""
current_digits = ""

for char in s:
if char.isalpha():
current_letters += char
current_digits = "" # Reset digit tracking

if len(current_letters) > len(max_letters):
max_letters = current_letters

elif char.isdigit():
current_digits += char
current_letters = "" # Reset letter tracking

if len(current_digits) > len(max_digits):
max_digits = current_digits

else:
# Non-alphanumeric character breaks both sequences
current_letters = ""
current_digits = ""

return max_letters, max_digits


# Example usage
text = "hello_world123test456789end"
letters, digits = longest_consecutive(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'hello'
Longest digits: '456789'
info

The manual approach processes the string exactly once, making it ideal for streaming data or memory-constrained environments where you cannot store all matches simultaneously.

Optimized Version Using List Building

String concatenation in loops can be inefficient. This version uses list building for better performance with very long sequences:

def longest_consecutive_optimized(s):
"""Memory-efficient version using character lists."""
max_letters = []
max_digits = []
current_letters = []
current_digits = []

for char in s:
if char.isalpha():
current_letters.append(char)
if current_digits:
current_digits = []

if len(current_letters) > len(max_letters):
max_letters = current_letters.copy()

elif char.isdigit():
current_digits.append(char)
if current_letters:
current_letters = []

if len(current_digits) > len(max_digits):
max_digits = current_digits.copy()

else:
current_letters = []
current_digits = []

return ''.join(max_letters), ''.join(max_digits)

Using itertools.groupby

For a more functional approach, groupby elegantly segments the string:

from itertools import groupby

def longest_with_groupby(text):
"""Use groupby to segment string by character type."""

def char_type(c):
if c.isalpha():
return 'letter'
elif c.isdigit():
return 'digit'
return 'other'

longest_letters = ""
longest_digits = ""

for key, group in groupby(text, key=char_type):
segment = ''.join(group)

if key == 'letter' and len(segment) > len(longest_letters):
longest_letters = segment
elif key == 'digit' and len(segment) > len(longest_digits):
longest_digits = segment

return longest_letters, longest_digits


text = "test123_example4567_data89"
letters, digits = longest_with_groupby(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'example'
Longest digits: '4567'

Handling Unicode Characters

For international text, consider Unicode letter categories:

import re

def longest_unicode_sequences(text):
"""Handle Unicode letters and digits."""

# \w matches Unicode word characters, subtract digits and underscore
# Using Unicode categories for broader support
letter_pattern = r'[^\W\d_]+' # Letters only
digit_pattern = r'\d+'

letters = re.findall(letter_pattern, text, re.UNICODE)
digits = re.findall(digit_pattern, text)

return (
max(letters, key=len, default=''),
max(digits, key=len, default='')
)


text = "héllo123wörld456日本語789"
letters, digits = longest_unicode_sequences(text)

print(f"Longest letters: '{letters}'")
print(f"Longest digits: '{digits}'")

Output:

Longest letters: 'héllo'
Longest digits: '123'
caution

Unicode handling varies between Python versions and regex engines. Test thoroughly with your expected input character sets.

Method Comparison

MethodTime ComplexitySpace ComplexityBest For
RegexO(n)O(n)Readability, most use cases
ManualO(n)O(k)Memory constraints, streaming
groupbyO(n)O(n)Functional programming style
Unicode regexO(n)O(n)International text

Choose regex for typical applications where code clarity matters. Switch to the manual approach when processing very large files or when you need precise memory control.