How to Find All Overlapping Substrings of a String in Python
When working with strings in Python, you may need to extract every possible overlapping substring grouped by length. This is useful in text analysis, pattern matching, bioinformatics (e.g., analyzing DNA sequences), and computational linguistics where sliding window operations over strings are required.
In this guide, you'll learn how to generate all combinations of overlapping substrings from a string, organized by substring length, with clear explanations and practical examples.
Understanding the Problemā
Given a string, the goal is to produce a list of lists where each inner list contains all overlapping substrings of a specific length. The substrings are generated using a sliding window approach.
For example, with the string "ABCDE":
| Window Size | Substrings |
|---|---|
| 0 | ['', '', '', '', '', ''] |
| 1 | ['A', 'B', 'C', 'D', 'E'] |
| 2 | ['AB', 'BC', 'CD', 'DE'] |
| 3 | ['ABC', 'BCD', 'CDE'] |
| 4 | ['ABCD', 'BCDE'] |
| 5 | ['ABCDE'] |
Notice that as the window size increases, the number of substrings decreases. For a string of length n and window size w, there are n - w + 1 substrings.
Using List Comprehension with String Slicingā
The most straightforward approach combines a for loop to control the window size and list comprehension with slicing to extract substrings at each size.
text = "ABCDEFG"
result = []
for size in range(len(text) + 1):
substrings = [text[i:i + size] for i in range(len(text) - size + 1)]
result.append(substrings)
print(result)
Output:
[['', '', '', '', '', '', '', ''], ['A', 'B', 'C', 'D', 'E', 'F', 'G'], ['AB', 'BC', 'CD', 'DE', 'EF', 'FG'], ['ABC', 'BCD', 'CDE', 'DEF', 'EFG'], ['ABCD', 'BCDE', 'CDEF', 'DEFG'], ['ABCDE', 'BCDEF', 'CDEFG'], ['ABCDEF', 'BCDEFG'], ['ABCDEFG']]
How it works:
- The outer loop iterates
sizefrom0to the length of the string (inclusive). - For each
size, the list comprehension slides a window across the string:text[i:i + size]extracts a substring of lengthsizestarting at indexi. range(len(text) - size + 1)ensures the window doesn't extend past the end of the string.- Each group of same-length substrings is appended to the result list.
Using Nested List Comprehensionā
You can condense the entire operation into a single nested list comprehension for a more compact solution:
text = "ABCDEFG"
result = [
[text[i:i + size] for i in range(len(text) - size + 1)]
for size in range(len(text) + 1)
]
print(result)
Output:
[['', '', '', '', '', '', '', ''], ['A', 'B', 'C', 'D', 'E', 'F', 'G'], ['AB', 'BC', 'CD', 'DE', 'EF', 'FG'], ['ABC', 'BCD', 'CDE', 'DEF', 'EFG'], ['ABCD', 'BCDE', 'CDEF', 'DEFG'], ['ABCDE', 'BCDEF', 'CDEFG'], ['ABCDEF', 'BCDEFG'], ['ABCDEFG']]
This produces the same result in fewer lines. The outer comprehension controls the window size, and the inner comprehension generates the substrings.
Use the explicit loop when you need to add extra logic (like filtering or transforming substrings). Use the nested comprehension when brevity matters and the logic is straightforward.
Excluding Empty Substringsā
The window size of 0 produces a list of empty strings, which is often not useful. You can skip it by starting the range at 1:
text = "ABCDE"
result = [
[text[i:i + size] for i in range(len(text) - size + 1)]
for size in range(1, len(text) + 1)
]
for group in result:
print(group)
Output:
['A', 'B', 'C', 'D', 'E']
['AB', 'BC', 'CD', 'DE']
['ABC', 'BCD', 'CDE']
['ABCD', 'BCDE']
['ABCDE']
Creating a Reusable Functionā
For cleaner, reusable code, wrap the logic in a function with options to include or exclude empty substrings:
def get_overlapping_substrings(text: str, include_empty: bool = False) -> list[list[str]]:
"""Generate all overlapping substrings grouped by length.
Args:
text: The input string.
include_empty: Whether to include the group of empty strings (size 0).
Returns:
A list of lists, where each inner list contains substrings of the same length.
"""
start = 0 if include_empty else 1
return [
[text[i:i + size] for i in range(len(text) - size + 1)]
for size in range(start, len(text) + 1)
]
# Usage
result = get_overlapping_substrings("Code")
for group in result:
print(group)
Output:
['C', 'o', 'd', 'e']
['Co', 'od', 'de']
['Cod', 'ode']
['Code']
Generating a Flat List of All Substringsā
If you need all overlapping substrings in a single flat list instead of grouped by length, you can flatten the result:
text = "Code"
all_substrings = [
text[i:i + size]
for size in range(1, len(text) + 1)
for i in range(len(text) - size + 1)
]
print(all_substrings)
Output:
['C', 'o', 'd', 'e', 'Co', 'od', 'de', 'Cod', 'ode', 'Code']
For a string of length n, the total number of non-empty substrings is n Ć (n + 1) / 2. For "Code" (length 4), that's 4 Ć 5 / 2 = 10 substrings.
Common Mistake: Off-by-One Errors in Range Boundariesā
A frequent mistake is using incorrect range boundaries, which either misses the full-length substring or causes index errors.
Wrong approach - missing the full string:
text = "Code"
# Using range(len(text)) instead of range(len(text) + 1)
result = [
[text[i:i + size] for i in range(len(text) - size + 1)]
for size in range(1, len(text)) # Missing size == len(text)
]
for group in result:
print(group)
Output:
['C', 'o', 'd', 'e']
['Co', 'od', 'de']
['Cod', 'ode']
The full string "Code" is missing because range(1, len(text)) stops at 3, never reaching the window size of 4.
Correct approach - use len(text) + 1:
text = "Code"
result = [
[text[i:i + size] for i in range(len(text) - size + 1)]
for size in range(1, len(text) + 1) # Includes the full string
]
for group in result:
print(group)
Output:
['C', 'o', 'd', 'e']
['Co', 'od', 'de']
['Cod', 'ode']
['Code']
Performance Considerationsā
The algorithm has a time complexity of O(n²) because for each window size (up to n), it generates a decreasing number of substrings. The total work is proportional to n + (n-1) + (n-2) + ... + 1 = n(n+1)/2.
| String Length | Total Substrings | Approximate Operations |
|---|---|---|
| 10 | 55 | ~55 |
| 100 | 5,050 | ~5,050 |
| 1,000 | 500,500 | ~500,500 |
| 10,000 | 50,005,000 | ~50 million |
Each substring is a new string object in Python. For very long strings, generating all overlapping substrings can consume significant memory. Consider using generators instead of lists if you're processing substrings one at a time:
def overlapping_substrings_gen(text):
for size in range(1, len(text) + 1):
for i in range(len(text) - size + 1):
yield text[i:i + size]
# Process substrings one at a time without storing all in memory
for substr in overlapping_substrings_gen("LargeString"):
pass # process each substring
Conclusionā
Generating all overlapping substrings of a string in Python is straightforward with list comprehension and string slicing:
- Use a loop with list comprehension for clarity and the ability to add extra logic.
- Use nested list comprehension for a concise one-liner.
- Start the range at
1instead of0to exclude empty substrings. - Use a generator for memory-efficient processing of large strings.
- Watch out for off-by-one errors - use
range(len(text) + 1)to include the full string.
The algorithm runs in O(n²) time, which is efficient for typical string lengths but should be used carefully with very long strings.