How to Extract and Sort Usernames from Text in Python
Extracting usernames (mentions) from text is a common task in social media analysis and natural language processing. Typically, a username is defined as a sequence of alphanumeric characters or underscores immediately following an @ symbol.
This guide explains how to parse a string to extract these usernames, filter out invalid entries, and sort them by frequency of occurrence without using regular expressions (the re module is powerful, but understanding manual string parsing is a crucial fundamental skill).
The Logic of Extraction
To correctly extract usernames, our algorithm must follow these rules:
- Locate: Find the @ symbol.
- Validate: Check the characters immediately following it. A valid username character is usually alphanumeric (a-z, 0-9) or an underscore (_).
- Stop: Stop capturing characters as soon as an invalid character (like a space, punctuation, or another @) is encountered.
- Filter: Ignore empty matches (e.g., a standalone @ or @!).
- Sort: Remove duplicates from the final list but order them based on how frequently they appeared in the original text.
Step 1: Finding the Position of Mentions
We can traverse the string using Python's string method .find(). This allows us to jump from one @ to the next efficiently.
text = "Hello @User1 and @User2"
at_index = text.find("@")

# Loop through the string, finding all occurrences
while at_index != -1:
    print(f"Found @ at index: {at_index}")
    at_index = text.find("@", at_index + 1)
Output:
Found @ at index: 6
Found @ at index: 17
Step 2: Validating and Extracting Characters
Once an @ is found, we look at the subsequent characters.
text = "@Valid! @Invalid"
at_index = 0

# ⛔️ Incorrect: simple slicing doesn't stop at invalid characters
# username = text[at_index+1:].split()[0]
# This would fail on "@User!" by returning "User!" instead of "User"

# ✅ Correct: character-by-character validation
username = ""
# Start looking one character after the '@'
for char in text[at_index + 1:]:
    if char.isalnum() or char == "_":
        username += char
    else:
        break  # Stop at the first invalid character (like '!' or space)

print(f"Extracted: {username}")
Output:
Extracted: Valid
Step 3: Sorting by Frequency
The requirement is to return a unique list of usernames, but sorted by how often they appeared in the text (most frequent first).
# List with duplicates
all_found = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

# ✅ Correct: deduplicate with set(), but sort using the count from the original list
unique_sorted = sorted(
    set(all_found),
    key=lambda x: all_found.count(x),
    reverse=True
)

print(unique_sorted)
Output:
['tutorialreference', 'test', 'TutorialReference']
In the output above, 'tutorialreference' appears first because it occurred twice in the source list, while 'TutorialReference' and 'test' occurred once each. Note that Python's sorted() is stable, but set() iteration order is arbitrary, so the relative order of usernames with equal counts is not guaranteed. The frequency logic itself holds regardless.
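If you need a deterministic order for ties, one option (a sketch, not part of the original requirement) is to add a secondary sort key, such as the position of each username's first appearance in the list:

```python
# Deterministic tie-break: sort by descending count,
# then by first appearance in the original list
all_found = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

unique_sorted = sorted(
    set(all_found),
    key=lambda x: (-all_found.count(x), all_found.index(x))
)

print(unique_sorted)
# ['tutorialreference', 'TutorialReference', 'test']
```

Negating the count lets both keys sort ascending, so `reverse=True` is no longer needed.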
Complete Code Solution
Here is the robust after_at function that implements the logic: locate, extract, validate, and sort.
def after_at(text):
    """
    Extracts usernames starting with @ from text.
    Valid characters: alphanumeric and underscore.
    Returns a unique list sorted by frequency (descending).
    """
    usernames = []
    at_index = text.find("@")

    # 1. Loop through all '@' occurrences
    while at_index != -1:
        username = ""

        # 2. Extract valid characters immediately following '@'
        #    We slice from at_index + 1 to the end of the string
        for char in text[at_index + 1:]:
            if char.isalnum() or char == "_":
                username += char
            else:
                # Break immediately upon hitting a non-username char
                break

        # 3. Add to the list only if a valid username was captured
        if username:
            usernames.append(username)

        # 4. Find the next '@' starting from the next position
        at_index = text.find("@", at_index + 1)

    # 5. Remove duplicates and sort by count
    #    We use usernames.count on the original list to determine frequency
    result = sorted(
        set(usernames),
        key=lambda x: usernames.count(x),
        reverse=True
    )
    return result
if __name__ == "__main__":
    # Test cases
    print(f"Test 1: {after_at('@TutorialReference @tutorialreference I won in the @ competition')}")
    print(f"Test 2: {after_at('@!TutorialReference @tutorialreference I won in the competition')}")
    print(f"Test 3: {after_at('I won in the competition@')}")
    print(f"Test 4: {after_at('@!@LabETutorialReferencex @tutorialreference I won in the @TutorialReference competition @experiment')}")
Execution Output:
Test 1: ['tutorialreference', 'TutorialReference']
Test 2: ['tutorialreference']
Test 3: []
Test 4: ['tutorialreference', 'TutorialReference', 'experiment', 'LabETutorialReferencex']
For very large texts, the .count() method inside sorted can be slow (quadratic time complexity). Using collections.Counter is a more performant alternative for frequency analysis in production environments.
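The same frequency sort can be sketched with collections.Counter, which counts every username in a single pass instead of calling .count() once per unique name:

```python
from collections import Counter

# The duplicate list from the earlier example
usernames = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

# Counter.most_common() returns (name, count) pairs, most frequent first;
# ties keep first-seen order, so the result is also deterministic
counts = Counter(usernames)
result = [name for name, _ in counts.most_common()]

print(result)
# ['tutorialreference', 'TutorialReference', 'test']
```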
Conclusion
Extracting patterns like usernames manually gives you granular control over what constitutes a "valid" character without relying on complex Regular Expressions.
- Iterate using .find() to locate delimiters (@).
- Validate strictly using .isalnum() loop logic.
- Process the resulting list to filter duplicates and apply custom sorting logic based on frequency.
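For comparison only, here is a sketch of the same extraction using Python's standard-library re module. It assumes the same definition of a username: \w matches alphanumeric characters and underscores, mirroring the manual isalnum()/underscore check (note that \w also matches Unicode word characters, as does isalnum()):

```python
import re
from collections import Counter

def after_at_regex(text):
    # @(\w+) captures one or more word characters after an '@';
    # a bare '@' or '@!' produces no match, so filtering is implicit
    matches = re.findall(r"@(\w+)", text)
    # Unique names, most frequent first (ties keep first-seen order)
    return [name for name, _ in Counter(matches).most_common()]

print(after_at_regex('@TutorialReference @tutorialreference I won in the @ competition'))
# ['TutorialReference', 'tutorialreference']
```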