How to Extract and Sort Usernames from Text in Python
Extracting usernames (mentions) from text is a common task in social media analysis and natural language processing. Typically, a username is defined as a sequence of alphanumeric characters or underscores immediately following an @ symbol.
This guide explains how to parse a string to extract these usernames, filter out invalid entries, and sort them by frequency of occurrence without using regular expressions (the re module is powerful, but understanding manual string parsing is a crucial fundamental skill).
The Logic of Extraction
To correctly extract usernames, our algorithm must follow these rules:
- Locate: Find the @ symbol.
- Validate: Check the characters immediately following it. A valid username character is usually alphanumeric (a-z, 0-9) or an underscore (_).
- Stop: Stop capturing characters as soon as an invalid character (like a space, punctuation, or another @) is encountered.
- Filter: Ignore empty matches (e.g., a standalone @ or @!).
- Sort: Remove duplicates from the final list but order them based on how frequently they appeared in the original text.
Step 1: Finding the Position of Mentions
We can traverse the string using Python's string method .find(). This allows us to jump from one @ to the next efficiently.
text = "Hello @User1 and @User2"
at_index = text.find("@")

# Loop through the string, finding all occurrences
while at_index != -1:
    print(f"Found @ at index: {at_index}")
    at_index = text.find("@", at_index + 1)
Output:
Found @ at index: 6
Found @ at index: 17
Step 2: Validating and Extracting Characters
Once an @ is found, we look at the subsequent characters.
text = "@Valid! @Invalid"
at_index = 0

# ⛔️ Incorrect: simple slicing doesn't stop at invalid characters
# username = text[at_index+1:].split()[0]
# This would fail on "@User!" by returning "User!" instead of "User"

# ✅ Correct: character-by-character validation
username = ""
# Start looking one character after the '@'
for char in text[at_index + 1:]:
    if char.isalnum() or char == "_":
        username += char
    else:
        break  # Stop at the first invalid character (like '!' or space)

print(f"Extracted: {username}")
Output:
Extracted: Valid
Step 3: Sorting by Frequency
The requirement is to return a unique list of usernames, but sorted by how often they appeared in the text (most frequent first).
# List with duplicates
all_found = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

# ✅ Correct: deduplicate with set(), but sort using the count from the original list
unique_sorted = sorted(
    set(all_found),
    key=lambda x: all_found.count(x),
    reverse=True
)

print(unique_sorted)
Output:
['tutorialreference', 'test', 'TutorialReference']
In the output above, 'tutorialreference' appears first because it occurred twice in the source list, while 'TutorialReference' and 'test' occurred once each. Note that Python's sorted() is stable, but set() iteration order is arbitrary, so the relative order of usernames with equal counts is not guaranteed. The frequency logic itself holds regardless.
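If you need a deterministic order for ties, one option (a sketch, not part of the original requirement) is to add a secondary sort key, such as the position of each username's first appearance in the list:

```python
# Deterministic tie-break: sort by descending count,
# then by first appearance in the original list
all_found = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

unique_sorted = sorted(
    set(all_found),
    key=lambda x: (-all_found.count(x), all_found.index(x))
)

print(unique_sorted)
# ['tutorialreference', 'TutorialReference', 'test']
```

Negating the count lets both keys sort ascending, so `reverse=True` is no longer needed.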
Complete Code Solution
Here is the robust after_at function that implements the logic: locate, extract, validate, and sort.
def after_at(text):
    """
    Extracts usernames starting with @ from text.
    Valid characters: alphanumeric and underscore.
    Returns a unique list sorted by frequency (descending).
    """
    usernames = []
    at_index = text.find("@")

    # 1. Loop through all '@' occurrences
    while at_index != -1:
        username = ""

        # 2. Extract valid characters immediately following '@'
        #    We slice from at_index + 1 to the end of the string
        for char in text[at_index + 1:]:
            if char.isalnum() or char == "_":
                username += char
            else:
                # Break immediately upon hitting a non-username char
                break

        # 3. Add to the list only if a valid username was captured
        if username:
            usernames.append(username)

        # 4. Find the next '@' starting from the next position
        at_index = text.find("@", at_index + 1)

    # 5. Remove duplicates and sort by count
    #    We use usernames.count on the original list to determine frequency
    result = sorted(
        set(usernames),
        key=lambda x: usernames.count(x),
        reverse=True
    )
    return result
if __name__ == "__main__":
    # Test cases
    print(f"Test 1: {after_at('@TutorialReference @tutorialreference I won in the @ competition')}")
    print(f"Test 2: {after_at('@!TutorialReference @tutorialreference I won in the competition')}")
    print(f"Test 3: {after_at('I won in the competition@')}")
    print(f"Test 4: {after_at('@!@LabETutorialReferencex @tutorialreference I won in the @TutorialReference competition @experiment')}")
Execution Output:
Test 1: ['tutorialreference', 'TutorialReference']
Test 2: ['tutorialreference']
Test 3: []
Test 4: ['tutorialreference', 'TutorialReference', 'experiment', 'LabETutorialReferencex']
For very large texts, the .count() method inside sorted can be slow (quadratic time complexity). Using collections.Counter is a more performant alternative for frequency analysis in production environments.
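The same frequency sort can be sketched with collections.Counter, which counts every username in a single pass instead of calling .count() once per unique name:

```python
from collections import Counter

# The duplicate list from the earlier example
usernames = ['tutorialreference', 'TutorialReference', 'tutorialreference', 'test']

# Counter.most_common() returns (name, count) pairs, most frequent first;
# ties keep first-seen order, so the result is also deterministic
counts = Counter(usernames)
result = [name for name, _ in counts.most_common()]

print(result)
# ['tutorialreference', 'TutorialReference', 'test']
```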
Conclusion
Extracting patterns like usernames manually gives you granular control over what constitutes a "valid" character without relying on complex Regular Expressions.
- Iterate using .find() to locate delimiters (@).
- Validate strictly using .isalnum() loop logic.
- Process the resulting list to filter duplicates and apply custom sorting logic based on frequency.
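For comparison only, here is a sketch of the same extraction using Python's standard-library re module. It assumes the same definition of a username: \w matches alphanumeric characters and underscores, mirroring the manual isalnum()/underscore check (note that \w also matches Unicode word characters, as does isalnum()):

```python
import re
from collections import Counter

def after_at_regex(text):
    # @(\w+) captures one or more word characters after an '@';
    # a bare '@' or '@!' produces no match, so filtering is implicit
    matches = re.findall(r"@(\w+)", text)
    # Unique names, most frequent first (ties keep first-seen order)
    return [name for name, _ in Counter(matches).most_common()]

print(after_at_regex('@TutorialReference @tutorialreference I won in the @ competition'))
# ['TutorialReference', 'tutorialreference']
```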