How to Remove Characters That Don't Match a Pattern (Regex) in JavaScript
A common data sanitization task is to "whitelist" a set of allowed characters in a string and remove everything else. For example, you might want to strip a string of all characters that are not alphanumeric. A concise and flexible way to achieve this is to call the String.prototype.replace() method with a regular expression that contains a negated character set.
This guide will teach you how to construct a regular expression to remove or replace any characters that don't match a specific pattern, explaining the key components of the regex that make this possible.
The Core Method: replace() with a Negated Character Set
The most direct way to remove unwanted characters is to create a regular expression that matches them and replace them with an empty string.
Problem: you have a string with various symbols, and you want to keep only the standard English letters and numbers.
// Problem: Remove all symbols and non-alphanumeric characters.
let messyString = 'User@_Name!#123-Test';
Solution: this function uses a regular expression to find any character that is not a letter or a number and removes it.
function sanitizeAlphanumeric(str) {
  // This regex finds any character that is NOT a-z or 0-9 (case-insensitive)
  // and replaces it with an empty string.
  return str.replace(/[^a-z0-9]/gi, '');
}
// Example Usage:
let messyString = 'User@_Name!#123-Test';
let cleanString = sanitizeAlphanumeric(messyString);
console.log(cleanString);
Output:
UserName123Test
The Regular Expression Explained
The regex /[^a-z0-9]/gi is the key to this operation. Let's break down its components:
- / ... /: These are the delimiters that mark the beginning and end of the regular expression pattern.
- [...]: This is a character set. It defines a group of characters to match.
- ^: When ^ is the first character inside a character set, it acts as a negation. It inverts the set, matching any character that is not in the list. This is the "don't match" part of the logic.
- a-z0-9: This defines two ranges: all lowercase letters from 'a' to 'z' and all digits from '0' to '9'. These are the characters we want to keep.
- g: This is the global flag. It is crucial. It tells the replace() method to replace all matches it finds, not just the first one.
- i: This is the case-insensitive flag. By including it, our a-z range will also match uppercase letters A-Z.
So, /[^a-z0-9]/gi translates to: "Find every character that is not an English letter (a-z, in either case) or a digit (0-9), and do it for the whole string."
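To see what each flag contributes, the same pattern can be run with and without g and i. This is a small illustrative sketch; the variable names are made up for the comparison.

```javascript
const input = 'User@_Name!#123-Test';

// No flags: only the first character outside a-z0-9 is removed,
// and that happens to be the uppercase "U".
const noFlags = input.replace(/[^a-z0-9]/, '');

// Global only: every character outside a-z0-9 is removed,
// which also strips the uppercase letters.
const globalOnly = input.replace(/[^a-z0-9]/g, '');

// Case-insensitive only: uppercase letters are kept,
// but only the first symbol ("@") is removed.
const insensitiveOnly = input.replace(/[^a-z0-9]/i, '');

// Both flags: all symbols removed, letters of either case kept.
const both = input.replace(/[^a-z0-9]/gi, '');

console.log(noFlags);         // ser@_Name!#123-Test
console.log(globalOnly);      // serame123est
console.log(insensitiveOnly); // User_Name!#123-Test
console.log(both);            // UserName123Test
```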
Customizing the Pattern
How to Preserve Additional Characters (e.g., Spaces, Hyphens)
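To "whitelist" other characters, add them to the negated character set. A literal hyphen should go at the end of the set (or be escaped as \-) so that it is not read as part of a range.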
Solution: let's modify the function to keep letters, numbers, spaces, and hyphens.
function sanitizeWithWhitelist(str) {
  // We added a space and a hyphen to the character set.
  return str.replace(/[^a-z0-9 -]/gi, '');
}
let sentence = 'My file is named "document-123.pdf"!';
let cleanSentence = sanitizeWithWhitelist(sentence);
console.log(cleanSentence);
Output:
My file is named document-123pdf
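One thing to watch when extending the set is where the hyphen sits. Between two characters, a hyphen defines a range, so the allowed set can quietly grow. The sketch below assumes a hypothetical requirement to also keep underscores and compares an accidental range with a literal hyphen.

```javascript
const sentence = 'My file is named "document-123.pdf"!';

// Pitfall: " -_" is read as a range from space (U+0020) to underscore (U+005F).
// That range also covers quotes, periods, exclamation marks, and more,
// so nothing in this sentence is removed.
const accidentalRange = sentence.replace(/[^a-z0-9 -_]/gi, '');

// Escaping the hyphen (or placing it last) makes it a literal character.
const literalHyphen = sentence.replace(/[^a-z0-9 \-_]/gi, '');

console.log(accidentalRange); // My file is named "document-123.pdf"!
console.log(literalHyphen);   // My file is named document-123pdf
```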
Replacing Each Non-Match vs. Groups of Non-Matches (The + Quantifier)
By default, the regex finds and replaces each non-matching character individually. You can add a + quantifier to group consecutive non-matching characters into a single replacement.
Problem: you want to replace any sequence of one or more symbols with a single underscore.
let dirtyData = 'value1!@#value2$%^value3';
Solution: add a + after the character set to match one or more occurrences.
let withSingleUnderscore = dirtyData.replace(/[^a-z0-9]+/gi, '_');
console.log(withSingleUnderscore); // Output: value1_value2_value3
// Compare this to the result without the `+` quantifier:
let withMultipleUnderscores = dirtyData.replace(/[^a-z0-9]/gi, '_');
console.log(withMultipleUnderscores); // Output: value1___value2___value3
Output:
value1_value2_value3
value1___value2___value3
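The + quantifier is also handy when the result should not start or end with the separator. The following is a minimal sketch, not a complete slug implementation; toSlug is a hypothetical helper name, and the hyphen separator and lowercasing are assumptions made for the example.

```javascript
function toSlug(text) {
  return text
    // Collapse each run of disallowed characters into one hyphen.
    .replace(/[^a-z0-9]+/gi, '-')
    // Drop any hyphens left at the very start or end.
    .replace(/^-+|-+$/g, '')
    .toLowerCase();
}

console.log(toSlug('  Hello, World!! 2024  ')); // hello-world-2024
```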
Practical Example: Sanitizing a Username
This is a classic use case. A user is creating a username, but you only want to allow letters, numbers, and the underscore character.
function sanitizeUsername(username) {
  // Allow letters, numbers, and the underscore. Remove everything else.
  return username.replace(/[^a-z0-9_]/gi, '');
}
// Example Usage:
let userInput = ' test_user-123! ';
let sanitizedInput = sanitizeUsername(userInput);
console.log(sanitizedInput); // Output: test_user123
Output:
test_user123
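For this particular whitelist, the built-in \W shorthand can stand in for the explicit set, since without the u flag \W matches any character other than A-Z, a-z, 0-9, and underscore. A short sketch for comparison (sanitizeUsernameShorthand is a hypothetical name):

```javascript
function sanitizeUsernameShorthand(username) {
  // \W matches any character that is not A-Z, a-z, 0-9, or underscore.
  return username.replace(/\W/g, '');
}

const rawInput = ' test_user-123! ';
const viaShorthand = sanitizeUsernameShorthand(rawInput);
const viaExplicitSet = rawInput.replace(/[^a-z0-9_]/gi, '');

console.log(viaShorthand);                    // test_user123
console.log(viaShorthand === viaExplicitSet); // true
```

The explicit set is still the easier form to extend when the allowed characters change.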
Conclusion
The String.prototype.replace() method, combined with a negated character set, provides a powerful and concise way to sanitize strings by removing any characters that don't match a "whitelist."
- The key to the pattern is the caret (^) at the beginning of the character set: [^...].
- Define the characters you want to keep inside the brackets (e.g., [^a-z0-9]).
- Always use the g (global) flag to ensure all non-matching characters are replaced, not just the first.
- Use the i (case-insensitive) flag if you want to treat uppercase and lowercase letters the same.