# How to Extract URLs from a String in Python
Finding links embedded in text is a common task in web scraping, log analysis, and content processing. Python offers several approaches with different accuracy and complexity trade-offs.
## Using Regular Expressions
A regex pattern handles most common URL formats quickly:
```python
import re

text = "Visit https://example.com or check www.blog.org for more info"
pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(pattern, text)
print(urls)
# ['https://example.com', 'www.blog.org']
```
For more comprehensive matching including various protocols:
```python
import re

def extract_urls(text):
    """Extract URLs with common protocols."""
    pattern = r'''
        (?:https?|ftp)://        # Protocol
        (?:[\w-]+\.)+[\w-]+      # Domain
        (?:/[^\s<>"]*)?          # Path (optional)
        |
        www\.[\w-]+\.[\w.-]+     # www URLs
        (?:/[^\s<>"]*)?
    '''
    return re.findall(pattern, text, re.VERBOSE)

text = """
Check https://example.com/path?query=1
Also ftp://files.server.com/data
And www.github.com/user/repo
"""
print(extract_urls(text))
# ['https://example.com/path?query=1', 'ftp://files.server.com/data', 'www.github.com/user/repo']
```
Regex-based extraction has limits: it misses URLs without a scheme or www prefix (example.com), and because it does not validate TLDs, it can match partial strings that merely look like URLs or drag trailing punctuation into the match.
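To make those pitfalls concrete, here is a short sketch (using a hypothetical ok.dev URL) where the simple pattern misses a bare domain and keeps a sentence-ending dot, plus a cheap `rstrip` cleanup:

```python
import re

pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
text = "See example.com and https://ok.dev/x."

matches = re.findall(pattern, text)
print(matches)   # ['https://ok.dev/x.'] -- bare domain missed, trailing dot kept

# Stripping common trailing punctuation is a cheap post-processing fix
cleaned = [m.rstrip('.,;:!?)') for m in matches]
print(cleaned)   # ['https://ok.dev/x']
```

Note that `rstrip` would also mangle rare URLs that legitimately end in these characters, so it is a heuristic, not a guarantee.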
## Using the urlextract Library
For production accuracy, the `urlextract` library handles edge cases including TLD validation:
```bash
pip install urlextract
```
```python
from urlextract import URLExtract

extractor = URLExtract()
text = "Visit example.com, check github.io/path, or email test@mail.com"
urls = extractor.find_urls(text)
print(urls)
# ['example.com', 'github.io/path,']
```
Filter out email addresses if needed:
```python
from urlextract import URLExtract

def extract_urls_only(text):
    """Extract URLs, excluding email addresses."""
    extractor = URLExtract()
    urls = extractor.find_urls(text)
    return [url for url in urls if '@' not in url]

text = "Contact support@example.com or visit example.com"
print(extract_urls_only(text))
# ['example.com']
```
`urlextract` maintains an updated TLD list, correctly identifying modern domains like app.dev, site.io, or blog.pizza that simple regex patterns miss.
## Extracting URLs from HTML
When working with HTML, use BeautifulSoup to extract links properly:
```python
from bs4 import BeautifulSoup

html = """
<html>
<a href="https://example.com">Link 1</a>
<a href="/relative/path">Link 2</a>
<img src="https://cdn.example.com/image.png">
</html>
"""
soup = BeautifulSoup(html, "lxml")

# Extract all href values
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links)
# ['https://example.com', '/relative/path']

# Extract all src values
sources = [img.get('src') for img in soup.find_all('img', src=True)]
print(sources)
# ['https://cdn.example.com/image.png']
```
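href values pulled from HTML are often relative, like /relative/path above. The standard library's `urljoin` resolves them against the page's URL (the base URL below is illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/articles/"

print(urljoin(base, "/relative/path"))  # https://example.com/relative/path
print(urljoin(base, "page.html"))       # https://example.com/articles/page.html
# Absolute URLs pass through unchanged
print(urljoin(base, "https://cdn.example.com/image.png"))
```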
## Validating Extracted URLs
Verify that extracted strings are properly formatted URLs:
```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Check if string is a valid URL with scheme and domain."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

urls = [
    "https://example.com/path",
    "not-a-url",
    "ftp://files.server.com",
    "www.example.com",  # Missing scheme
]

for url in urls:
    print(f"{url}: {is_valid_url(url)}")
```

Output:

```
https://example.com/path: True
not-a-url: False
ftp://files.server.com: True
www.example.com: False
```
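`urlparse` rarely raises; it parses almost any string without complaint, which is why the check above requires both scheme and netloc rather than relying on an exception. A small sketch of what each field catches:

```python
from urllib.parse import urlparse

cases = {
    "https:///path-only": False,  # scheme present, but netloc is empty
    "//example.com/x": False,     # netloc present, but no scheme
    "https://example.com": True,  # both present
}
for url, expected in cases.items():
    parsed = urlparse(url)
    assert bool(parsed.scheme and parsed.netloc) == expected
```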
## Normalizing Extracted URLs
Clean and standardize URLs after extraction:
```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Add scheme if missing and normalize."""
    if not url.startswith(('http://', 'https://', 'ftp://')):
        url = 'https://' + url
    parsed = urlparse(url)
    # Lowercase domain
    normalized = parsed._replace(netloc=parsed.netloc.lower())
    # Remove trailing slash from path
    path = normalized.path.rstrip('/') if normalized.path != '/' else '/'
    normalized = normalized._replace(path=path)
    return urlunparse(normalized)

urls = ["WWW.EXAMPLE.COM", "https://site.com/path/", "example.org"]
for url in urls:
    print(normalize_url(url))
```

Output:

```
https://www.example.com
https://site.com/path
https://example.org
```
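Normalization can go further when URLs will be compared or deduplicated. A sketch (not part of the function above) that drops fragments and sorts query parameters so equivalent URLs compare equal:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(url):
    """Drop the fragment and sort query parameters for stable comparison."""
    parsed = urlparse(url)
    query = urlencode(sorted(parse_qsl(parsed.query)))
    return urlunparse(parsed._replace(query=query, fragment=''))

print(canonicalize("https://site.com/a?b=2&a=1#top"))
# https://site.com/a?a=1&b=2
```

Be aware that parameter order is significant to some servers, so apply this only where you control or know the target site's behavior.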
## Complete Extraction Pipeline
Combine extraction, validation, and normalization:
```python
from urlextract import URLExtract
from urllib.parse import urlparse

def extract_and_validate(text):
    """Extract, validate, and normalize URLs from text."""
    extractor = URLExtract()
    raw_urls = extractor.find_urls(text)
    valid_urls = []
    for url in raw_urls:
        # Skip emails
        if '@' in url:
            continue
        # Add scheme if missing
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url
        # Validate
        parsed = urlparse(url)
        if parsed.scheme and parsed.netloc:
            valid_urls.append(url)
    # Remove duplicates while preserving order
    return list(dict.fromkeys(valid_urls))

text = """
Visit example.com for info.
Contact: support@company.com
Source: https://github.com/user/repo
Mirror: github.com/user/repo
"""
print(extract_and_validate(text))
# ['https://example.com', 'https://github.com/user/repo']
```
## Method Comparison

| Method | Accuracy | Speed | Dependencies | Best For |
|---|---|---|---|---|
| Regex | ⭐⭐⭐ | ⚡ Fast | None | Quick scripts |
| urlextract | ⭐⭐⭐⭐⭐ | Medium | Package | Production use |
| BeautifulSoup | ⭐⭐⭐⭐⭐ | Medium | Package | HTML content |
| urlparse | N/A | ⚡ Fast | Built-in | Validation only |
## Summary

- Use regex for quick extraction in controlled environments.
- For production applications, `urlextract` provides accurate results with proper TLD handling.
- When processing HTML, use BeautifulSoup to extract `href` and `src` attributes.
- Always validate and normalize extracted URLs before using them.