How to Extract URLs from a String in Python

Finding links embedded in text is a common task in web scraping, log analysis, and content processing. Python offers several approaches with different accuracy and complexity trade-offs.

Using Regular Expressions

A regex pattern handles most common URL formats quickly:

import re

text = "Visit https://example.com or check www.blog.org for more info"

pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(pattern, text)

print(urls)
# ['https://example.com', 'www.blog.org']

For more comprehensive matching including various protocols:

import re

def extract_urls(text):
    """Extract URLs with common protocols."""
    pattern = r'''
        (?:https?|ftp)://        # Protocol
        (?:[\w-]+\.)+[\w-]+      # Domain
        (?:/[^\s<>"]*)?          # Path (optional)
        |
        www\.[\w-]+\.[\w.-]+     # www URLs
        (?:/[^\s<>"]*)?
    '''
    return re.findall(pattern, text, re.VERBOSE)

text = """
Check https://example.com/path?query=1
Also ftp://files.server.com/data
And www.github.com/user/repo
"""

print(extract_urls(text))
# ['https://example.com/path?query=1', 'ftp://files.server.com/data', 'www.github.com/user/repo']
Warning: Regex-based extraction can miss edge cases such as URLs without protocols (example.com) or newer TLDs (.pizza, .io). It may also incorrectly match partial strings that merely look like URLs.
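To see this limitation concretely, here is the simple pattern from the first example applied to text containing only bare domains; it finds nothing:

```python
import re

# The simple pattern requires an explicit "http(s)://" or "www." prefix
pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'

text = "See example.com and blog.pizza for details"
print(re.findall(pattern, text))
# []
```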

Using urlextract Library

For production accuracy, the urlextract library handles edge cases including TLD validation:

First install the package:

pip install urlextract

Then:

from urlextract import URLExtract

extractor = URLExtract()

text = "Visit example.com, check github.io/path, or email test@mail.com"

urls = extractor.find_urls(text)
print(urls)
# ['example.com', 'github.io/path,']
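Note the trailing comma in 'github.io/path,' above: punctuation that directly abuts a URL can end up in the extracted string. A simple post-processing step strips it (shown here on a literal list so it runs without the library):

```python
# Extracted URLs sometimes carry trailing sentence punctuation
raw = ['example.com', 'github.io/path,']

cleaned = [url.rstrip('.,;:!?') for url in raw]
print(cleaned)
# ['example.com', 'github.io/path']
```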

Filter out email addresses if needed:

from urlextract import URLExtract

def extract_urls_only(text):
    """Extract URLs, excluding email addresses."""
    extractor = URLExtract()
    urls = extractor.find_urls(text)
    return [url for url in urls if '@' not in url]

text = "Contact support@example.com or visit example.com"
print(extract_urls_only(text))
# ['example.com']
Tip: urlextract maintains an updated TLD list, correctly identifying modern domains like app.dev, site.io, or blog.pizza that simple regex patterns miss.

Extracting URLs from HTML

When working with HTML, use BeautifulSoup to extract links properly:

from bs4 import BeautifulSoup

html = """
<html>
<a href="https://example.com">Link 1</a>
<a href="/relative/path">Link 2</a>
<img src="https://cdn.example.com/image.png">
</html>
"""

soup = BeautifulSoup(html, "html.parser")  # or "lxml" if installed

# Extract all href values
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links)
# ['https://example.com', '/relative/path']

# Extract all src values
sources = [img.get('src') for img in soup.find_all('img', src=True)]
print(sources)
# ['https://cdn.example.com/image.png']
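Relative links such as /relative/path are only useful once resolved against the page's base URL; the standard library's urljoin handles this (the base URL below is an assumed example):

```python
from urllib.parse import urljoin

base = "https://example.com/docs/page.html"

links = ["https://example.com", "/relative/path", "other.html"]
absolute = [urljoin(base, link) for link in links]
print(absolute)
# ['https://example.com', 'https://example.com/relative/path', 'https://example.com/docs/other.html']
```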

Validating Extracted URLs

Verify that extracted strings are properly formatted URLs:

from urllib.parse import urlparse

def is_valid_url(url):
    """Check if string is a valid URL with scheme and domain."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

urls = [
    "https://example.com/path",
    "not-a-url",
    "ftp://files.server.com",
    "www.example.com",  # Missing scheme
]

for url in urls:
    print(f"{url}: {is_valid_url(url)}")

Output:

https://example.com/path: True
not-a-url: False
ftp://files.server.com: True
www.example.com: False
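One caveat: urlparse performs no character-level validation, so a string like "https://???" passes the scheme-and-netloc check above. If that matters, a stricter helper (hypothetical, not part of the standard library) can additionally match the hostname against a domain pattern:

```python
import re
from urllib.parse import urlparse

# Hypothetical stricter check: hostname must look like a dotted domain
DOMAIN_RE = re.compile(r'^(?:[\w-]+\.)+[a-zA-Z]{2,}$')

def is_plausible_url(url):
    """Require a scheme and a domain-shaped hostname."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    host = parsed.netloc.split(':')[0]  # drop any port
    return bool(parsed.scheme) and bool(DOMAIN_RE.match(host))

print(is_plausible_url("https://example.com/path"))  # True
print(is_plausible_url("https://???"))               # False
```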

Normalizing Extracted URLs

Clean and standardize URLs after extraction:

from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Add scheme if missing and normalize."""
    if not url.startswith(('http://', 'https://', 'ftp://')):
        url = 'https://' + url

    parsed = urlparse(url)

    # Lowercase domain
    normalized = parsed._replace(netloc=parsed.netloc.lower())

    # Remove trailing slash from path
    path = normalized.path.rstrip('/') if normalized.path != '/' else '/'
    normalized = normalized._replace(path=path)

    return urlunparse(normalized)

urls = ["WWW.EXAMPLE.COM", "https://site.com/path/", "example.org"]
for url in urls:
    print(normalize_url(url))

Output:

https://www.example.com
https://site.com/path
https://example.org

Complete Extraction Pipeline

Combine extraction, validation, and normalization:

from urlextract import URLExtract
from urllib.parse import urlparse

def extract_and_validate(text):
    """Extract, validate, and normalize URLs from text."""
    extractor = URLExtract()
    raw_urls = extractor.find_urls(text)

    valid_urls = []
    for url in raw_urls:
        # Skip emails
        if '@' in url:
            continue

        # Add scheme if missing
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url

        # Validate
        parsed = urlparse(url)
        if parsed.scheme and parsed.netloc:
            valid_urls.append(url)

    # Remove duplicates while preserving order (set() would shuffle results)
    return list(dict.fromkeys(valid_urls))

text = """
Visit example.com for info.
Contact: support@company.com
Source: https://github.com/user/repo
Mirror: github.com/user/repo
"""

print(extract_and_validate(text))
# ['https://example.com', 'https://github.com/user/repo']

Method Comparison

| Method        | Accuracy | Speed  | Dependencies | Best For        |
|---------------|----------|--------|--------------|-----------------|
| Regex         | ⭐⭐⭐      | ⚡ Fast | None         | Quick scripts   |
| urlextract    | ⭐⭐⭐⭐⭐    | Medium | Package      | Production use  |
| BeautifulSoup | ⭐⭐⭐⭐⭐    | Medium | Package      | HTML content    |
| urlparse      | N/A      | ⚡ Fast | Built-in     | Validation only |

Summary

  • Use regex for quick extraction in controlled environments.
  • For production applications, urlextract provides accurate results with proper TLD handling.
  • When processing HTML, use BeautifulSoup to extract href and src attributes.
  • Always validate and normalize extracted URLs before using them.