# How to Extract URLs from a String in Python
Finding links embedded in text is a common task in web scraping, log analysis, and content processing. Python offers several approaches with different accuracy and complexity trade-offs.
## Using Regular Expressions
A regex pattern handles most common URL formats quickly:
```python
import re

text = "Visit https://example.com or check www.blog.org for more info"
pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(pattern, text)
print(urls)
# ['https://example.com', 'www.blog.org']
```
For more comprehensive matching including various protocols:
```python
import re

def extract_urls(text):
    """Extract URLs with common protocols."""
    pattern = r'''
        (?:https?|ftp)://        # Protocol
        (?:[\w-]+\.)+[\w-]+      # Domain
        (?:/[^\s<>"]*)?          # Path (optional)
        |
        www\.[\w-]+\.[\w.-]+     # www URLs
        (?:/[^\s<>"]*)?
    '''
    return re.findall(pattern, text, re.VERBOSE)

text = """
Check https://example.com/path?query=1
Also ftp://files.server.com/data
And www.github.com/user/repo
"""
print(extract_urls(text))
# ['https://example.com/path?query=1', 'ftp://files.server.com/data', 'www.github.com/user/repo']
```
Regex-based extraction has limits: it misses URLs without a scheme or www prefix (example.com), and because it does not validate TLDs, it can match partial strings that merely look like URLs or drag trailing punctuation into the match.
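To make those pitfalls concrete, here is a short sketch (using a hypothetical ok.dev URL) where the simple pattern misses a bare domain and keeps a sentence-ending dot, plus a cheap `rstrip` cleanup:

```python
import re

pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
text = "See example.com and https://ok.dev/x."

matches = re.findall(pattern, text)
print(matches)   # ['https://ok.dev/x.'] -- bare domain missed, trailing dot kept

# Stripping common trailing punctuation is a cheap post-processing fix
cleaned = [m.rstrip('.,;:!?)') for m in matches]
print(cleaned)   # ['https://ok.dev/x']
```

Note that `rstrip` would also mangle rare URLs that legitimately end in these characters, so it is a heuristic, not a guarantee.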
## Using the urlextract Library
For production accuracy, the `urlextract` library handles edge cases including TLD validation:
```bash
pip install urlextract
```
```python
from urlextract import URLExtract

extractor = URLExtract()
text = "Visit example.com, check github.io/path, or email test@mail.com"
urls = extractor.find_urls(text)
print(urls)
# ['example.com', 'github.io/path,']
```
Filter out email addresses if needed:
```python
from urlextract import URLExtract

def extract_urls_only(text):
    """Extract URLs, excluding email addresses."""
    extractor = URLExtract()
    urls = extractor.find_urls(text)
    return [url for url in urls if '@' not in url]

text = "Contact support@example.com or visit example.com"
print(extract_urls_only(text))
# ['example.com']
```
`urlextract` maintains an updated TLD list, correctly identifying modern domains like app.dev, site.io, or blog.pizza that simple regex patterns miss.
## Extracting URLs from HTML
When working with HTML, use BeautifulSoup to extract links properly:
```python
from bs4 import BeautifulSoup

html = """
<html>
<a href="https://example.com">Link 1</a>
<a href="/relative/path">Link 2</a>
<img src="https://cdn.example.com/image.png">
</html>
"""
soup = BeautifulSoup(html, "lxml")

# Extract all href values
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links)
# ['https://example.com', '/relative/path']

# Extract all src values
sources = [img.get('src') for img in soup.find_all('img', src=True)]
print(sources)
# ['https://cdn.example.com/image.png']
```
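href values pulled from HTML are often relative, like /relative/path above. The standard library's `urljoin` resolves them against the page's URL (the base URL below is illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/articles/"

print(urljoin(base, "/relative/path"))  # https://example.com/relative/path
print(urljoin(base, "page.html"))       # https://example.com/articles/page.html
# Absolute URLs pass through unchanged
print(urljoin(base, "https://cdn.example.com/image.png"))
```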
## Validating Extracted URLs
Verify that extracted strings are properly formatted URLs:
```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Check if string is a valid URL with scheme and domain."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

urls = [
    "https://example.com/path",
    "not-a-url",
    "ftp://files.server.com",
    "www.example.com",  # Missing scheme
]

for url in urls:
    print(f"{url}: {is_valid_url(url)}")
```

Output:

```
https://example.com/path: True
not-a-url: False
ftp://files.server.com: True
www.example.com: False
```
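`urlparse` rarely raises; it parses almost any string without complaint, which is why the check above requires both scheme and netloc rather than relying on an exception. A small sketch of what each field catches:

```python
from urllib.parse import urlparse

cases = {
    "https:///path-only": False,  # scheme present, but netloc is empty
    "//example.com/x": False,     # netloc present, but no scheme
    "https://example.com": True,  # both present
}
for url, expected in cases.items():
    parsed = urlparse(url)
    assert bool(parsed.scheme and parsed.netloc) == expected
```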
## Normalizing Extracted URLs
Clean and standardize URLs after extraction:
```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Add scheme if missing and normalize."""
    if not url.startswith(('http://', 'https://', 'ftp://')):
        url = 'https://' + url
    parsed = urlparse(url)
    # Lowercase domain
    normalized = parsed._replace(netloc=parsed.netloc.lower())
    # Remove trailing slash from path
    path = normalized.path.rstrip('/') if normalized.path != '/' else '/'
    normalized = normalized._replace(path=path)
    return urlunparse(normalized)

urls = ["WWW.EXAMPLE.COM", "https://site.com/path/", "example.org"]
for url in urls:
    print(normalize_url(url))
```

Output:

```
https://www.example.com
https://site.com/path
https://example.org
```
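Normalization can go further when URLs will be compared or deduplicated. A sketch (not part of the function above) that drops fragments and sorts query parameters so equivalent URLs compare equal:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(url):
    """Drop the fragment and sort query parameters for stable comparison."""
    parsed = urlparse(url)
    query = urlencode(sorted(parse_qsl(parsed.query)))
    return urlunparse(parsed._replace(query=query, fragment=''))

print(canonicalize("https://site.com/a?b=2&a=1#top"))
# https://site.com/a?a=1&b=2
```

Be aware that parameter order is significant to some servers, so apply this only where you control or know the target site's behavior.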
## Complete Extraction Pipeline
Combine extraction, validation, and normalization:
```python
from urlextract import URLExtract
from urllib.parse import urlparse

def extract_and_validate(text):
    """Extract, validate, and normalize URLs from text."""
    extractor = URLExtract()
    raw_urls = extractor.find_urls(text)
    valid_urls = []
    for url in raw_urls:
        # Skip emails
        if '@' in url:
            continue
        # Add scheme if missing
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url
        # Validate
        parsed = urlparse(url)
        if parsed.scheme and parsed.netloc:
            valid_urls.append(url)
    # Remove duplicates while preserving order
    return list(dict.fromkeys(valid_urls))

text = """
Visit example.com for info.
Contact: support@company.com
Source: https://github.com/user/repo
Mirror: github.com/user/repo
"""
print(extract_and_validate(text))
# ['https://example.com', 'https://github.com/user/repo']
```
## Method Comparison

| Method | Accuracy | Speed | Dependencies | Best For |
|---|---|---|---|---|
| Regex | ⭐⭐⭐ | ⚡ Fast | None | Quick scripts |
| urlextract | ⭐⭐⭐⭐⭐ | Medium | Package | Production use |
| BeautifulSoup | ⭐⭐⭐⭐⭐ | Medium | Package | HTML content |
| urlparse | N/A | ⚡ Fast | Built-in | Validation only |
## Summary

- Use regex for quick extraction in controlled environments.
- For production applications, `urlextract` provides accurate results with proper TLD handling.
- When processing HTML, use BeautifulSoup to extract `href` and `src` attributes.
- Always validate and normalize extracted URLs before using them.