How to Extract Text from HTML in Python
BeautifulSoup provides reliable methods to strip HTML tags and extract clean, readable text from web pages and HTML documents.
Basic Text Extraction
The get_text() method removes all HTML tags and returns the text content:
from bs4 import BeautifulSoup
html = "<p>Hello <b>World</b>!</p>"
soup = BeautifulSoup(html, "lxml")
text = soup.get_text()
print(text) # Hello World!
Install the required packages:
pip install beautifulsoup4 lxml
Preventing Text Concatenation
Without a separator, adjacent elements merge their text:
from bs4 import BeautifulSoup
html = "<div>First</div><div>Second</div>"
soup = BeautifulSoup(html, "lxml")
# Without separator - words merge
print(soup.get_text()) # FirstSecond
# With separator - clean spacing
print(soup.get_text(separator=" ")) # First Second
# With strip to remove extra whitespace
print(soup.get_text(separator=" ", strip=True)) # First Second
Always use separator=" " and strip=True for clean output. This handles nested tags, line breaks, and inconsistent whitespace in the source HTML.
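As a quick illustration of why this matters, nested tags and ragged source-level whitespace collapse cleanly with these two options:

```python
from bs4 import BeautifulSoup

# Nested tags plus line breaks and extra spaces in the source HTML
html = """
<div>
    <p>Deeply   <em>nested</em>
    text</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# strip=True trims each text fragment; separator=" " joins them cleanly
print(soup.get_text(separator=" ", strip=True))  # Deeply nested text
```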
Removing Script and Style Content
Script and style tags contain code, not readable text. Remove them before extraction:
from bs4 import BeautifulSoup
html = """
<html>
<head>
<style>body { color: red; }</style>
<script>alert('hello');</script>
</head>
<body>
<p>Actual content here.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
# Remove unwanted tags
for element in soup(["script", "style", "meta", "link"]):
    element.decompose()
text = soup.get_text(separator=" ", strip=True)
print(text) # Actual content here.
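Related: decompose() destroys the tag outright, while extract() (also part of BeautifulSoup's API) removes it from the tree but returns it, which is useful if you want to inspect the removed scripts later:

```python
from bs4 import BeautifulSoup

html = "<div><script>tracking();</script><p>Content</p></div>"
soup = BeautifulSoup(html, "lxml")

# extract() detaches each tag and hands it back instead of destroying it
removed = [tag.extract() for tag in soup(["script"])]
print(soup.get_text(strip=True))  # Content
print(len(removed))               # 1
```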
Extracting from Specific Elements
Use CSS selectors to target particular sections:
from bs4 import BeautifulSoup
html = """
<article>
<header>Article Title</header>
<div class="content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
<footer>Author info</footer>
</article>
"""
soup = BeautifulSoup(html, "lxml")
# Get only content div text
content = soup.select_one(".content")
print(content.get_text(separator=" ", strip=True))
# First paragraph. Second paragraph.
# Iterate over specific elements
for p in soup.select(".content p"):
    print(p.get_text())
Output:
First paragraph. Second paragraph.
First paragraph.
Second paragraph.
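If you want each text fragment separately rather than one joined string, BeautifulSoup's stripped_strings generator yields every non-empty, whitespace-trimmed text node:

```python
from bs4 import BeautifulSoup

html = '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html, "lxml")

# stripped_strings skips blank nodes and trims the rest
fragments = list(soup.stripped_strings)
print(fragments)  # ['First paragraph.', 'Second paragraph.']
```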
Complete Extraction Function
A reusable function for clean text extraction:
from bs4 import BeautifulSoup
import re
def extract_text(html, selector=None):
    """Extract clean text from HTML."""
    soup = BeautifulSoup(html, "lxml")
    # Remove non-content elements
    for tag in soup(["script", "style", "meta", "link", "noscript"]):
        tag.decompose()
    # Target specific element if selector provided
    if selector:
        element = soup.select_one(selector)
        if not element:
            return ""
        soup = element
    # Extract and clean text
    text = soup.get_text(separator=" ", strip=True)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
# Usage
html = "<div><p>Hello World</p><script>code();</script></div>"
print(extract_text(html)) # Hello World
Extracting Text with Structure
Preserve some structure by handling specific tags:
from bs4 import BeautifulSoup
def extract_with_newlines(html):
    """Extract text preserving paragraph breaks."""
    soup = BeautifulSoup(html, "lxml")
    # Remove scripts and styles
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Add newlines after block elements
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for tag in soup.find_all(["p", "div", "h1", "h2", "h3", "li"]):
        tag.insert_after("\n")
    text = soup.get_text()
    # Clean up multiple newlines
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul>
<li>Item one</li>
<li>Item two</li>
</ul>
"""
print(extract_with_newlines(html))
Output:
Title
First paragraph.
Second paragraph.
Item one
Item two
Handling Encoding Issues
Specify encoding for proper character handling:
from bs4 import BeautifulSoup
# From bytes with encoding
html_bytes = b"<p>Caf\xe9</p>"
soup = BeautifulSoup(html_bytes, "lxml", from_encoding="latin-1")
print(soup.get_text()) # Café
# From file
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")
    text = soup.get_text(separator=" ", strip=True)
Always specify encoding when reading HTML files or byte content. Incorrect encoding causes garbled text or extraction failures.
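When the encoding is unknown, BeautifulSoup's bundled UnicodeDammit class can guess it. This is a sketch; detection is heuristic, and the guessed encoding may vary depending on which optional detection libraries (such as chardet) are installed:

```python
from bs4 import UnicodeDammit

raw = b"<p>Caf\xe9</p>"  # latin-1 bytes, encoding not declared anywhere
dammit = UnicodeDammit(raw)

print(dammit.unicode_markup)     # the markup decoded to str
print(dammit.original_encoding)  # the encoding UnicodeDammit guessed
```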
Parser Comparison
| Parser | Speed | Lenient | Install |
|---|---|---|---|
| lxml | ⚡ Fast | ✅ Yes | pip install lxml |
| html.parser | Medium | ✅ Yes | Built-in |
| html5lib | 🐢 Slow | ✅✅ Very | pip install html5lib |
Use lxml for most cases. Fall back to html5lib for extremely malformed HTML that other parsers struggle with.
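Parsers can disagree on malformed markup, so a quick way to compare them is to feed each one the same broken fragment. The sketch below uses the two parsers that are commonly available; html5lib could be added to the tuple if it is installed:

```python
from bs4 import BeautifulSoup

# The same malformed fragment, run through two parsers
broken = "<p>Unclosed paragraph<p>Another"
for parser in ("lxml", "html.parser"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", soup.get_text(separator=" ", strip=True))
```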
Summary
- Use BeautifulSoup's get_text() with separator=" " and strip=True for clean extraction.
- Always remove <script> and <style> tags first to avoid including code in your output.
- For targeted extraction, use CSS selectors to focus on specific content areas.