
How to Extract Text from HTML in Python

BeautifulSoup provides reliable methods to strip HTML tags and extract clean, readable text from web pages and HTML documents.

Basic Text Extraction

The get_text() method removes all HTML tags and returns the text content:

from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b>!</p>"
soup = BeautifulSoup(html, "lxml")

text = soup.get_text()
print(text) # Hello World!

Install the required packages:

pip install beautifulsoup4 lxml

Preventing Text Concatenation

Without a separator, adjacent elements merge their text:

from bs4 import BeautifulSoup

html = "<div>First</div><div>Second</div>"
soup = BeautifulSoup(html, "lxml")

# Without separator - words merge
print(soup.get_text()) # FirstSecond

# With separator - clean spacing
print(soup.get_text(separator=" ")) # First Second

# With strip to remove extra whitespace
print(soup.get_text(separator=" ", strip=True)) # First Second
Tip: Always use separator=" " and strip=True for clean output. This handles nested tags, line breaks, and inconsistent whitespace in the source HTML.
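For example, nested tags and source-level indentation collapse cleanly when both options are set (a small illustration with made-up markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with nesting and stray whitespace
html = "<ul>\n  <li>One <b>bold</b></li>\n  <li>Two</li>\n</ul>"
soup = BeautifulSoup(html, "lxml")

# Without arguments, the raw newlines and indentation leak through
print(repr(soup.get_text()))

# With separator and strip, the pieces join cleanly
print(soup.get_text(separator=" ", strip=True))  # One bold Two
```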

Removing Script and Style Content

Script and style tags contain code, not readable text. Remove them before extraction:

from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <style>body { color: red; }</style>
    <script>alert('hello');</script>
  </head>
  <body>
    <p>Actual content here.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

# Remove unwanted tags
for element in soup(["script", "style", "meta", "link"]):
    element.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text) # Actual content here.

Extracting from Specific Elements

Use CSS selectors to target particular sections:

from bs4 import BeautifulSoup

html = """
<article>
  <header>Article Title</header>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <footer>Author info</footer>
</article>
"""

soup = BeautifulSoup(html, "lxml")

# Get only content div text
content = soup.select_one(".content")
print(content.get_text(separator=" ", strip=True))
# First paragraph. Second paragraph.

# Iterate over specific elements
for p in soup.select(".content p"):
    print(p.get_text())

Output:

First paragraph. Second paragraph.
First paragraph.
Second paragraph.
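If you prefer the find/find_all API over CSS selectors, the same extraction can be sketched like this (using the same hypothetical markup):

```python
from bs4 import BeautifulSoup

html = '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html, "lxml")

# find/find_all mirror select_one/select for simple tag-and-class lookups
content = soup.find("div", class_="content")
for p in content.find_all("p"):
    print(p.get_text())
```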

Complete Extraction Function

A reusable function for clean text extraction:

from bs4 import BeautifulSoup
import re

def extract_text(html, selector=None):
    """Extract clean text from HTML."""
    soup = BeautifulSoup(html, "lxml")

    # Remove non-content elements
    for tag in soup(["script", "style", "meta", "link", "noscript"]):
        tag.decompose()

    # Target specific element if selector provided
    if selector:
        element = soup.select_one(selector)
        if not element:
            return ""
        soup = element

    # Extract and clean text
    text = soup.get_text(separator=" ", strip=True)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

# Usage
html = "<div><p>Hello World</p><script>code();</script></div>"
print(extract_text(html)) # Hello World

Extracting Text with Structure

Preserve some structure by handling specific tags:

from bs4 import BeautifulSoup

def extract_with_newlines(html):
    """Extract text preserving paragraph breaks."""
    soup = BeautifulSoup(html, "lxml")

    # Remove scripts and styles
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Add newlines after block elements
    for br in soup.find_all("br"):
        br.replace_with("\n")

    for tag in soup.find_all(["p", "div", "h1", "h2", "h3", "li"]):
        tag.insert_after("\n")

    text = soup.get_text()

    # Clean up multiple newlines
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul>
  <li>Item one</li>
  <li>Item two</li>
</ul>
"""

print(extract_with_newlines(html))

Output:

Title
First paragraph.
Second paragraph.
Item one
Item two

Handling Encoding Issues

Specify encoding for proper character handling:

from bs4 import BeautifulSoup

# From bytes with encoding
html_bytes = b"<p>Caf\xe9</p>"
soup = BeautifulSoup(html_bytes, "lxml", from_encoding="latin-1")
print(soup.get_text()) # Café

# From file
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")
    text = soup.get_text(separator=" ", strip=True)
Warning: Always specify the encoding when reading HTML files or byte content. An incorrect encoding causes garbled text or extraction failures.

Parser Comparison

Parser        Speed    Lenient     Install
lxml          Fast     Yes         pip install lxml
html.parser   Medium   Yes         Built-in
html5lib      Slow     Very        pip install html5lib
Note: Use lxml for most cases. Fall back to html5lib for extremely malformed HTML that other parsers struggle with.
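One way to act on that advice is a small helper that tries lxml and falls back to the built-in parser when it is not installed (make_soup is a hypothetical name, not part of bs4):

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when available, else the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml not installed; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")

soup = make_soup("<p>Hello</p>")
print(soup.get_text())  # Hello
```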

Summary

  • Use BeautifulSoup's get_text() with separator=" " and strip=True for clean extraction.
  • Always remove <script> and <style> tags first to avoid including code in your output.
  • For targeted extraction, use CSS selectors to focus on specific content areas.