How to Extract Text from HTML in Python
BeautifulSoup provides reliable methods to strip HTML tags and extract clean, readable text from web pages and HTML documents.
Basic Text Extraction
The get_text() method removes all HTML tags and returns the text content:
from bs4 import BeautifulSoup
html = "<p>Hello <b>World</b>!</p>"
soup = BeautifulSoup(html, "lxml")
text = soup.get_text()
print(text) # Hello World!
Install the required packages:
pip install beautifulsoup4 lxml
Preventing Text Concatenation
Without a separator, adjacent elements merge their text:
from bs4 import BeautifulSoup
html = "<div>First</div><div>Second</div>"
soup = BeautifulSoup(html, "lxml")
# Without separator - words merge
print(soup.get_text()) # FirstSecond
# With separator - clean spacing
print(soup.get_text(separator=" ")) # First Second
# With strip to remove extra whitespace
print(soup.get_text(separator=" ", strip=True)) # First Second
Always use separator=" " and strip=True for clean output. This handles nested tags, line breaks, and inconsistent whitespace in the source HTML.
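As a quick illustration of why this matters, nested tags and ragged source-level whitespace collapse cleanly with these two options:

```python
from bs4 import BeautifulSoup

# Nested tags plus line breaks and extra spaces in the source HTML
html = """
<div>
    <p>Deeply   <em>nested</em>
    text</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# strip=True trims each text fragment; separator=" " joins them cleanly
print(soup.get_text(separator=" ", strip=True))  # Deeply nested text
```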
Removing Script and Style Content
Script and style tags contain code, not readable text. Remove them before extraction:
from bs4 import BeautifulSoup
html = """
<html>
<head>
<style>body { color: red; }</style>
<script>alert('hello');</script>
</head>
<body>
<p>Actual content here.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
# Remove unwanted tags
for element in soup(["script", "style", "meta", "link"]):
    element.decompose()
text = soup.get_text(separator=" ", strip=True)
print(text) # Actual content here.
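Related: decompose() destroys the tag outright, while extract() (also part of BeautifulSoup's API) removes it from the tree but returns it, which is useful if you want to inspect the removed scripts later:

```python
from bs4 import BeautifulSoup

html = "<div><script>tracking();</script><p>Content</p></div>"
soup = BeautifulSoup(html, "lxml")

# extract() detaches each tag and hands it back instead of destroying it
removed = [tag.extract() for tag in soup(["script"])]
print(soup.get_text(strip=True))  # Content
print(len(removed))               # 1
```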
Extracting from Specific Elements
Use CSS selectors to target particular sections:
from bs4 import BeautifulSoup
html = """
<article>
<header>Article Title</header>
<div class="content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
<footer>Author info</footer>
</article>
"""
soup = BeautifulSoup(html, "lxml")
# Get only content div text
content = soup.select_one(".content")
print(content.get_text(separator=" ", strip=True))
# First paragraph. Second paragraph.
# Iterate over specific elements
for p in soup.select(".content p"):
    print(p.get_text())
Output:
First paragraph. Second paragraph.
First paragraph.
Second paragraph.
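If you want each text fragment separately rather than one joined string, BeautifulSoup's stripped_strings generator yields every non-empty, whitespace-trimmed text node:

```python
from bs4 import BeautifulSoup

html = '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html, "lxml")

# stripped_strings skips blank nodes and trims the rest
fragments = list(soup.stripped_strings)
print(fragments)  # ['First paragraph.', 'Second paragraph.']
```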
Complete Extraction Function
A reusable function for clean text extraction:
from bs4 import BeautifulSoup
import re
def extract_text(html, selector=None):
    """Extract clean text from HTML."""
    soup = BeautifulSoup(html, "lxml")
    # Remove non-content elements
    for tag in soup(["script", "style", "meta", "link", "noscript"]):
        tag.decompose()
    # Target specific element if selector provided
    if selector:
        element = soup.select_one(selector)
        if not element:
            return ""
        soup = element
    # Extract and clean text
    text = soup.get_text(separator=" ", strip=True)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
# Usage
html = "<div><p>Hello World</p><script>code();</script></div>"
print(extract_text(html)) # Hello World
Extracting Text with Structure
Preserve some structure by handling specific tags:
from bs4 import BeautifulSoup
def extract_with_newlines(html):
    """Extract text preserving paragraph breaks."""
    soup = BeautifulSoup(html, "lxml")
    # Remove scripts and styles
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Add newlines after block elements
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for tag in soup.find_all(["p", "div", "h1", "h2", "h3", "li"]):
        tag.insert_after("\n")
    text = soup.get_text()
    # Clean up multiple newlines
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul>
<li>Item one</li>
<li>Item two</li>
</ul>
"""
print(extract_with_newlines(html))
Output:
Title
First paragraph.
Second paragraph.
Item one
Item two
Handling Encoding Issues
Specify encoding for proper character handling:
from bs4 import BeautifulSoup
# From bytes with encoding
html_bytes = b"<p>Caf\xe9</p>"
soup = BeautifulSoup(html_bytes, "lxml", from_encoding="latin-1")
print(soup.get_text()) # Café
# From file
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")
    text = soup.get_text(separator=" ", strip=True)
Always specify encoding when reading HTML files or byte content. Incorrect encoding causes garbled text or extraction failures.
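When the encoding is unknown, BeautifulSoup's bundled UnicodeDammit class can guess it. This is a sketch; detection is heuristic, and the guessed encoding may vary depending on which optional detection libraries (such as chardet) are installed:

```python
from bs4 import UnicodeDammit

raw = b"<p>Caf\xe9</p>"  # latin-1 bytes, encoding not declared anywhere
dammit = UnicodeDammit(raw)

print(dammit.unicode_markup)     # the markup decoded to str
print(dammit.original_encoding)  # the encoding UnicodeDammit guessed
```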
Parser Comparison
| Parser | Speed | Lenient | Install |
|---|---|---|---|
| lxml | ⚡ Fast | ✅ Yes | pip install lxml |
| html.parser | Medium | ✅ Yes | Built-in |
| html5lib | 🐢 Slow | ✅✅ Very | pip install html5lib |
Use lxml for most cases. Fall back to html5lib for extremely malformed HTML that other parsers struggle with.
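Parsers can disagree on malformed markup, so a quick way to compare them is to feed each one the same broken fragment. The sketch below uses the two parsers that are commonly available; html5lib could be added to the tuple if it is installed:

```python
from bs4 import BeautifulSoup

# The same malformed fragment, run through two parsers
broken = "<p>Unclosed paragraph<p>Another"
for parser in ("lxml", "html.parser"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", soup.get_text(separator=" ", strip=True))
```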
Summary
- Use BeautifulSoup's get_text() with separator=" " and strip=True for clean extraction.
- Always remove <script> and <style> tags first to avoid including code in your output.
- For targeted extraction, use CSS selectors to focus on specific content areas.