How to Convert HTML to Plain Text in Python
Extracting readable text from HTML is essential for web scraping, content indexing, NLP preprocessing, and data mining. This involves stripping tags while preserving meaningful structure and removing scripts, styles, and other non-content elements.
This guide covers the main approaches using BeautifulSoup, html2text, and built-in solutions.
Using BeautifulSoup (Recommended)
BeautifulSoup's get_text() method strips all HTML tags and returns plain text:
```bash
pip install beautifulsoup4
```
```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: concatenates text directly
print(soup.get_text())
# Output: HelloWorld (words merged - not ideal)

# With separator: adds a space between elements
print(soup.get_text(separator=" "))
# Output: Hello World

# With strip: removes surrounding whitespace from each string
print(soup.get_text(separator=" ", strip=True))
# Output: Hello World
```
Removing Scripts and Styles
Real web pages contain JavaScript, CSS, and other non-content elements that must be removed first:
```python
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>News Article</title>
<script>var tracking = true; analytics.init();</script>
<style>body { font-family: Arial; } .hidden { display: none; }</style>
</head>
<body>
<h1>Breaking News</h1>
<nav>Home | About | Contact</nav>
<p>Python 3.12 has been released with new features.</p>
<script>console.log('page loaded');</script>
<noscript>Enable JavaScript</noscript>
<footer>Copyright 2024</footer>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove unwanted tags completely
for tag in soup(["script", "style", "meta", "noscript", "header", "footer", "nav"]):
    tag.decompose()

# Extract clean text
text = soup.get_text(separator="\n", strip=True)
print(text)
```
Output:
```
News Article
Breaking News
Python 3.12 has been released with new features.
```
decompose() destroys the tag and its contents completely. extract() removes and returns the tag, allowing you to use it elsewhere. For simple text extraction, decompose() is more efficient.
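The difference is easy to see in a short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><script>x=1;</script><p>Keep me</p></div>", "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.script.extract()
print(removed)          # <script>x=1;</script>
print(soup.get_text())  # Keep me
```

With decompose() the same removal happens, but `removed` would be an emptied, unusable tag.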
Complete HTML Cleaning Function
```python
from bs4 import BeautifulSoup
import re

def html_to_text(html, preserve_links=False):
    """
    Convert HTML to clean plain text.

    Args:
        html: HTML string to convert
        preserve_links: If True, append URLs in parentheses
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-content tags
    for tag in soup(["script", "style", "meta", "noscript", "svg", "canvas"]):
        tag.decompose()

    # Optionally preserve link URLs
    if preserve_links:
        for a in soup.find_all('a', href=True):
            a.string = f"{a.get_text()} ({a['href']})"

    # Get text with line breaks between block elements
    text = soup.get_text(separator="\n", strip=True)

    # Normalize whitespace
    text = re.sub(r'\n\s*\n', '\n\n', text)  # Multiple newlines to double
    text = re.sub(r' +', ' ', text)          # Multiple spaces to single
    return text.strip()

# Usage
html = """
<article>
<h1>Welcome</h1>
<p>Visit our <a href="https://example.com">website</a> for more info.</p>
<script>track();</script>
</article>
"""
print(html_to_text(html, preserve_links=True))
```
Output:
```
Welcome
Visit our
website (https://example.com)
for more info.
```
Using html2text (Markdown Output)
When you want to preserve formatting as Markdown:
```bash
pip install html2text
```
```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False
converter.ignore_images = True
converter.body_width = 0  # Don't wrap lines

html = """
<h1>Python Guide</h1>
<p>Learn <b>Python</b> programming with our <i>comprehensive</i> tutorial.</p>
<ul>
<li>Variables</li>
<li>Functions</li>
<li>Classes</li>
</ul>
<p>Visit <a href="https://python.org">Python.org</a> for documentation.</p>
"""

markdown = converter.handle(html)
print(markdown)
```
Output:
```
# Python Guide

Learn **Python** programming with our _comprehensive_ tutorial.

  * Variables
  * Functions
  * Classes

Visit [Python.org](https://python.org) for documentation.
```
Configuring html2text
```python
import html2text

converter = html2text.HTML2Text()

# Common options
converter.ignore_links = True      # Remove links entirely
converter.ignore_images = True     # Remove image references
converter.ignore_emphasis = True   # Remove bold/italic markers
converter.body_width = 80          # Wrap text at 80 chars (0 = no wrap)
converter.skip_internal_links = True
converter.inline_links = True      # [text](url) vs [text][ref]

html = "<p>Check <b>this</b> <a href='#'>link</a></p>"
print(converter.handle(html))
```
Standard Library Solution
For simple cases without external dependencies:
```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;
        # before handle_data is called, so no manual unescaping is needed
        super().__init__()
        self.text = []
        self.skip_tags = {'script', 'style', 'noscript'}
        self.current_skip = False

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.current_skip = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.current_skip = False

    def handle_data(self, data):
        if not self.current_skip:
            self.text.append(data.strip())

    def get_text(self):
        return ' '.join(filter(None, self.text))

def html_to_text_stdlib(html_content):
    """Extract text using only the standard library."""
    parser = HTMLTextExtractor()
    parser.feed(html_content)
    return parser.get_text()

html = "<div><script>x=1;</script><p>Hello</p><p>World</p></div>"
print(html_to_text_stdlib(html))
```
Output:
```
Hello World
```
The built-in parser doesn't handle malformed HTML as gracefully as BeautifulSoup or lxml. Use external libraries for production web scraping.
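For instance, BeautifulSoup quietly repairs unclosed tags and stray markup that would confuse a naive parser:

```python
from bs4 import BeautifulSoup

# Unclosed <p> and <b> tags, typical of real-world HTML
broken = "<div><p>First<p>Second</div><b>Bold"
broken_soup = BeautifulSoup(broken, "html.parser")
print(broken_soup.get_text(separator=" ", strip=True))
# Output: First Second Bold
```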
Handling HTML Entities
```python
from bs4 import BeautifulSoup
from html import unescape

html = "<p>Price: &pound;50 &amp; &euro;60 &mdash; great deal!</p>"

# BeautifulSoup decodes entities automatically
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

# Or use html.unescape directly on entity-encoded text
raw_text = "Cost: &lt;$100&gt; &amp; free shipping"
print(unescape(raw_text))
```

Output:
```
Price: £50 & €60 — great deal!
Cost: <$100> & free shipping
```
Summary
| Tool | Output Format | Best For |
|---|---|---|
| BeautifulSoup | Plain text | NLP, search indexing, data extraction |
| html2text | Markdown | Blog conversion, preserving structure |
| lxml | Plain text | High-performance batch processing |
| Standard library | Plain text | Simple cases, no dependencies |
| Regex | N/A | Avoid - breaks on complex HTML |
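The lxml row in the table refers to its text_content() method; a minimal sketch (requires pip install lxml), using drop_tree() to discard non-content elements first:

```python
from lxml import html as lxml_html

tree = lxml_html.fromstring("<div><script>x=1;</script><p>Fast extraction</p></div>")

# Drop script/style elements before extracting text
for el in tree.xpath("//script | //style"):
    el.drop_tree()

print(tree.text_content().strip())
# Output: Fast extraction
```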
For most text extraction tasks, use BeautifulSoup.get_text(separator=' ', strip=True) after removing script/style tags. It handles malformed HTML gracefully and produces clean output suitable for NLP and analysis pipelines.