How to Convert HTML to Plain Text in Python

Extracting readable text from HTML is essential for web scraping, content indexing, NLP preprocessing, and data mining. This involves stripping tags while preserving meaningful structure and removing scripts, styles, and other non-content elements.

This guide covers the main approaches using BeautifulSoup, html2text, and built-in solutions.

Basic Extraction with BeautifulSoup

BeautifulSoup's get_text() method strips all HTML tags and returns plain text:

pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: concatenates text directly
print(soup.get_text())
# Output: HelloWorld (words merged - not ideal)

# With separator: adds space between elements
print(soup.get_text(separator=" "))
# Output: Hello World

# With strip: removes extra whitespace
print(soup.get_text(separator=" ", strip=True))
# Output: Hello World

Removing Scripts and Styles

Real web pages contain JavaScript, CSS, and other non-content elements that must be removed first:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>News Article</title>
<script>var tracking = true; analytics.init();</script>
<style>body { font-family: Arial; } .hidden { display: none; }</style>
</head>
<body>
<h1>Breaking News</h1>
<nav>Home | About | Contact</nav>
<p>Python 3.12 has been released with new features.</p>
<script>console.log('page loaded');</script>
<noscript>Enable JavaScript</noscript>
<footer>Copyright 2024</footer>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove unwanted tags completely
for tag in soup(["script", "style", "meta", "noscript", "header", "footer", "nav"]):
    tag.decompose()

# Extract clean text
text = soup.get_text(separator="\n", strip=True)
print(text)

Output:

News Article
Breaking News
Python 3.12 has been released with new features.
decompose() vs extract()

decompose() destroys the tag and its contents completely. extract() removes and returns the tag, allowing you to use it elsewhere. For simple text extraction, decompose() is more efficient.
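A small sketch of the difference, using a throwaway snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><script>x=1;</script><p>Body</p></div>", "html.parser")

# extract() removes the tag from the tree but returns it for later use
script = soup.script.extract()
print(script)           # <script>x=1;</script> -- still available
print(soup.get_text())  # Body

# decompose() destroys the tag and its contents entirely; returns None
soup2 = BeautifulSoup("<div><style>a{}</style><p>Body</p></div>", "html.parser")
soup2.style.decompose()
print(soup2.get_text())  # Body
```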

Complete HTML Cleaning Function

from bs4 import BeautifulSoup
import re

def html_to_text(html, preserve_links=False):
    """
    Convert HTML to clean plain text.

    Args:
        html: HTML string to convert
        preserve_links: If True, append URLs in parentheses
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-content tags
    for tag in soup(["script", "style", "meta", "noscript", "svg", "canvas"]):
        tag.decompose()

    # Optionally preserve link URLs
    if preserve_links:
        for a in soup.find_all('a', href=True):
            a.string = f"{a.get_text()} ({a['href']})"

    # Get text with line breaks between block elements
    text = soup.get_text(separator="\n", strip=True)

    # Normalize whitespace
    text = re.sub(r'\n\s*\n', '\n\n', text)  # collapse runs of blank lines
    text = re.sub(r' +', ' ', text)          # collapse repeated spaces

    return text.strip()

# Usage
html = """
<article>
<h1>Welcome</h1>
<p>Visit our <a href="https://example.com">website</a> for more info.</p>
<script>track();</script>
</article>
"""

print(html_to_text(html, preserve_links=True))

Output:

Welcome
Visit our
website (https://example.com)
for more info.

Using html2text (Markdown Output)

When you want to preserve formatting as Markdown:

pip install html2text
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False
converter.ignore_images = True
converter.body_width = 0 # Don't wrap lines

html = """
<h1>Python Guide</h1>
<p>Learn <b>Python</b> programming with our <i>comprehensive</i> tutorial.</p>
<ul>
<li>Variables</li>
<li>Functions</li>
<li>Classes</li>
</ul>
<p>Visit <a href="https://python.org">Python.org</a> for documentation.</p>
"""

markdown = converter.handle(html)
print(markdown)

Output:

# Python Guide

Learn **Python** programming with our _comprehensive_ tutorial.

  * Variables
  * Functions
  * Classes

Visit [Python.org](https://python.org) for documentation.

Configuring html2text

import html2text

converter = html2text.HTML2Text()

# Common options
converter.ignore_links = True # Remove links entirely
converter.ignore_images = True # Remove image references
converter.ignore_emphasis = True # Remove bold/italic
converter.body_width = 80 # Wrap text at 80 chars (0 = no wrap)
converter.skip_internal_links = True
converter.inline_links = True # [text](url) vs [text][ref]

html = "<p>Check <b>this</b> <a href='#'>link</a></p>"
print(converter.handle(html))

Standard Library Solution

For simple cases without external dependencies:

from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
        self.skip_tags = {'script', 'style', 'noscript'}
        self.current_skip = False

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.current_skip = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.current_skip = False

    def handle_data(self, data):
        if not self.current_skip:
            self.text.append(data.strip())

    def get_text(self):
        return ' '.join(filter(None, self.text))

def html_to_text_stdlib(html_content):
    """Extract text using only the standard library."""
    parser = HTMLTextExtractor()
    # Feed the raw HTML as-is: HTMLParser decodes entities itself
    # (convert_charrefs=True by default). Unescaping first would turn
    # encoded &lt;tags&gt; into real markup and corrupt the parse.
    parser.feed(html_content)
    return parser.get_text()

html = "<div><script>x=1;</script><p>Hello</p><p>World</p></div>"
print(html_to_text_stdlib(html))

Output:

Hello World
Standard Library Limitations

The built-in parser doesn't handle malformed HTML as gracefully as BeautifulSoup or lxml. Use external libraries for production web scraping.
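For example, BeautifulSoup quietly tolerates unclosed and stray tags that real pages are full of (a minimal sketch; the messy markup here is made up):

```python
from bs4 import BeautifulSoup

# Unclosed <p> tags and a stray closing </span> -- common in the wild
messy = "<div><p>First<p>Second</div></span>"

soup = BeautifulSoup(messy, "html.parser")
print(soup.get_text(separator=" ", strip=True))  # First Second
```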

Handling HTML Entities

from bs4 import BeautifulSoup
from html import unescape

html = "<p>Price: &pound;50 &amp; &euro;60 &mdash; great deal!</p>"

# BeautifulSoup handles entities automatically
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
# Output: Price: £50 & €60 — great deal!

# Or use html.unescape directly
raw_text = "Cost: &lt;$100&gt; &amp; free shipping"
print(unescape(raw_text))
# Output: Cost: <$100> & free shipping

Summary

Tool             | Output Format | Best For
BeautifulSoup    | Plain text    | NLP, search indexing, data extraction
html2text        | Markdown      | Blog conversion, preserving structure
lxml             | Plain text    | High-performance batch processing
Standard library | Plain text    | Simple cases, no dependencies
Regex            | N/A           | Avoid: breaks on complex HTML
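The table mentions lxml, which this guide doesn't cover in detail; a minimal sketch of its text_content() approach (requires pip install lxml):

```python
from lxml import html

doc = html.fromstring(
    "<html><body><script>x=1;</script><p>Fast</p><p>parsing</p></body></html>"
)

# Drop script/style subtrees before extracting text
for bad in doc.xpath("//script|//style"):
    bad.getparent().remove(bad)

# Like get_text() with no separator, adjacent words may merge
print(doc.text_content())  # Fastparsing
```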
Best Practice

For most text extraction tasks, use BeautifulSoup.get_text(separator=' ', strip=True) after removing script/style tags. It handles malformed HTML gracefully and produces clean output suitable for NLP and analysis pipelines.
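Putting that recommendation together in one place (a minimal sketch; the function name is illustrative):

```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Recommended default: strip non-content tags, then get_text()."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

print(extract_text("<body><script>x=1;</script><h1>Title</h1><p>Body text.</p></body>"))
# Title Body text.
```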