How to Convert HTML to Plain Text in Python
Extracting readable text from HTML is essential for web scraping, content indexing, NLP preprocessing, and data mining. This involves stripping tags while preserving meaningful structure and removing scripts, styles, and other non-content elements.
This guide covers the main approaches using BeautifulSoup, html2text, and built-in solutions.
Using BeautifulSoup (Recommended)
BeautifulSoup's get_text() method strips all HTML tags and returns plain text:
```bash
pip install beautifulsoup4
```
```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: concatenates text directly
print(soup.get_text())
# Output: HelloWorld (words merged - not ideal)

# With separator: adds a space between elements
print(soup.get_text(separator=" "))
# Output: Hello World

# With strip: removes surrounding whitespace from each string
print(soup.get_text(separator=" ", strip=True))
# Output: Hello World
```
Removing Scripts and Styles
Real web pages contain JavaScript, CSS, and other non-content elements that must be removed first:
```python
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>News Article</title>
<script>var tracking = true; analytics.init();</script>
<style>body { font-family: Arial; } .hidden { display: none; }</style>
</head>
<body>
<h1>Breaking News</h1>
<nav>Home | About | Contact</nav>
<p>Python 3.12 has been released with new features.</p>
<script>console.log('page loaded');</script>
<noscript>Enable JavaScript</noscript>
<footer>Copyright 2024</footer>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove unwanted tags completely
for tag in soup(["script", "style", "meta", "noscript", "header", "footer", "nav"]):
    tag.decompose()

# Extract clean text
text = soup.get_text(separator="\n", strip=True)
print(text)
```
Output:
```
News Article
Breaking News
Python 3.12 has been released with new features.
```
decompose() destroys the tag and its contents completely. extract() removes and returns the tag, allowing you to use it elsewhere. For simple text extraction, decompose() is more efficient.
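The difference is easy to see in a short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><script>x=1;</script><p>Keep me</p></div>", "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.script.extract()
print(removed)          # <script>x=1;</script>
print(soup.get_text())  # Keep me
```

With decompose() the same removal happens, but `removed` would be an emptied, unusable tag.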
Complete HTML Cleaning Function
```python
from bs4 import BeautifulSoup
import re

def html_to_text(html, preserve_links=False):
    """
    Convert HTML to clean plain text.

    Args:
        html: HTML string to convert
        preserve_links: If True, append URLs in parentheses
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-content tags
    for tag in soup(["script", "style", "meta", "noscript", "svg", "canvas"]):
        tag.decompose()

    # Optionally preserve link URLs
    if preserve_links:
        for a in soup.find_all('a', href=True):
            a.string = f"{a.get_text()} ({a['href']})"

    # Get text with line breaks between block elements
    text = soup.get_text(separator="\n", strip=True)

    # Normalize whitespace
    text = re.sub(r'\n\s*\n', '\n\n', text)  # Multiple newlines to double
    text = re.sub(r' +', ' ', text)          # Multiple spaces to single
    return text.strip()

# Usage
html = """
<article>
<h1>Welcome</h1>
<p>Visit our <a href="https://example.com">website</a> for more info.</p>
<script>track();</script>
</article>
"""
print(html_to_text(html, preserve_links=True))
```
Output:
```
Welcome
Visit our
website (https://example.com)
for more info.
```
Using html2text (Markdown Output)
When you want to preserve formatting as Markdown:
```bash
pip install html2text
```
```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False
converter.ignore_images = True
converter.body_width = 0  # Don't wrap lines

html = """
<h1>Python Guide</h1>
<p>Learn <b>Python</b> programming with our <i>comprehensive</i> tutorial.</p>
<ul>
<li>Variables</li>
<li>Functions</li>
<li>Classes</li>
</ul>
<p>Visit <a href="https://python.org">Python.org</a> for documentation.</p>
"""

markdown = converter.handle(html)
print(markdown)
```
Output:
```
# Python Guide

Learn **Python** programming with our _comprehensive_ tutorial.

  * Variables
  * Functions
  * Classes

Visit [Python.org](https://python.org) for documentation.
```
Configuring html2text
```python
import html2text

converter = html2text.HTML2Text()

# Common options
converter.ignore_links = True      # Remove links entirely
converter.ignore_images = True     # Remove image references
converter.ignore_emphasis = True   # Remove bold/italic markers
converter.body_width = 80          # Wrap text at 80 chars (0 = no wrap)
converter.skip_internal_links = True
converter.inline_links = True      # [text](url) vs [text][ref]

html = "<p>Check <b>this</b> <a href='#'>link</a></p>"
print(converter.handle(html))
```
Standard Library Solution
For simple cases without external dependencies:
```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;
        # before handle_data is called, so no manual unescaping is needed
        super().__init__()
        self.text = []
        self.skip_tags = {'script', 'style', 'noscript'}
        self.current_skip = False

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.current_skip = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.current_skip = False

    def handle_data(self, data):
        if not self.current_skip:
            self.text.append(data.strip())

    def get_text(self):
        return ' '.join(filter(None, self.text))

def html_to_text_stdlib(html_content):
    """Extract text using only the standard library."""
    parser = HTMLTextExtractor()
    parser.feed(html_content)
    return parser.get_text()

html = "<div><script>x=1;</script><p>Hello</p><p>World</p></div>"
print(html_to_text_stdlib(html))
```
Output:
```
Hello World
```
The built-in parser doesn't handle malformed HTML as gracefully as BeautifulSoup or lxml. Use external libraries for production web scraping.
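For instance, BeautifulSoup quietly repairs unclosed tags and stray markup that would confuse a naive parser:

```python
from bs4 import BeautifulSoup

# Unclosed <p> and <b> tags, typical of real-world HTML
broken = "<div><p>First<p>Second</div><b>Bold"
broken_soup = BeautifulSoup(broken, "html.parser")
print(broken_soup.get_text(separator=" ", strip=True))
# Output: First Second Bold
```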
Handling HTML Entities
```python
from bs4 import BeautifulSoup
from html import unescape

html = "<p>Price: &pound;50 &amp; &euro;60 &mdash; great deal!</p>"

# BeautifulSoup decodes entities automatically
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

# Or use html.unescape directly on entity-encoded text
raw_text = "Cost: &lt;$100&gt; &amp; free shipping"
print(unescape(raw_text))
```

Output:
```
Price: £50 & €60 — great deal!
Cost: <$100> & free shipping
```
Summary
| Tool | Output Format | Best For |
|---|---|---|
| BeautifulSoup | Plain text | NLP, search indexing, data extraction |
| html2text | Markdown | Blog conversion, preserving structure |
| lxml | Plain text | High-performance batch processing |
| Standard library | Plain text | Simple cases, no dependencies |
| Regex | N/A | Avoid - breaks on complex HTML |
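The lxml row in the table refers to its text_content() method; a minimal sketch (requires pip install lxml), using drop_tree() to discard non-content elements first:

```python
from lxml import html as lxml_html

tree = lxml_html.fromstring("<div><script>x=1;</script><p>Fast extraction</p></div>")

# Drop script/style elements before extracting text
for el in tree.xpath("//script | //style"):
    el.drop_tree()

print(tree.text_content().strip())
# Output: Fast extraction
```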
For most text extraction tasks, use BeautifulSoup.get_text(separator=' ', strip=True) after removing script/style tags. It handles malformed HTML gracefully and produces clean output suitable for NLP and analysis pipelines.