How to Resolve "Could Not Import pypandoc - Required to Package PySpark" in Python
When packaging or distributing a PySpark application, you may run into the error "Could not import pypandoc - required to package PySpark". This error stops the build process and can be confusing, especially since pypandoc isn't something you'd normally associate with Spark.
In this guide, we'll explain why this error happens, walk through multiple solutions from simplest to most comprehensive, and help you verify that everything is working correctly.
Why Does This Error Occur?
PySpark's setup.py uses pypandoc to convert the project's README.md file from Markdown to reStructuredText (RST) format during the packaging process. This conversion is needed because PyPI historically required package descriptions in RST format.
When you run a packaging command like:
python setup.py sdist
PySpark's setup script attempts to import pypandoc. If the library isn't installed, or if pypandoc is installed but can't find the underlying pandoc binary, the process fails with:
Could not import pypandoc - required to package PySpark
This error only affects packaging and distribution. If you're simply using PySpark as an end user (e.g., pip install pyspark), you typically won't encounter this issue. It primarily affects developers building PySpark from source or creating custom distributions.
Understanding the Dependency Chain
The key thing to understand is that there are two separate components that must both be present:
| Component | What It Is | How to Install |
|---|---|---|
| pypandoc | A Python wrapper library | pip install pypandoc |
| pandoc | The actual document converter binary | System package manager or manual download |
Installing pypandoc without pandoc is not enough: the import succeeds, but any conversion still raises an error. Both must be installed and accessible.
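To see which of the two components is missing, a small check along the following lines may help. It is only a sketch: `check_components` is a hypothetical helper name, and it looks for pandoc on PATH only, so a pandoc bundled inside the pypandoc-binary package would likely be reported as "not found" even though pypandoc can use it.

```python
import importlib.util
import shutil


def check_components():
    """Report which of the two required pieces this interpreter can see."""
    return {
        # True if the Python wrapper is importable from this environment
        "pypandoc": importlib.util.find_spec("pypandoc") is not None,
        # Full path of the pandoc executable on PATH, or None if absent
        "pandoc": shutil.which("pandoc"),
    }


status = check_components()
print(f"pypandoc importable: {status['pypandoc']}")
print(f"pandoc on PATH: {status['pandoc'] or 'not found'}")
```

Run it with the same interpreter that runs setup.py, since a different virtual environment may give a different answer.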
Solution 1: Install pypandoc via pip
The simplest first step is to install the Python package:
pip install pypandoc
After installation, verify it's available:
import pypandoc
print(pypandoc.__version__)
However, this alone may not be enough. If pandoc isn't installed on your system, you'll see a different error when pypandoc tries to use it:
OSError: No pandoc was found: either install pandoc and add it to your PATH or install pypandoc-binary.
Solution 2: Install pypandoc-binary (Easiest Complete Fix)
The pypandoc-binary package bundles the pandoc binary together with the Python wrapper, eliminating the need for a separate system installation:
pip install pypandoc-binary
This is the most straightforward solution because it handles both dependencies in a single command, with no system-level package manager required.
If you just need to get past this error quickly, pip install pypandoc-binary is usually the fastest fix on any platform the package supports.
Solution 3: Install pandoc Separately
If you prefer to install pandoc at the system level (or if pypandoc-binary doesn't support your platform), install pandoc using your operating system's package manager.
On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install pandoc
On macOS (using Homebrew):
brew install pandoc
On Windows:
Download the installer from the official Pandoc releases page and run it, or use Chocolatey:
choco install pandoc
After installation, verify that pandoc is accessible from the command line:
pandoc --version
Expected output:
pandoc 3.x.x
Compiled with pandoc-types ...
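The same check can be scripted, for example in a build step that should stop early when pandoc is missing. The sketch below assumes the first line of the output has the form shown above (a program name followed by a dotted version); `parse_pandoc_version` and `installed_pandoc_version` are hypothetical helper names.

```python
import re
import shutil
import subprocess


def parse_pandoc_version(output):
    """Extract a version tuple from the first line of `pandoc --version`."""
    first_line = output.splitlines()[0] if output else ""
    # Tolerate a "pandoc.exe" program name as well as plain "pandoc"
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_pandoc_version():
    """Return the version of the pandoc found on PATH, or None."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    return parse_pandoc_version(result.stdout)


print(installed_pandoc_version())
```

Returning a tuple rather than a string makes comparisons such as `version >= (2, 0)` straightforward.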
Solution 4: Use Conda
If you're using a Conda environment, you can install both components together:
conda install -c conda-forge pypandoc pandoc
This ensures version compatibility within the Conda ecosystem.
Solution 5: Fix PATH Issues
If pypandoc is installed but still can't find pandoc, the binary might not be in your system's PATH.
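On macOS/Linux, add the directory that contains pandoc to PATH in your shell configuration file (the example uses /usr/local/bin; substitute the directory where pandoc is actually installed):
# Add to ~/.bashrc, ~/.zshrc, or ~/.profile
export PATH="$PATH:/usr/local/bin"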
# Reload the configuration
source ~/.bashrc # or source ~/.zshrc
On Windows:
- Open System Properties → Advanced → Environment Variables.
- Under System variables, find and edit the PATH variable.
- Add the directory containing pandoc.exe (e.g., C:\Program Files\Pandoc).
- Restart your terminal.
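If editing the shell or system configuration is not an option (for instance in a one-off build script), PATH can also be adjusted for the current Python process only. This is a sketch under assumptions: `ensure_pandoc_on_path` is a hypothetical helper, and the candidate directories are illustrative guesses, not locations pandoc is guaranteed to use.

```python
import os
import shutil


def ensure_pandoc_on_path(candidate_dirs):
    """Make pandoc discoverable for the current process only.

    Returns the path of the pandoc executable, or None if none was found.
    """
    found = shutil.which("pandoc")
    if found:
        return found  # already reachable, nothing to change
    for directory in candidate_dirs:
        # shutil.which accepts an explicit search path via `path=`
        found = shutil.which("pandoc", path=directory)
        if found:
            # Prepend so this directory wins for child processes too
            os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")
            return found
    return None


# Hypothetical install locations; adjust for your machine.
CANDIDATES = ["/usr/local/bin", "/opt/homebrew/bin", r"C:\Program Files\Pandoc"]
print(ensure_pandoc_on_path(CANDIDATES))
```

The change lasts only as long as the process, so it does not replace the permanent configuration described above.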
Verifying the Fix
After applying any of the solutions above, run this verification script to confirm everything works:
import sys

# Step 1: Verify pypandoc import
try:
    import pypandoc
    print(f"✅ pypandoc {pypandoc.__version__} is installed.")
except ImportError:
    print("❌ pypandoc is NOT installed.")
    sys.exit(1)

# Step 2: Verify pandoc binary is accessible
try:
    pandoc_version = pypandoc.get_pandoc_version()
    print(f"✅ pandoc {pandoc_version} is available.")
except OSError:
    print("❌ pandoc binary is NOT found.")
    sys.exit(1)

# Step 3: Test a conversion
try:
    output = pypandoc.convert_text('# Hello World', 'rst', format='md')
    print("✅ pypandoc conversion works correctly.")
    print(f" Output: {output.strip()}")
except Exception as e:
    print(f"❌ Conversion failed: {e}")
Expected output:
✅ pypandoc 1.13 is installed.
✅ pandoc 3.1.9 is available.
✅ pypandoc conversion works correctly.
Output: Hello World
===========
Alternative: Bypass pypandoc Entirely
If you're building a custom PySpark distribution and don't need the README conversion, you can modify setup.py to skip the pypandoc dependency:
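try:
    import pypandoc
    long_description = pypandoc.convert_file('README.md', 'rst')
except (ImportError, OSError):
    # Fall back to the raw Markdown if pypandoc or pandoc is unavailable
    with open('README.md', encoding='utf-8') as readme:
        long_description = readme.read()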
This workaround is only appropriate for local or internal builds. If you're publishing to PyPI, the description is rendered according to its declared format, so unconverted Markdown may display incorrectly; install pypandoc properly instead.
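One way to make such a fallback less fragile is to record which format was actually produced and pass it on through setuptools' `long_description_content_type` argument. The following is a minimal sketch, not PySpark's own code: `load_long_description` and its `convert` hook are hypothetical names introduced here so the fallback path can be exercised without pandoc.

```python
from pathlib import Path


def load_long_description(readme_path, convert=None):
    """Return (text, content_type) suitable for passing to setup().

    `convert` is a hypothetical hook that takes a file path and returns RST;
    when omitted, pypandoc.convert_file is used if pypandoc can be imported.
    """
    try:
        if convert is None:
            import pypandoc

            def convert(path):
                return pypandoc.convert_file(path, "rst")

        return convert(str(readme_path)), "text/x-rst"
    except (ImportError, OSError):
        # pypandoc or pandoc is missing: ship the Markdown unchanged
        text = Path(readme_path).read_text(encoding="utf-8")
        return text, "text/markdown"


# In setup.py (sketch):
# text, content_type = load_long_description("README.md")
# setup(..., long_description=text, long_description_content_type=content_type)
```

Declaring the content type keeps the metadata consistent with whichever branch ran, instead of labelling Markdown as RST.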
Troubleshooting
| Problem | Solution |
|---|---|
| pip install pypandoc succeeds but error persists | Install pandoc binary separately or use pypandoc-binary |
| pandoc --version works but Python can't find it | Check that the Python environment's PATH matches your shell's PATH |
| Version conflicts in Conda | Run conda update --all to resolve dependency conflicts |
| Error on a CI/CD pipeline | Add pip install pypandoc-binary to your build steps |
| Python version incompatibility | Check the pypandoc PyPI page for supported Python versions |
Conclusion
The "Could not import pypandoc - required to package PySpark" error occurs because PySpark's packaging process depends on pypandoc (and its underlying pandoc binary) to convert documentation formats.
The quickest fix is to install pypandoc-binary with pip install pypandoc-binary, which bundles everything you need. Alternatively, you can install pypandoc and pandoc separately using your system's package manager. After applying any fix, always run a quick verification to confirm both components are accessible before retrying your PySpark build.