How to Resolve "Could Not Import pypandoc - Required to Package PySpark" in Python

When packaging or distributing a PySpark application, you may run into the error "Could not import pypandoc - required to package PySpark". This error stops the build process and can be confusing, especially since pypandoc isn't something you'd normally associate with Spark.

In this guide, we'll explain why this error happens, walk through multiple solutions from simplest to most comprehensive, and help you verify that everything is working correctly.

Why Does This Error Occur?

PySpark's setup.py uses pypandoc to convert the project's README.md file from Markdown to reStructuredText (RST) format during the packaging process. This conversion is needed because PyPI historically required package descriptions in RST format.

When you run a packaging command like:

python setup.py sdist

PySpark's setup script attempts to import pypandoc. If the library isn't installed, or if pypandoc is installed but can't find the underlying pandoc binary, the process fails with:

Could not import pypandoc - required to package PySpark

Note: This error only affects packaging and distribution. If you're simply using PySpark as an end user (e.g., pip install pyspark), you typically won't encounter this issue. It primarily affects developers building PySpark from source or creating custom distributions.

Understanding the Dependency Chain

The key thing to understand is that there are two separate components that must both be present:

| Component | What It Is | How to Install |
| --- | --- | --- |
| pypandoc | A Python wrapper library | pip install pypandoc |
| pandoc | The actual document converter binary | System package manager or manual download |

pypandoc without pandoc will still raise errors. Both must be installed and accessible.
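
The two-part dependency can be checked from Python with only the standard library. The sketch below is an illustration, not part of PySpark or pypandoc, and the helper name check_pandoc_stack is hypothetical. It only looks for a pandoc executable on the current PATH, so a pandoc bundled inside a Python package would not show up in the second field.

```python
import importlib.util
import shutil


def check_pandoc_stack():
    """Report whether the Python wrapper and the pandoc binary are visible.

    Hypothetical helper for illustration; it only inspects the current
    interpreter and the current PATH.
    """
    return {
        # True if `import pypandoc` would find a module
        "pypandoc_importable": importlib.util.find_spec("pypandoc") is not None,
        # Full path to a pandoc executable on PATH, or None if absent
        "pandoc_on_path": shutil.which("pandoc"),
    }


if __name__ == "__main__":
    for name, value in check_pandoc_stack().items():
        print(f"{name}: {value}")
```

If pypandoc_importable is False, start with the Python package; if pandoc_on_path is None, the binary is the missing piece.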

Solution 1: Install pypandoc via pip

The simplest first step is to install the Python package:

pip install pypandoc

After installation, verify it's available:

import pypandoc
print(pypandoc.__version__)

However, this alone may not be enough. If pandoc isn't installed on your system, you'll see a different error when pypandoc tries to use it:

OSError: No pandoc was found: either install pandoc and add it to your PATH or install pypandoc-binary.

Solution 2: Install pypandoc-binary (Easiest Complete Fix)

The pypandoc-binary package bundles the pandoc binary together with the Python wrapper, eliminating the need for a separate system installation:

pip install pypandoc-binary

This is the most straightforward solution because it handles both dependencies in a single command, with no system-level package manager required.

Recommended Approach

If you just need to get past this error quickly, pip install pypandoc-binary is usually the fastest fix on any platform for which a prebuilt package is available.
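
Because both distributions are imported under the same name, pypandoc, it can be unclear which one an environment actually contains. The following is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is hypothetical, and the lookup may be sensitive to how the distribution name is spelled on older Python versions.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as published on PyPI
    for dist in ("pypandoc", "pypandoc-binary"):
        version = installed_version(dist)
        print(f"{dist}: {version or 'not installed'}")
```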

Solution 3: Install pandoc Separately

If you prefer to install pandoc at the system level (or if pypandoc-binary doesn't support your platform), install pandoc using your operating system's package manager.

On Ubuntu/Debian:

sudo apt-get update
sudo apt-get install pandoc

On macOS (using Homebrew):

brew install pandoc

On Windows:

Download the installer from the official Pandoc releases page and run it, or use Chocolatey:

choco install pandoc

After installation, verify that pandoc is accessible from the command line:

pandoc --version

Expected output:

pandoc 3.x.x
Compiled with pandoc-types ...
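
The same check can be scripted, which is handy in build tooling. This is a sketch under the assumption that the first line of the output has the form "pandoc <version>", as in the expected output above; the function names are hypothetical.

```python
import shutil
import subprocess


def parse_pandoc_version(version_output):
    """Extract the version string from the first line of `pandoc --version`.

    Assumes the first line looks like "pandoc 3.1.9"; returns None otherwise.
    """
    lines = version_output.splitlines()
    if not lines:
        return None
    parts = lines[0].split()
    if len(parts) >= 2 and parts[0].lower().startswith("pandoc"):
        return parts[1]
    return None


def system_pandoc_version():
    """Return the version of the pandoc found on PATH, or None if missing."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)


if __name__ == "__main__":
    print(system_pandoc_version())
```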

Solution 4: Use Conda

If you're using a Conda environment, you can install both components together:

conda install -c conda-forge pypandoc pandoc

Installing both from the same channel helps keep their versions compatible within the Conda environment.

Solution 5: Fix PATH Issues

If pypandoc is installed but still can't find pandoc, the binary might not be in your system's PATH.

On macOS/Linux, add the path to your shell configuration file:

# Add to ~/.bashrc, ~/.zshrc, or ~/.profile
# (replace /usr/local/bin with the directory that contains pandoc)
export PATH="$PATH:/usr/local/bin"

# Reload the configuration
source ~/.bashrc # or source ~/.zshrc

On Windows:

  1. Open System Properties → Advanced → Environment Variables.
  2. Under System variables, find and edit the PATH variable.
  3. Add the directory containing pandoc.exe (e.g., C:\Program Files\Pandoc).
  4. Restart your terminal.
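
When editing shell or system configuration is inconvenient, the search can also be done from inside Python before pypandoc is used. The sketch below is a minimal illustration, assuming that pypandoc honors the PYPANDOC_PANDOC environment variable as the path to the pandoc executable (its README describes this, but check the version you have installed). The helper name find_pandoc and the candidate directories are assumptions chosen for the example.

```python
import os
import shutil


def find_pandoc(extra_dirs=()):
    """Return the path to a pandoc executable, or None.

    Looks on the current PATH first, then in each directory of `extra_dirs`.
    """
    found = shutil.which("pandoc")
    if found:
        return found
    for directory in extra_dirs:
        # shutil.which accepts a custom search path via `path=`
        found = shutil.which("pandoc", path=directory)
        if found:
            return found
    return None


if __name__ == "__main__":
    # Example locations only; adjust to where pandoc actually lives.
    candidates = ["/usr/local/bin", "/opt/homebrew/bin", r"C:\Program Files\Pandoc"]
    pandoc_path = find_pandoc(candidates)
    if pandoc_path:
        # Assumption: pypandoc reads this variable when locating pandoc.
        os.environ.setdefault("PYPANDOC_PANDOC", pandoc_path)
        print(f"Using pandoc at {pandoc_path}")
    else:
        print("pandoc not found in PATH or the candidate directories")
```

Setting the variable this way affects only the current process, so it needs to run before the first pypandoc call in the same script.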

Verifying the Fix

After applying any of the solutions above, run this verification script to confirm everything works:

import sys

# Step 1: Verify pypandoc import
try:
    import pypandoc
    print(f"✅ pypandoc {pypandoc.__version__} is installed.")
except ImportError:
    print("❌ pypandoc is NOT installed.")
    sys.exit(1)

# Step 2: Verify pandoc binary is accessible
try:
    pandoc_version = pypandoc.get_pandoc_version()
    print(f"✅ pandoc {pandoc_version} is available.")
except OSError:
    print("❌ pandoc binary is NOT found.")
    sys.exit(1)

# Step 3: Test a conversion
try:
    output = pypandoc.convert_text('# Hello World', 'rst', format='md')
    print("✅ pypandoc conversion works correctly.")
    print(f" Output: {output.strip()}")
except Exception as e:
    print(f"❌ Conversion failed: {e}")

Expected output:

✅ pypandoc 1.13 is installed.
✅ pandoc 3.1.9 is available.
✅ pypandoc conversion works correctly.
Output: Hello World
===========

Alternative: Bypass pypandoc Entirely

If you're building a custom PySpark distribution and don't need the README conversion, you can modify setup.py to skip the pypandoc dependency:

try:
    import pypandoc
    long_description = pypandoc.convert_file('README.md', 'rst')
except (ImportError, OSError):
    # Fall back to the raw Markdown if pypandoc or pandoc is unavailable
    with open('README.md') as f:
        long_description = f.read()

Caution: This workaround is only appropriate for local or internal builds. If you're publishing to PyPI, the description format matters and you should install pypandoc properly.
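
A variation on the fallback above is to record which format the description ended up in, so that unconverted Markdown is labeled as Markdown. The sketch below is an illustration under assumptions, not PySpark's actual setup.py: the helper name load_long_description is hypothetical, and it relies on the long_description_content_type argument that setuptools accepts in setup().

```python
def load_long_description(readme_path="README.md"):
    """Return (text, content_type) for use in a setup() call.

    Tries to convert Markdown to RST with pypandoc; if pypandoc or pandoc is
    unavailable, falls back to the raw Markdown.
    """
    try:
        import pypandoc

        text = pypandoc.convert_file(readme_path, "rst")
        return text, "text/x-rst"
    except (ImportError, OSError):
        with open(readme_path, encoding="utf-8") as readme:
            return readme.read(), "text/markdown"
```

The two returned values could then be passed to setup() as long_description and long_description_content_type respectively.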

Troubleshooting

| Problem | Solution |
| --- | --- |
| pip install pypandoc succeeds but error persists | Install pandoc binary separately or use pypandoc-binary |
| pandoc --version works but Python can't find it | Check that the Python environment's PATH matches your shell's PATH |
| Version conflicts in Conda | Run conda update --all to resolve dependency conflicts |
| Error on a CI/CD pipeline | Add pip install pypandoc-binary to your build steps |
| Python version incompatibility | Check the pypandoc PyPI page for supported Python versions |

Conclusion

The "Could not import pypandoc - required to package PySpark" error occurs because PySpark's packaging process depends on pypandoc (and its underlying pandoc binary) to convert documentation formats.

The quickest fix is to install pypandoc-binary with pip install pypandoc-binary, which bundles everything you need. Alternatively, you can install pypandoc with pip and pandoc with your system's package manager. After applying any fix, always run a quick verification to confirm both components are accessible before retrying your PySpark build.