How to Resolve "ModuleNotFoundError: No module named 'pyspark'" in Python

When you begin working with Apache Spark in Python, the ModuleNotFoundError: No module named 'pyspark' is often the first error you'll encounter. It occurs because pyspark, the Python API for Spark, is not a built-in library and must be installed and configured correctly within your Python environment.

This guide will walk you through the most common and recommended solutions, starting with a simple pip installation for development, and then covering more advanced setups for connecting to an existing Spark installation.

Understanding the Error: PySpark as a Library vs. an Interface

The pyspark module can be used in two ways:

  1. As a self-contained library: When you install pyspark via pip, it includes the necessary Spark components to run a local Spark session. This is perfect for development, learning, and many data science tasks.
  2. As an interface: In a production or cluster environment, you might have a full Apache Spark installation on your system. Here, pyspark acts as an interface that allows your Python code to communicate with this existing Spark installation.

The ModuleNotFoundError means your Python interpreter cannot find the pyspark module via either of these methods.
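A quick way to see which case you are in is to ask the interpreter directly whether it can locate the module. The check below is a small diagnostic sketch that uses only the standard library:

import importlib.util

# find_spec returns a ModuleSpec if 'pyspark' is importable from this
# interpreter's search path, or None if it cannot be found.
spec = importlib.util.find_spec("pyspark")
print("pyspark location:", spec.origin if spec else "not found")

If it prints "not found", use one of the solutions below.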

Solution 1: Install the pyspark Package with pip or conda

For most users, especially those getting started or working on a local machine, the simplest solution is to install pyspark directly into your environment using pip or conda.

Example of code causing the error:

# This will fail if pyspark is not installed or configured
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
print(spark)

Output:

Traceback (most recent call last):
File "main.py", line 2, in <module>
from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

To fix the error, install the pyspark package:

pip install pyspark

Installation Commands for Different Environments

  • Explicit Python 3: python3 -m pip install pyspark
  • Anaconda: conda install pyspark or conda install -c conda-forge pyspark
  • Jupyter Notebook: %pip install pyspark (the %pip magic installs into the active kernel's environment; !pip install pyspark may target a different interpreter)

After the installation, your import statement will work correctly.
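To confirm that the package is visible to the environment you intend to use, a minimal check is to import it and print its version:

# Minimal post-install check: this should print the installed PySpark version.
import pyspark

print(pyspark.__version__)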

Solution 2: Use findspark to Locate an Existing Spark Installation

If you have downloaded and extracted a full Apache Spark distribution but did not install pyspark via pip, your Python interpreter won't know where to find it. The findspark library solves this by locating your Spark installation and adding it to the module search path so that pyspark becomes importable.

Step 1: Install findspark

pip install findspark

Step 2: Initialize findspark in your code before importing pyspark

import findspark
# This line finds the Spark installation and adds it to your path
findspark.init()

# Now you can import pyspark successfully
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FindsparkTest").getOrCreate()
print("SparkSession created successfully!")
print(spark)

Output:

SparkSession created successfully!
<pyspark.sql.session.SparkSession object at 0x...>
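By default, findspark locates Spark through the SPARK_HOME environment variable. If that variable is not set, you can pass the installation path to init() explicitly; the path below is a placeholder for your own installation directory:

import findspark

# Point findspark at a specific Spark installation (placeholder path).
findspark.init("/path/to/your/spark-3.3.2-bin-hadoop3")

from pyspark.sql import SparkSession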

Solution 3: Manually Configure Environment Variables

This is the most advanced method and is equivalent to what findspark does automatically. If you have a manual Spark installation, you can tell your system where to find it by setting environment variables in your shell profile (e.g., .bashrc, .zshrc, or .bash_profile).

Solution:

  1. Identify the path to your Spark installation directory.
  2. Add the following lines to your shell profile file, replacing the path with your own.
# Set the path to your Spark installation
export SPARK_HOME="/path/to/your/spark-3.3.2-bin-hadoop3"

# Add the Spark bin directory to your PATH
export PATH="$SPARK_HOME/bin:$PATH"

# Add the PySpark libraries to your PYTHONPATH
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"

After adding these lines, you must restart your terminal or source your profile file (e.g., source ~/.zshrc) for the changes to take effect.
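If editing your shell profile is not an option (for example, inside a managed notebook), the same idea can be approximated from Python at runtime, before importing pyspark. This is a rough sketch that assumes the standard Spark distribution layout, with PySpark under python/ and the bundled py4j bridge under python/lib/; the path is a placeholder:

import glob
import os
import sys

# Placeholder path: point this at your own Spark installation directory.
spark_home = "/path/to/your/spark-3.3.2-bin-hadoop3"
os.environ["SPARK_HOME"] = spark_home

# Make the bundled PySpark sources and the py4j bridge importable.
sys.path.insert(0, os.path.join(spark_home, "python"))
py4j = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j:
    sys.path.insert(0, py4j[0])

from pyspark.sql import SparkSession  # should now resolve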

Troubleshooting: Environment and Interpreter Mismatches

If you have installed pyspark but still get the ModuleNotFoundError, the most likely cause is an environment mismatch: the package was installed into a different Python environment than the one running your code.

  • Problem: You installed pyspark using the system pip but are running your code inside a virtual environment (or vice versa).

  • Solution: Activate your virtual environment before running pip install pyspark.

  • Problem: Your IDE (like VS Code or PyCharm) is configured to use a different Python interpreter from the one you used in your terminal.

  • Solution: Configure your IDE's interpreter settings to point to the correct Python executable, especially the one inside your virtual environment.
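A quick way to diagnose both situations is to print which interpreter is actually executing your code and compare it against the one your pip command installed into:

import sys

# The interpreter running this code; it should match the environment
# where you ran 'pip install pyspark'.
print(sys.executable)
print(sys.prefix)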

Conclusion

Choose the solution that matches your setup:

  • Local development / data science: install pyspark directly with a package manager (pip install pyspark).
  • Using an existing Spark installation from a script or notebook: use the findspark library to locate and configure the path automatically.
  • System-wide configuration for a manual Spark installation: set the SPARK_HOME and PYTHONPATH environment variables.

For the vast majority of modern use cases, a simple pip install pyspark is the fastest and most effective way to resolve the ModuleNotFoundError and get started with Spark on your local machine.