How to Resolve "ModuleNotFoundError: No module named 'pyspark'" in Python
When you begin working with Apache Spark in Python, the ModuleNotFoundError: No module named 'pyspark' is often the first error you'll encounter. It occurs because pyspark, the Python API for Spark, is not a built-in library and must be installed and configured correctly within your Python environment.
This guide will walk you through the most common and recommended solutions, starting with a simple pip installation for development, and then covering more advanced setups for connecting to an existing Spark installation.
Understanding the Error: PySpark as a Library vs. an Interface
The pyspark module can be used in two ways:
- As a self-contained library: When you install pyspark via pip, it includes the necessary Spark components to run a local Spark session. This is perfect for development, learning, and many data science tasks.
- As an interface: In a production or cluster environment, you might have a full Apache Spark installation on your system. Here, pyspark acts as an interface that allows your Python code to communicate with this existing Spark installation.
The ModuleNotFoundError means your Python interpreter cannot find the pyspark module via either of these methods.
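A quick way to see which case applies to you is to ask the interpreter whether it can locate the module at all. The following is a minimal diagnostic sketch using only the standard library; it prints where pyspark would be loaded from, or None if the interpreter cannot find it.
import importlib.util
# Returns a module spec if pyspark is importable, otherwise None
print(importlib.util.find_spec("pyspark"))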
Solution 1: Install pyspark with a Package Manager (Recommended)
For most users, especially those getting started or working on a local machine, the simplest solution is to install pyspark directly into your environment using pip or conda.
Example of code causing the error:
# This will fail if pyspark is not installed or configured
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
print(spark)
Output:
Traceback (most recent call last):
File "main.py", line 2, in <module>
from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
Solution: install the pyspark package.
pip install pyspark
Installation Commands for Different Environments
- Explicit Python 3: python3 -m pip install pyspark
- Anaconda: conda install pyspark or conda install -c conda-forge pyspark
- Jupyter Notebook: !pip install pyspark
After the installation, your import statement will work correctly.
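A quick way to confirm the installation succeeded is to print the installed version from the same interpreter (the exact version string depends on what pip installed):
import pyspark
# Prints the installed PySpark version, e.g. a 3.x release
print(pyspark.__version__)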
Solution 2: Use findspark to Locate an Existing Spark Installation
If you have downloaded and unzipped a full Apache Spark distribution but did not install pyspark via pip, your Python interpreter won't know where to find it. The findspark library solves this problem by locating your Spark installation and making it importable.
Step 1: Install findspark
pip install findspark
Step 2: Initialize findspark in your code before importing pyspark
import findspark
# This line finds the Spark installation and adds it to your path
findspark.init()
# Now you can import pyspark successfully
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FindsparkTest").getOrCreate()
print("SparkSession created successfully!")
print(spark)
Output:
SparkSession created successfully!
<pyspark.sql.session.SparkSession object at 0x...>
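If findspark cannot detect Spark automatically (for example, because SPARK_HOME is not set), findspark.init() also accepts the installation path explicitly. The path below is the same placeholder used in Solution 3; replace it with your own.
import findspark
# Point findspark at a specific Spark installation (placeholder path)
findspark.init("/path/to/your/spark-3.3.2-bin-hadoop3")
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ExplicitPathTest").getOrCreate()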
Solution 3: Manually Configure Environment Variables
This is the most advanced method and is equivalent to what findspark does automatically. If you have a manual Spark installation, you can tell your system where to find it by setting environment variables in your shell profile (e.g., .bashrc, .zshrc, or .bash_profile).
Solution:
- Identify the path to your Spark installation directory.
- Add the following lines to your shell profile file, replacing the path with your own.
# Set the path to your Spark installation
export SPARK_HOME="/path/to/your/spark-3.3.2-bin-hadoop3"
# Add the Spark bin directory to your PATH
export PATH="$SPARK_HOME/bin:$PATH"
# Add the PySpark libraries to your PYTHONPATH
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
After adding these lines, you must restart your terminal or source your profile file (e.g., source ~/.zshrc) for the changes to take effect.
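To verify that Python now sees the manually configured installation, you can check the variables and attempt the import from a fresh session. This is a small sanity check, not a full test:
import os
# Confirm the environment variables are visible to Python
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHONPATH"))
# This import should now succeed if the paths are correct
import pyspark
print(pyspark.__version__)
Note that, depending on your Spark distribution, the bundled pyspark also relies on py4j, which ships as a zip archive under $SPARK_HOME/python/lib; if the import still fails, adding that zip to PYTHONPATH (or installing py4j with pip) usually resolves it.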
Troubleshooting: Environment and Interpreter Mismatches
If you have installed pyspark but still see the ModuleNotFoundError, the most likely cause is an environment or interpreter mismatch.
- Problem: You installed pyspark using the system pip but are running your code inside a virtual environment (or vice versa).
- Solution: Activate your virtual environment before running pip install pyspark.
- Problem: Your IDE (like VS Code or PyCharm) is configured to use a different Python interpreter from the one you used in your terminal.
- Solution: Configure your IDE's interpreter settings to point to the correct Python executable, especially the one inside your virtual environment (a quick way to check which interpreter is running your code is shown below).
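To see which interpreter is actually running your code, print it from inside the script and compare it with the interpreter selected in your IDE or terminal. A minimal check:
import sys
# The exact Python executable running this script
print("Interpreter:", sys.executable)
# The environment root; for a virtual environment this points inside it
print("Environment:", sys.prefix)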
Conclusion
| Your Setup | Recommended Solution |
|---|---|
| Local Development / Data Science | Install pyspark directly with a package manager: pip install pyspark. |
| Using an existing Spark installation from a script/notebook | Use the findspark library to automatically locate and configure the path. |
| System-wide configuration for a manual Spark installation | Manually set the SPARK_HOME and PYTHONPATH environment variables. |
For the vast majority of modern use cases, a simple pip install pyspark is the fastest and most effective way to resolve the ModuleNotFoundError and get started with Spark on your local machine.