How to Resolve "ModuleNotFoundError: No module named 'pyspark'" in Python
When you begin working with Apache Spark in Python, the ModuleNotFoundError: No module named 'pyspark' is often the first error you'll encounter. It occurs because pyspark, the Python API for Spark, is not a built-in library and must be installed and configured correctly within your Python environment.
This guide will walk you through the most common and recommended solutions, starting with a simple pip installation for development, and then covering more advanced setups for connecting to an existing Spark installation.
Understanding the Error: PySpark as a Library vs. an Interface
The pyspark module can be used in two ways:
- As a self-contained library: When you install pyspark via pip, it includes the necessary Spark components to run a local Spark session. This is perfect for development, learning, and many data science tasks.
- As an interface: In a production or cluster environment, you might have a full Apache Spark installation on your system. Here, pyspark acts as an interface that allows your Python code to communicate with this existing Spark installation.
The ModuleNotFoundError means your Python interpreter cannot find the pyspark module via either of these methods.
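A quick way to see which interpreter you are running and whether it can find pyspark is a short diagnostic like the sketch below (the paths printed will differ on your machine):
import sys
import importlib.util

# Show which Python interpreter is running this script
print(sys.executable)

# find_spec returns None when the module is not visible to this interpreter
if importlib.util.find_spec("pyspark") is None:
    print("pyspark is not installed in this environment")
else:
    print("pyspark is available")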
Solution 1: Install pyspark with a Package Manager (Recommended)
For most users, especially those getting started or working on a local machine, the simplest solution is to install pyspark directly into your environment using pip or conda.
Example of code causing the error:
# This will fail if pyspark is not installed or configured
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
print(spark)
Output:
Traceback (most recent call last):
File "main.py", line 2, in <module>
from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
Solution: install the pyspark package.
pip install pyspark
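To verify that the package landed in the environment you intend to use, import it and print its version (the exact number depends on the release pip resolved):
import pyspark

# A successful import plus a version number confirms the installation worked
print(pyspark.__version__)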
Installation Commands for Different Environments
- Explicit Python 3: python3 -m pip install pyspark
- Anaconda: conda install pyspark or conda install -c conda-forge pyspark
- Jupyter Notebook: !pip install pyspark (see the note after this list about installing into the kernel's own interpreter)
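Notebooks sometimes run a different interpreter than the pip on your PATH, so !pip can install into the wrong environment. A common, generic workaround (not specific to pyspark) is to install through the kernel's own interpreter; a minimal sketch:
import subprocess
import sys

# Install pyspark using the exact interpreter the notebook kernel is running,
# which avoids accidentally installing into a different Python environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspark"])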
After the installation, your import statement will work correctly.
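With pyspark installed, the original snippet runs; you can also print the bundled Spark version and shut the session down cleanly (the version shown depends on the pyspark release you installed):
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session backed by the Spark runtime bundled with pip's pyspark
spark = SparkSession.builder.appName("MyApp").getOrCreate()
print(spark.version)  # e.g. 3.x, matching the installed pyspark release
spark.stop()          # stop the session and release the local JVM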