Python PySpark: How to Count Distinct Values in a PySpark DataFrame

Counting distinct (unique) values is a fundamental operation in data analysis - whether you need to find the number of unique customers, deduplicate records, or validate data quality. PySpark provides multiple ways to count distinct values across an entire DataFrame or within specific columns.

In this guide, you will learn three approaches to count distinct values in PySpark: using distinct().count(), the countDistinct() function, and SQL queries. Each method is demonstrated with practical examples and clear outputs.

Method 1: Using distinct().count()

The most straightforward approach chains two DataFrame methods:

  • distinct() removes duplicate rows by comparing all columns.
  • count() returns the total number of remaining rows.

This gives you the count of unique rows across the entire DataFrame.

Example: Count All Distinct Rows

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "MCA", 80),
    ("Riya", "MBA", 85),
    ("Jiya", "B.E", 60),
    ("Maria", "B.Tech", 65),
    ("Shreya", "B.Sc", 91),
    ("Ram", "MCA", 80),  # duplicate
    ("John", "M.E", 85),
    ("Shyam", "BA", 70),
    ("Kumar", "B.Sc", 78),
    ("Maria", "B.Tech", 65),  # duplicate
]

columns = ["Name", "Course", "Marks"]
df = spark.createDataFrame(data, columns)

df.show()
print("Total records:", df.count())
print("Distinct records:", df.distinct().count())

Output:

+------+------+-----+
| Name|Course|Marks|
+------+------+-----+
| Ram| MCA| 80|
| Riya| MBA| 85|
| Jiya| B.E| 60|
| Maria|B.Tech| 65|
|Shreya| B.Sc| 91|
| Ram| MCA| 80|
| John| M.E| 85|
| Shyam| BA| 70|
| Kumar| B.Sc| 78|
| Maria|B.Tech| 65|
+------+------+-----+

Total records: 10
Distinct records: 8

The DataFrame has 10 total rows but only 8 distinct rows - the two duplicate entries for "Ram" and "Maria" are counted only once.

info

distinct() compares all columns in each row. Two rows are considered duplicates only if every column value matches. If you need to count distinct values in specific columns, use countDistinct() or dropDuplicates() with a column subset.
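This "all columns must match" semantics can be illustrated without Spark at all. As a rough mental model, distinct() behaves like collecting the rows into a set of full-row tuples (the data below is hypothetical, chosen only to show the effect):

```python
# Plain-Python model of distinct().count(): two rows collapse only
# when EVERY column value matches.
rows = [
    ("Ram", "MCA", 80),
    ("Ram", "MCA", 80),  # exact duplicate -> collapses into one
    ("Ram", "MCA", 81),  # Marks differ -> kept as a separate row
]

distinct_rows = set(rows)  # tuples hash on all of their fields
print(len(rows), len(distinct_rows))  # 3 2
```

The third row survives deduplication even though Name and Course match, because the Marks value differs — exactly how distinct() treats it.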

Method 2: Using countDistinct()

The countDistinct() function from pyspark.sql.functions is an aggregate function that returns the number of distinct values across one or more specified columns. Unlike distinct().count(), it lets you target specific columns without affecting the rest of the DataFrame.

Count Distinct Across All Columns

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "IT", 80000),
    ("Shyam", "Sales", 70000),
    ("Jiya", "Sales", 60000),
    ("Maria", "Accounts", 65000),
    ("Ramesh", "IT", 80000),
    ("John", "Management", 80000),
    ("Shyam", "Sales", 70000),  # duplicate
    ("Kumar", "Sales", 78000),
    ("Maria", "Accounts", 65000),  # duplicate
]

columns = ["Emp_Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

df.show()

# Count distinct across all three columns
df.select(
    countDistinct("Emp_Name", "Department", "Salary").alias("Distinct_Count")
).show()

Output:

+--------+----------+------+
|Emp_Name|Department|Salary|
+--------+----------+------+
| Ram| IT| 80000|
| Shyam| Sales| 70000|
| Jiya| Sales| 60000|
| Maria| Accounts| 65000|
| Ramesh| IT| 80000|
| John|Management| 80000|
| Shyam| Sales| 70000|
| Kumar| Sales| 78000|
| Maria| Accounts| 65000|
+--------+----------+------+

+--------------+
|Distinct_Count|
+--------------+
| 7|
+--------------+

The 9 input rows contain two exact duplicates (the repeated "Shyam" and "Maria" rows), so only 7 distinct combinations of the three columns remain.

Count Distinct in a Single Column

You can also count distinct values in just one column:

df.select(countDistinct("Department").alias("Distinct_Departments")).show()

Output:

+--------------------+
|Distinct_Departments|
+--------------------+
| 4|
+--------------------+

There are 4 distinct departments: IT, Sales, Accounts, and Management.
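For intuition, the single-column case is the Spark analogue of taking the set of that column's values. A plain-Python sketch using the same hypothetical department data:

```python
# Hypothetical Department column values, mirroring the DataFrame above
departments = ["IT", "Sales", "Sales", "Accounts", "IT",
               "Management", "Sales", "Sales", "Accounts"]

# countDistinct("Department") corresponds to the size of the value set
print(len(set(departments)))  # 4
```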

Count Distinct for Multiple Columns Individually

To get distinct counts for each column separately, apply countDistinct() to each column within a single select():

from pyspark.sql.functions import countDistinct

df.select(
    countDistinct("Emp_Name").alias("Distinct_Names"),
    countDistinct("Department").alias("Distinct_Depts"),
    countDistinct("Salary").alias("Distinct_Salaries"),
).show()

Output:

+--------------+--------------+-----------------+
|Distinct_Names|Distinct_Depts|Distinct_Salaries|
+--------------+--------------+-----------------+
| 7| 4| 5|
+--------------+--------------+-----------------+

tip

Use .alias() to give meaningful names to the result columns. Without it, PySpark generates verbose default names like count(DISTINCT Emp_Name, Department, Salary).

Method 3: Using SQL Queries

If you prefer SQL syntax, you can register your DataFrame as a temporary view and run standard SQL COUNT(DISTINCT ...) queries.

Setup

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "IT", 44, 80000),
    ("Shyam", "Sales", 45, 70000),
    ("Jiya", "Sales", 30, 60000),
    ("Maria", "Accounts", 29, 65000),
    ("Ram", "IT", 38, 80000),
    ("John", "Management", 35, 80000),
    ("Shyam", "Sales", 45, 70000),  # duplicate
    ("Kumar", "Sales", 27, 70000),
    ("Maria", "Accounts", 32, 65000),
    ("Ria", "Management", 32, 65000),
]

columns = ["Emp_Name", "Department", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Register as a temporary SQL view
df.createOrReplaceTempView("employees")

Count All Distinct Rows

spark.sql("""
SELECT COUNT(DISTINCT Emp_Name, Department, Age, Salary) AS Distinct_Count
FROM employees
""").show()

Output:

+--------------+
|Distinct_Count|
+--------------+
| 9|
+--------------+

Only the repeated "Shyam" row matches in all four columns (the duplicated "Ram" and "Maria" names have different ages), so 10 rows reduce to 9 distinct combinations.

Count Distinct in Specific Columns

spark.sql("""
SELECT COUNT(DISTINCT Emp_Name, Salary) AS Distinct_Name_Salary
FROM employees
""").show()

Output:

+--------------------+
|Distinct_Name_Salary|
+--------------------+
| 7|
+--------------------+

Even though the two "Ram" rows differ in Age, their (Emp_Name, Salary) pairs match — as do the repeated Shyam/70000 and Maria/65000 pairs — so each repeated pair counts once, leaving 7 distinct pairs.

Count Distinct per Group

SQL makes it easy to count distinct values within groups:

spark.sql("""
SELECT Department,
COUNT(DISTINCT Emp_Name) AS Unique_Employees
FROM employees
GROUP BY Department
ORDER BY Unique_Employees DESC
""").show()

Output:

+----------+----------------+
|Department|Unique_Employees|
+----------+----------------+
|     Sales|               3|
|Management|               2|
|  Accounts|               1|
|        IT|               1|
+----------+----------------+
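The DataFrame API expresses the same query as df.groupBy("Department").agg(countDistinct("Emp_Name")). The underlying logic — one set of employee names per department — can be sketched in plain Python (the pairs below mirror the hypothetical employees view):

```python
from collections import defaultdict

# Hypothetical (Emp_Name, Department) pairs mirroring the view above
employees = [
    ("Ram", "IT"), ("Shyam", "Sales"), ("Jiya", "Sales"),
    ("Maria", "Accounts"), ("Ram", "IT"), ("John", "Management"),
    ("Shyam", "Sales"), ("Kumar", "Sales"),
    ("Maria", "Accounts"), ("Ria", "Management"),
]

# COUNT(DISTINCT Emp_Name) ... GROUP BY Department: one set per group
names_per_dept = defaultdict(set)
for name, dept in employees:
    names_per_dept[dept].add(name)

unique_counts = {dept: len(names) for dept, names in names_per_dept.items()}
print(unique_counts)
# {'IT': 1, 'Sales': 3, 'Accounts': 1, 'Management': 2}
```

Grouping into sets before counting is why repeated names within a department (the second "Ram" and "Maria" rows) do not inflate the totals.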

Common Mistake: Confusing distinct() with dropDuplicates()

Both distinct() and dropDuplicates() remove duplicate rows, but dropDuplicates() allows you to specify which columns to consider for deduplication - which distinct() does not.

Using distinct() - considers ALL columns:

# These two rows differ in Age, so distinct() keeps both
data = [("Ram", "IT", 44, 80000), ("Ram", "IT", 38, 80000)]
df = spark.createDataFrame(data, ["Name", "Dept", "Age", "Salary"])

print("distinct() count:", df.distinct().count()) # 2

Using dropDuplicates() - considers only specified columns:

# Only checking Name and Dept - treats both rows as duplicates
print("dropDuplicates count:", df.dropDuplicates(["Name", "Dept"]).count()) # 1

caution

If you want to count distinct values based on a subset of columns, use dropDuplicates(subset).count() or countDistinct() - not distinct().count(), which always considers all columns.
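In plain Python, the difference comes down to which fields form the deduplication key. A sketch of both behaviours on the two hypothetical rows above:

```python
rows = [("Ram", "IT", 44, 80000), ("Ram", "IT", 38, 80000)]

# distinct(): the key is the whole row, so the differing Age keeps both
print(len(set(rows)))  # 2

# dropDuplicates(["Name", "Dept"]): the key is only (Name, Dept),
# and the first occurrence of each key is kept
seen, kept = set(), []
for row in rows:
    key = (row[0], row[1])  # (Name, Dept)
    if key not in seen:
        seen.add(key)
        kept.append(row)
print(len(kept))  # 1
```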

Comparison of Methods

Method                        Scope              Returns                 Best For
distinct().count()            All columns        Single integer          Quick total distinct row count
countDistinct(cols)           Specified columns  DataFrame with one row  Distinct count on specific columns
SQL COUNT(DISTINCT ...)       Specified columns  DataFrame with one row  SQL-familiar workflows, grouped counts
dropDuplicates(cols).count()  Specified columns  Single integer          Subset-based distinct count

Summary

PySpark offers flexible ways to count distinct values depending on your needs:

  • Use distinct().count() for a quick count of unique rows across all columns.
  • Use countDistinct() when you need distinct counts for specific columns or want the result as a DataFrame column.
  • Use SQL queries with COUNT(DISTINCT ...) when working with temporary views or when you need GROUP BY functionality.
  • Use dropDuplicates(subset).count() when you need to deduplicate based on a subset of columns before counting.

Remember to use .alias() to give readable names to aggregated columns, and be mindful of the difference between distinct() (all columns) and dropDuplicates() (configurable columns) to avoid unexpected results.