Python PySpark: How to Count Distinct Values in a PySpark DataFrame

Counting distinct (unique) values is a fundamental operation in data analysis - whether you need to find the number of unique customers, deduplicate records, or validate data quality. PySpark provides multiple ways to count distinct values across an entire DataFrame or within specific columns.

In this guide, you will learn three approaches to count distinct values in PySpark: using distinct().count(), the countDistinct() function, and SQL queries. Each method is demonstrated with practical examples and clear outputs.

Method 1: Using distinct().count()

The most straightforward approach chains two DataFrame methods:

  • distinct() removes duplicate rows by comparing all columns.
  • count() returns the total number of remaining rows.

This gives you the count of unique rows across the entire DataFrame.

Example: Count All Distinct Rows

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "MCA", 80),
    ("Riya", "MBA", 85),
    ("Jiya", "B.E", 60),
    ("Maria", "B.Tech", 65),
    ("Shreya", "B.Sc", 91),
    ("Ram", "MCA", 80),  # duplicate
    ("John", "M.E", 85),
    ("Shyam", "BA", 70),
    ("Kumar", "B.Sc", 78),
    ("Maria", "B.Tech", 65),  # duplicate
]

columns = ["Name", "Course", "Marks"]
df = spark.createDataFrame(data, columns)

df.show()
print("Total records:", df.count())
print("Distinct records:", df.distinct().count())

Output:

+------+------+-----+
| Name|Course|Marks|
+------+------+-----+
| Ram| MCA| 80|
| Riya| MBA| 85|
| Jiya| B.E| 60|
| Maria|B.Tech| 65|
|Shreya| B.Sc| 91|
| Ram| MCA| 80|
| John| M.E| 85|
| Shyam| BA| 70|
| Kumar| B.Sc| 78|
| Maria|B.Tech| 65|
+------+------+-----+

Total records: 10
Distinct records: 8

The DataFrame has 10 total rows but only 8 distinct rows - the two duplicate entries for "Ram" and "Maria" are counted only once.

info

distinct() compares all columns in each row. Two rows are considered duplicates only if every column value matches. If you need to count distinct values in specific columns, use countDistinct() or dropDuplicates() with a column subset.
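This "all columns must match" semantics can be illustrated without Spark at all. As a rough mental model, distinct() behaves like collecting the rows into a set of full-row tuples (the data below is hypothetical, chosen only to show the effect):

```python
# Plain-Python model of distinct().count(): two rows collapse only
# when EVERY column value matches.
rows = [
    ("Ram", "MCA", 80),
    ("Ram", "MCA", 80),  # exact duplicate -> collapses into one
    ("Ram", "MCA", 81),  # Marks differ -> kept as a separate row
]

distinct_rows = set(rows)  # tuples hash on all of their fields
print(len(rows), len(distinct_rows))  # 3 2
```

The third row survives deduplication even though Name and Course match, because the Marks value differs — exactly how distinct() treats it.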

Method 2: Using countDistinct()

The countDistinct() function from pyspark.sql.functions is an aggregate function that returns the number of distinct values across one or more specified columns. Unlike distinct().count(), it lets you target specific columns without affecting the rest of the DataFrame.

Count Distinct Across All Columns

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "IT", 80000),
    ("Shyam", "Sales", 70000),
    ("Jiya", "Sales", 60000),
    ("Maria", "Accounts", 65000),
    ("Ramesh", "IT", 80000),
    ("John", "Management", 80000),
    ("Shyam", "Sales", 70000),  # duplicate
    ("Kumar", "Sales", 78000),
    ("Maria", "Accounts", 65000),  # duplicate
]

columns = ["Emp_Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

df.show()

# Count distinct across all three columns
df.select(
    countDistinct("Emp_Name", "Department", "Salary").alias("Distinct_Count")
).show()

Output:

+--------+----------+------+
|Emp_Name|Department|Salary|
+--------+----------+------+
| Ram| IT| 80000|
| Shyam| Sales| 70000|
| Jiya| Sales| 60000|
| Maria| Accounts| 65000|
| Ramesh| IT| 80000|
| John|Management| 80000|
| Shyam| Sales| 70000|
| Kumar| Sales| 78000|
| Maria| Accounts| 65000|
+--------+----------+------+

+--------------+
|Distinct_Count|
+--------------+
| 7|
+--------------+

The 9 input rows contain two exact duplicates (the repeated "Shyam" and "Maria" rows), so only 7 distinct combinations of the three columns remain.

Count Distinct in a Single Column

You can also count distinct values in just one column:

df.select(countDistinct("Department").alias("Distinct_Departments")).show()

Output:

+--------------------+
|Distinct_Departments|
+--------------------+
| 4|
+--------------------+

There are 4 distinct departments: IT, Sales, Accounts, and Management.
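For intuition, the single-column case is the Spark analogue of taking the set of that column's values. A plain-Python sketch using the same hypothetical department data:

```python
# Hypothetical Department column values, mirroring the DataFrame above
departments = ["IT", "Sales", "Sales", "Accounts", "IT",
               "Management", "Sales", "Sales", "Accounts"]

# countDistinct("Department") corresponds to the size of the value set
print(len(set(departments)))  # 4
```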

Count Distinct for Multiple Columns Individually

To get distinct counts for each column separately, apply countDistinct() to each column within a single select():

from pyspark.sql.functions import countDistinct

df.select(
    countDistinct("Emp_Name").alias("Distinct_Names"),
    countDistinct("Department").alias("Distinct_Depts"),
    countDistinct("Salary").alias("Distinct_Salaries"),
).show()

Output:

+--------------+--------------+-----------------+
|Distinct_Names|Distinct_Depts|Distinct_Salaries|
+--------------+--------------+-----------------+
| 7| 4| 5|
+--------------+--------------+-----------------+

tip

Use .alias() to give meaningful names to the result columns. Without it, PySpark generates verbose default names like count(DISTINCT Emp_Name, Department, Salary).

Method 3: Using SQL Queries

If you prefer SQL syntax, you can register your DataFrame as a temporary view and run standard SQL COUNT(DISTINCT ...) queries.

Setup

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct_count").getOrCreate()

data = [
    ("Ram", "IT", 44, 80000),
    ("Shyam", "Sales", 45, 70000),
    ("Jiya", "Sales", 30, 60000),
    ("Maria", "Accounts", 29, 65000),
    ("Ram", "IT", 38, 80000),
    ("John", "Management", 35, 80000),
    ("Shyam", "Sales", 45, 70000),  # duplicate
    ("Kumar", "Sales", 27, 70000),
    ("Maria", "Accounts", 32, 65000),
    ("Ria", "Management", 32, 65000),
]

columns = ["Emp_Name", "Department", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Register as a temporary SQL view
df.createOrReplaceTempView("employees")

Count All Distinct Rows

spark.sql("""
SELECT COUNT(DISTINCT Emp_Name, Department, Age, Salary) AS Distinct_Count
FROM employees
""").show()

Output:

+--------------+
|Distinct_Count|
+--------------+
| 9|
+--------------+

Only the repeated "Shyam" row matches in all four columns (the duplicated "Ram" and "Maria" names have different ages), so 10 rows reduce to 9 distinct combinations.

Count Distinct in Specific Columns

spark.sql("""
SELECT COUNT(DISTINCT Emp_Name, Salary) AS Distinct_Name_Salary
FROM employees
""").show()

Output:

+--------------------+
|Distinct_Name_Salary|
+--------------------+
| 7|
+--------------------+

Even though the two "Ram" rows differ in Age, their (Emp_Name, Salary) pairs match — as do the repeated Shyam/70000 and Maria/65000 pairs — so each repeated pair counts once, leaving 7 distinct pairs.

Count Distinct per Group

SQL makes it easy to count distinct values within groups:

spark.sql("""
SELECT Department,
COUNT(DISTINCT Emp_Name) AS Unique_Employees
FROM employees
GROUP BY Department
ORDER BY Unique_Employees DESC
""").show()

Output:

+----------+----------------+
|Department|Unique_Employees|
+----------+----------------+
|     Sales|               3|
|Management|               2|
|  Accounts|               1|
|        IT|               1|
+----------+----------------+
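The DataFrame API expresses the same query as df.groupBy("Department").agg(countDistinct("Emp_Name")). The underlying logic — one set of employee names per department — can be sketched in plain Python (the pairs below mirror the hypothetical employees view):

```python
from collections import defaultdict

# Hypothetical (Emp_Name, Department) pairs mirroring the view above
employees = [
    ("Ram", "IT"), ("Shyam", "Sales"), ("Jiya", "Sales"),
    ("Maria", "Accounts"), ("Ram", "IT"), ("John", "Management"),
    ("Shyam", "Sales"), ("Kumar", "Sales"),
    ("Maria", "Accounts"), ("Ria", "Management"),
]

# COUNT(DISTINCT Emp_Name) ... GROUP BY Department: one set per group
names_per_dept = defaultdict(set)
for name, dept in employees:
    names_per_dept[dept].add(name)

unique_counts = {dept: len(names) for dept, names in names_per_dept.items()}
print(unique_counts)
# {'IT': 1, 'Sales': 3, 'Accounts': 1, 'Management': 2}
```

Grouping into sets before counting is why repeated names within a department (the second "Ram" and "Maria" rows) do not inflate the totals.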

Common Mistake: Confusing distinct() with dropDuplicates()

Both distinct() and dropDuplicates() remove duplicate rows, but dropDuplicates() allows you to specify which columns to consider for deduplication - which distinct() does not.

Using distinct() - considers ALL columns:

# These two rows differ in Age, so distinct() keeps both
data = [("Ram", "IT", 44, 80000), ("Ram", "IT", 38, 80000)]
df = spark.createDataFrame(data, ["Name", "Dept", "Age", "Salary"])

print("distinct() count:", df.distinct().count()) # 2

Using dropDuplicates() - considers only specified columns:

# Only checking Name and Dept - treats both rows as duplicates
print("dropDuplicates count:", df.dropDuplicates(["Name", "Dept"]).count()) # 1

caution

If you want to count distinct values based on a subset of columns, use dropDuplicates(subset).count() or countDistinct() - not distinct().count(), which always considers all columns.
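In plain Python, the difference comes down to which fields form the deduplication key. A sketch of both behaviours on the two hypothetical rows above:

```python
rows = [("Ram", "IT", 44, 80000), ("Ram", "IT", 38, 80000)]

# distinct(): the key is the whole row, so the differing Age keeps both
print(len(set(rows)))  # 2

# dropDuplicates(["Name", "Dept"]): the key is only (Name, Dept),
# and the first occurrence of each key is kept
seen, kept = set(), []
for row in rows:
    key = (row[0], row[1])  # (Name, Dept)
    if key not in seen:
        seen.add(key)
        kept.append(row)
print(len(kept))  # 1
```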

Comparison of Methods

Method                        Scope              Returns                 Best For
distinct().count()            All columns        Single integer          Quick total distinct row count
countDistinct(cols)           Specified columns  DataFrame with one row  Distinct count on specific columns
SQL COUNT(DISTINCT ...)       Specified columns  DataFrame with one row  SQL-familiar workflows, grouped counts
dropDuplicates(cols).count()  Specified columns  Single integer          Subset-based distinct count

Summary

PySpark offers flexible ways to count distinct values depending on your needs:

  • Use distinct().count() for a quick count of unique rows across all columns.
  • Use countDistinct() when you need distinct counts for specific columns or want the result as a DataFrame column.
  • Use SQL queries with COUNT(DISTINCT ...) when working with temporary views or when you need GROUP BY functionality.
  • Use dropDuplicates(subset).count() when you need to deduplicate based on a subset of columns before counting.

Remember to use .alias() to give readable names to aggregated columns, and be mindful of the difference between distinct() (all columns) and dropDuplicates() (configurable columns) to avoid unexpected results.