Correlated Subqueries
In the previous guide on subqueries, you learned how a nested SELECT can compute a value or a list of values that the outer query uses for filtering. Those subqueries were self-contained: they could run on their own, produce a result, and the outer query would simply consume that result. A correlated subquery works differently. It reaches into the outer query, references its columns, and must be re-evaluated for every single row the outer query processes.
This distinction makes the SQL correlated subquery both more powerful and more expensive than a regular subquery. It can answer row-specific questions that a standard subquery cannot, but it requires a clear understanding of how the database evaluates it. This guide breaks down the concept, walks through practical examples with outputs, covers the essential EXISTS and NOT EXISTS operators, and discusses performance so you can use correlated subqueries with confidence.
Regular Subquery vs Correlated Subquery
Before diving into correlated subqueries, let's clearly define the difference.
Regular (Non-Correlated) Subquery
A regular subquery is independent of the outer query. It runs once, produces a result, and that result is reused for every row in the outer query.
-- Regular subquery: runs once, returns a single value (the average)
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
The inner query SELECT AVG(price) FROM products has no reference to the outer query. You could copy it, run it separately, and it would work perfectly on its own.
Correlated Subquery
A correlated subquery references a column from the outer query. Because of that reference, the subquery cannot run on its own. It depends on data from the current row being processed by the outer query, which means the database must execute the subquery once for every row in the outer query's result set.
-- Correlated subquery: references outer.category_id
SELECT p.product_name, p.price
FROM products AS p
WHERE p.price > (
SELECT AVG(p2.price)
FROM products AS p2
WHERE p2.category_id = p.category_id -- ← references outer query
);
Notice p.category_id in the subquery's WHERE clause. That p comes from the outer query. The subquery asks: "What is the average price for this specific product's category?" The answer changes depending on which row the outer query is currently examining.
The key indicator of a correlated subquery is that the inner query references a table alias or column from the outer query. If you removed the outer query, the subquery would fail because it would not know what p.category_id means.
The Sample Schema
We will use this schema for all examples:
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
salary DECIMAL(10,2),
hire_date DATE
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
city VARCHAR(50)
);
-- departments
INSERT INTO departments VALUES (1, 'Engineering');
INSERT INTO departments VALUES (2, 'Marketing');
INSERT INTO departments VALUES (3, 'Sales');
-- employees
INSERT INTO employees VALUES (101, 'Alice', 1, 95000, '2020-03-15');
INSERT INTO employees VALUES (102, 'Bob', 1, 82000, '2021-07-01');
INSERT INTO employees VALUES (103, 'Carol', 1, 78000, '2022-01-10');
INSERT INTO employees VALUES (104, 'Dave', 2, 68000, '2019-11-20');
INSERT INTO employees VALUES (105, 'Eve', 2, 72000, '2021-05-05');
INSERT INTO employees VALUES (106, 'Frank', 3, 61000, '2023-02-14');
INSERT INTO employees VALUES (107, 'Grace', 3, 59000, '2022-08-30');
INSERT INTO employees VALUES (108, 'Hank', 3, 63000, '2020-06-01');
-- customers
INSERT INTO customers VALUES (1, 'Acme Corp', 'New York');
INSERT INTO customers VALUES (2, 'Beta Inc', 'Chicago');
INSERT INTO customers VALUES (3, 'Gamma LLC', 'New York');
INSERT INTO customers VALUES (4, 'Delta Partners', 'Denver');
-- orders
INSERT INTO orders VALUES (1001, 1, '2024-01-15', 500.00);
INSERT INTO orders VALUES (1002, 1, '2024-03-22', 1200.00);
INSERT INTO orders VALUES (1003, 2, '2024-02-10', 300.00);
INSERT INTO orders VALUES (1004, 3, '2024-04-05', 850.00);
INSERT INTO orders VALUES (1005, 1, '2024-05-18', 2100.00);
How Row-by-Row Evaluation Works
Understanding the execution model of a correlated subquery is essential. Let's trace through a concrete example step by step.
Goal: Find employees who earn more than the average salary in their own department.
SELECT
e.employee_name,
e.department_id,
e.salary
FROM employees AS e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
);
Here is how the database processes this logically:
Step 1: The outer query picks the first row: Alice (department 1, salary 95000).
The subquery becomes:
SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department_id = 1
-- Result: (95000 + 82000 + 78000) / 3 = 85000
Is 95000 > 85000? Yes. Alice is included.
Step 2: The outer query picks the second row: Bob (department 1, salary 82000).
The subquery runs again with the same department but we are checking a different employee:
SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department_id = 1
-- Result: 85000
Is 82000 > 85000? No. Bob is excluded.
Step 3: Carol (department 1, salary 78000). Average is still 85000. 78000 > 85000? No. Excluded.
Step 4: Dave (department 2, salary 68000).
The subquery now uses department 2:
SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department_id = 2
-- Result: (68000 + 72000) / 2 = 70000
Is 68000 > 70000? No. Excluded.
Step 5: Eve (department 2, salary 72000). Average is 70000. 72000 > 70000? Yes. Included.
Step 6: Frank (department 3, salary 61000).
SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department_id = 3
-- Result: (61000 + 59000 + 63000) / 3 = 61000
Is 61000 > 61000? No (not strictly greater). Excluded.
Step 7: Grace (department 3, salary 59000). 59000 > 61000? No. Excluded.
Step 8: Hank (department 3, salary 63000). 63000 > 61000? Yes. Included.
Final Output:
| employee_name | department_id | salary |
|---|---|---|
| Alice | 1 | 95000 |
| Eve | 2 | 72000 |
| Hank | 3 | 63000 |
The subquery executed 8 times, once per employee row. For departments with the same ID, the subquery returned the same average, but it was logically re-evaluated each time.
A regular subquery could only compute a single global average. The correlated subquery computes a per-group average without needing a GROUP BY in the outer query. This is its superpower.
More Correlated Subquery Examples
Finding Each Customer's Most Recent Order
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.total_amount
FROM orders AS o
WHERE o.order_date = (
SELECT MAX(o2.order_date)
FROM orders AS o2
WHERE o2.customer_id = o.customer_id
);
For each order row, the subquery finds the maximum order date for that specific customer. Only orders matching the maximum date survive the filter.
Output:
| order_id | customer_id | order_date | total_amount |
|---|---|---|---|
| 1003 | 2 | 2024-02-10 | 300.00 |
| 1004 | 3 | 2024-04-05 | 850.00 |
| 1005 | 1 | 2024-05-18 | 2100.00 |
Customer 1 (Acme Corp) has three orders, but only the latest one (2024-05-18) appears.
Finding Employees Hired Before Anyone in Their Department
SELECT
e.employee_name,
e.department_id,
e.hire_date
FROM employees AS e
WHERE e.hire_date = (
SELECT MIN(e2.hire_date)
FROM employees AS e2
WHERE e2.department_id = e.department_id
);
Output:
| employee_name | department_id | hire_date |
|---|---|---|
| Alice | 1 | 2020-03-15 |
| Dave | 2 | 2019-11-20 |
| Hank | 3 | 2020-06-01 |
Each row represents the longest-tenured employee in their respective department.
Using a Correlated Subquery in the SELECT Clause
Correlated subqueries also work in the SELECT list to compute a per-row value:
SELECT
e.employee_name,
e.salary,
e.department_id,
(
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
) AS dept_avg_salary,
e.salary - (
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
) AS diff_from_dept_avg
FROM employees AS e
ORDER BY e.department_id, e.salary DESC;
Output:
| employee_name | salary | department_id | dept_avg_salary | diff_from_dept_avg |
|---|---|---|---|---|
| Alice | 95000 | 1 | 85000.00 | 10000.00 |
| Bob | 82000 | 1 | 85000.00 | -3000.00 |
| Carol | 78000 | 1 | 85000.00 | -7000.00 |
| Eve | 72000 | 2 | 70000.00 | 2000.00 |
| Dave | 68000 | 2 | 70000.00 | -2000.00 |
| Hank | 63000 | 3 | 61000.00 | 2000.00 |
| Frank | 61000 | 3 | 61000.00 | 0.00 |
| Grace | 59000 | 3 | 61000.00 | -2000.00 |
Every row shows its department's average alongside the individual salary and the difference. This would be difficult to achieve with a regular subquery because the average changes based on the current row's department.
EXISTS and NOT EXISTS
The EXISTS and NOT EXISTS operators are specifically designed to work with correlated subqueries. They do not return data values. They return TRUE or FALSE based on whether the subquery produces any rows at all.
How EXISTS Works
SELECT columns
FROM outer_table AS o
WHERE EXISTS (
SELECT 1
FROM inner_table AS i
WHERE i.some_column = o.some_column
);
For each row in the outer query, the database runs the correlated subquery. If the subquery returns at least one row, EXISTS evaluates to TRUE and the outer row is included. If the subquery returns zero rows, EXISTS evaluates to FALSE and the outer row is skipped.
Inside an EXISTS subquery, the SELECT list does not matter. You can write SELECT 1, SELECT *, or SELECT 'hello'. The database only cares whether any rows are returned, not what those rows contain. SELECT 1 is a common convention that signals: "I only care about existence, not values."
Example: Customers Who Have Placed Orders
SELECT
c.customer_name,
c.city
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
Output:
| customer_name | city |
|---|---|
| Acme Corp | New York |
| Beta Inc | Chicago |
| Gamma LLC | New York |
Delta Partners (customer 4) has no orders, so the EXISTS subquery returns no rows for that customer, and it is excluded.
EXISTS vs IN: A Comparison
You could achieve the same result with IN:
SELECT customer_name, city
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
);
Both queries return the same result. So when should you choose one over the other?
| Aspect | IN | EXISTS |
|---|---|---|
| Readability for simple cases | Slightly cleaner | Slightly more verbose |
| Handles NULLs safely | NOT IN breaks with NULLs | NOT EXISTS handles NULLs correctly |
| Performance on large inner results | Can be slower (materializes full list) | Often faster (stops at first match) |
| Correlated by nature | Not necessarily | Always correlated |
NOT EXISTS: Finding Absent Relationships
NOT EXISTS is the safe and recommended way to find rows with no matching rows in another table.
Find customers who have never placed an order:
SELECT
c.customer_name,
c.city
FROM customers AS c
WHERE NOT EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
Output:
| customer_name | city |
|---|---|
| Delta Partners | Denver |
For each customer, the subquery looks for at least one order. Delta Partners has none, so NOT EXISTS returns TRUE and that row is included.
Why NOT EXISTS Is Safer Than NOT IN
This is an important point worth emphasizing with a concrete example. Suppose someone inserts an order with a NULL customer_id:
INSERT INTO orders VALUES (1006, NULL, '2024-06-01', 150.00);
Now compare:
NOT IN (broken):
SELECT customer_name
FROM customers
WHERE customer_id NOT IN (
SELECT customer_id
FROM orders
);
Result: Returns zero rows, even though Delta Partners still has no orders. The NULL in the subquery result causes every comparison to return UNKNOWN, which means no row can satisfy the NOT IN condition.
NOT EXISTS (correct):
SELECT customer_name
FROM customers AS c
WHERE NOT EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
Result:
| customer_name |
|---|
| Delta Partners |
NOT EXISTS is unaffected by NULL values because it only checks whether matching rows exist. It does not compare values in a way that NULL can corrupt.
Always prefer NOT EXISTS over NOT IN when checking for absent relationships. NOT IN is a ticking time bomb that will silently return wrong results the moment a NULL value appears in the subquery column.
EXISTS with Additional Conditions
EXISTS subqueries can include any valid WHERE conditions, making them very flexible:
Find customers who have placed an order over $1000:
SELECT
c.customer_name,
c.city
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 1000
);
| customer_name | city |
|---|---|
| Acme Corp | New York |
Only Acme Corp has orders exceeding $1000 (the $1200 and $2100 orders).
Find departments that have at least one employee earning above $80000:
SELECT
d.department_name
FROM departments AS d
WHERE EXISTS (
SELECT 1
FROM employees AS e
WHERE e.department_id = d.department_id
AND e.salary > 80000
);
| department_name |
|---|
| Engineering |
Only Engineering has employees above that salary threshold (Alice at 95000 and Bob at 82000).
Correlated Subqueries for Row-Level Comparisons
One of the most practical uses of correlated subqueries is comparing each row against a computed value for its group. Here are patterns you will use frequently.
Pattern: Rows Above Their Group Average
SELECT
e.employee_name,
d.department_name,
e.salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.department_id
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
)
ORDER BY d.department_name, e.salary DESC;
| employee_name | department_name | salary |
|---|---|---|
| Alice | Engineering | 95000 |
| Eve | Marketing | 72000 |
| Hank | Sales | 63000 |
Pattern: Rows with the Maximum Value in Their Group
SELECT
e.employee_name,
d.department_name,
e.salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.department_id
WHERE e.salary = (
SELECT MAX(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
)
ORDER BY d.department_name;
| employee_name | department_name | salary |
|---|---|---|
| Alice | Engineering | 95000 |
| Eve | Marketing | 72000 |
| Hank | Sales | 63000 |
Pattern: Counting Related Rows per Row
SELECT
c.customer_name,
c.city,
(
SELECT COUNT(*)
FROM orders AS o
WHERE o.customer_id = c.customer_id
) AS order_count
FROM customers AS c
ORDER BY order_count DESC;
| customer_name | city | order_count |
|---|---|---|
| Acme Corp | New York | 3 |
| Beta Inc | Chicago | 1 |
| Gamma LLC | New York | 1 |
| Delta Partners | Denver | 0 |
This pattern is useful when you want a count alongside other columns without collapsing rows with GROUP BY.
Performance Considerations
Correlated subqueries have a reputation for being slow, and the concern is valid. Since the subquery is logically re-executed for every row in the outer query, the total work can grow quickly.
The Conceptual Cost
If the outer query processes N rows and the subquery scans M rows each time, the total work is approximately N x M operations. For small tables this is negligible. For large tables, it can become a bottleneck.
| Outer rows (N) | Inner rows per execution (M) | Total operations (N x M) |
|---|---|---|
| 100 | 100 | 10,000 |
| 10,000 | 10,000 | 100,000,000 |
| 1,000,000 | 1,000,000 | 1,000,000,000,000 |
Indexes Are Critical
The most important performance optimization for correlated subqueries is to ensure that the column referenced in the subquery's WHERE clause is indexed.
-- The subquery filters on department_id
WHERE e2.department_id = e.department_id
-- This index makes the subquery fast
CREATE INDEX idx_employees_dept ON employees(department_id);
Without this index, every execution of the subquery performs a full table scan. With the index, each execution becomes a quick lookup.
Always run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) on queries containing correlated subqueries. Look for sequential scans inside loops as a red flag. Adding an index on the correlated column usually solves the problem.
EXPLAIN ANALYZE
SELECT e.employee_name, e.salary
FROM employees AS e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
);
Modern Optimizer Tricks
Modern databases (PostgreSQL, MySQL 8+, SQL Server, Oracle) do not always execute correlated subqueries row by row. The query optimizer often rewrites them internally as joins or semi-joins. For example, an EXISTS subquery is frequently transformed into a semi-join that executes much more efficiently than a naive row-by-row loop.
This means that a correlated subquery might perform identically to a hand-written join in many cases. However, you should not rely on this. Always verify with EXPLAIN.
When to Rewrite as a Join
If a correlated subquery is causing performance issues, you can often rewrite it using a JOIN and GROUP BY:
Correlated subquery version (can be slow on large tables):
SELECT
e.employee_name,
e.salary,
e.department_id
FROM employees AS e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id
);
Join version (often faster):
SELECT
e.employee_name,
e.salary,
e.department_id
FROM employees AS e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) AS dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_salary;
The join version computes the department averages once in the derived table, then joins them to the employees. No repeated execution. Both produce the same result, but the join version has a predictable O(N) execution cost.
Alternative using a CTE:
WITH dept_avg AS (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
)
SELECT
e.employee_name,
e.salary,
e.department_id
FROM employees AS e
JOIN dept_avg AS da ON e.department_id = da.department_id
WHERE e.salary > da.avg_salary;
The join-based rewrite is not always possible. Some correlated subquery patterns, especially those using EXISTS or NOT EXISTS with complex conditions, are best left as correlated subqueries and are often optimized well by the database engine.
Common Mistakes with Correlated Subqueries
Mistake 1: Using the Same Alias in Inner and Outer Queries
If you reuse the same table alias, the subquery references itself instead of the outer query, and you get wrong results without any error.
Wrong:
SELECT e.employee_name, e.salary
FROM employees AS e
WHERE e.salary > (
SELECT AVG(e.salary) -- "e" refers to inner scope, not outer!
FROM employees AS e -- Same alias as outer query
WHERE e.department_id = e.department_id -- Always true (self-reference)
);
This subquery compares e.department_id to itself, which is always true, so it computes the global average for every row. It does not compute per-department averages.
Correct:
SELECT e.employee_name, e.salary
FROM employees AS e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees AS e2 -- Different alias
WHERE e2.department_id = e.department_id -- Cross-reference to outer
);
Always use distinct aliases for the inner and outer instances of the same table. A common convention is to add a number suffix: e for the outer query, e2 for the inner query.
Mistake 2: Forgetting That Correlated Subqueries Cannot Run Independently
When debugging, you might try to copy the subquery and run it alone. It will fail because it references an alias from the outer query that no longer exists.
-- This will fail on its own:
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = e.department_id;
-- ERROR: "e" is not defined
To test a correlated subquery independently, you must substitute the outer reference with a literal value:
-- Test with department_id = 1
SELECT AVG(e2.salary)
FROM employees AS e2
WHERE e2.department_id = 1;
-- Result: 85000
Mistake 3: Using a Correlated Subquery When a Simple Subquery Suffices
Not every subquery needs to be correlated. If the inner query does not depend on the outer row, making it correlated adds unnecessary complexity and potentially hurts performance.
Unnecessarily correlated:
SELECT product_name, price
FROM products AS p
WHERE p.price > (
SELECT AVG(p2.price)
FROM products AS p2
WHERE 1 = 1 -- No actual reference to p
);
Simpler as a regular subquery:
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
Both return the same result, but the second version is clearer and only executes the subquery once.
Quick Reference: Correlated Subquery Patterns
| Pattern | Template |
|---|---|
| Row above group average | WHERE col > (SELECT AVG(col) FROM t2 WHERE t2.group = t1.group) |
| Row with group maximum | WHERE col = (SELECT MAX(col) FROM t2 WHERE t2.group = t1.group) |
| Row has related records | WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.fk = t1.pk) |
| Row has no related records | WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.fk = t1.pk) |
| Count of related records | SELECT (SELECT COUNT(*) FROM t2 WHERE t2.fk = t1.pk) AS cnt |
Summary
A SQL correlated subquery is a subquery that references columns from the outer query, making it dependent on the outer query's current row. Unlike regular subqueries that execute once, correlated subqueries are logically evaluated once per row in the outer query's result set.
Key takeaways:
- A correlated subquery is identified by the presence of an outer table alias inside the inner query. If you remove the outer query, the subquery cannot run on its own.
- Row-by-row evaluation is the conceptual execution model. For each row the outer query processes, the subquery runs with that row's values substituted in.
EXISTSreturnsTRUEif the correlated subquery produces at least one row. It is ideal for checking whether a relationship exists.NOT EXISTSis the safest way to find rows without matching records in another table. UnlikeNOT IN, it handlesNULLvalues correctly.- Performance depends heavily on indexes. Always index the columns used in the subquery's
WHEREclause that reference the outer query. UseEXPLAINto verify execution plans. - When performance is a concern, consider rewriting the correlated subquery as a
JOINwith a derived table or CTE, which computes intermediate results once instead of per row. - Use distinct aliases for inner and outer instances of the same table to avoid accidental self-references.
Correlated subqueries are a powerful tool that lets you ask row-specific questions within a set-based language. Master them, and many complex reporting and filtering tasks that previously seemed impossible become straightforward SQL.