SQL Fundamentals – Subqueries

Queries Within Queries

A subquery is simply a query nested inside another query. Think of it as a two-step process compressed into one statement: the inner query runs first and produces a result, then the outer query uses that result to complete its work.

You've already been writing multi-step logic in your tests – fetch some data, store it in a variable, then use that variable in another operation. Subqueries let you express that same pattern entirely within SQL, eliminating the round-trip to your application code.

For example, imagine you want to find all products priced above the average. Without subqueries, you'd need two separate queries:

// Step 1: Calculate average price
var avgPrice = connection.QuerySingle(
    "SELECT AVG(Price) FROM Products");

// Step 2: Find products above that average
var expensiveProducts = connection.Query(
    "SELECT * FROM Products WHERE Price > @AvgPrice",
    new { AvgPrice = avgPrice });

With a subquery, you collapse this into a single SQL statement:

SELECT ProductId, Name, Price
FROM Products
WHERE Price > (SELECT AVG(Price) FROM Products);

The database calculates the average, then immediately uses it to filter products – all in one atomic operation.

Subqueries in the WHERE Clause with IN

The most common use of subqueries is filtering rows based on values from another table. The IN operator combined with a subquery lets you check if a value exists in a dynamically generated list.

Here's a query that finds all users who have placed at least one order:

SELECT UserId, Email, FirstName, LastName
FROM Users
WHERE UserId IN (SELECT DISTINCT UserId FROM Orders);

The subquery (SELECT DISTINCT UserId FROM Orders) produces a list of user IDs that appear in the Orders table. The outer query then filters Users to show only those whose UserId appears in that list.

This is conceptually similar to a JOIN, but sometimes a subquery reads more naturally, especially when you're thinking in terms of "existence" rather than "combining data." Compare these two equivalent queries:

-- Using subquery (reads as: "users who are in the orders list")
SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT DISTINCT UserId FROM Orders);

-- Using JOIN (reads as: "combine users and orders, show unique users")
SELECT DISTINCT u.UserId, u.Email
FROM Users u
INNER JOIN Orders o ON u.UserId = o.UserId;

Both produce the same result. The subquery version explicitly separates the "get the list" step from the "filter using the list" step, which can be clearer when the logic is complex.

Use DISTINCT in subqueries that might return duplicate values – it prevents checking the same value multiple times and can improve performance.

NOT IN for Finding Gaps

The inverse pattern – NOT IN – finds rows that don't match any value in the subquery result. This is incredibly useful for identifying missing relationships:

SELECT UserId, Email, FirstName, LastName
FROM Users
WHERE UserId NOT IN (SELECT DISTINCT UserId FROM Orders);

This returns users who have never placed an order – the equivalent of the LEFT JOIN ... WHERE right_table.id IS NULL pattern you learned in the previous lesson.

Here's a critical gotcha with NOT IN: if the subquery returns any NULL values, the entire NOT IN condition evaluates to NULL (not TRUE or FALSE), causing zero rows to match. Consider this scenario:

-- This might return zero rows unexpectedly
SELECT ProductId, Name
FROM Products
WHERE ProductId NOT IN (SELECT ProductId FROM OrderItems);

If OrderItems.ProductId allows NULL values and contains any, this query returns nothing. The fix? Filter out NULL values in the subquery:

SELECT ProductId, Name
FROM Products
WHERE ProductId NOT IN (
    SELECT ProductId
    FROM OrderItems
    WHERE ProductId IS NOT NULL
);

NOT IN and NULL Values

Always add WHERE column IS NOT NULL to subqueries used with NOT IN. This prevents the NULL comparison trap that silently breaks your query logic. Alternatively, use NOT EXISTS (covered later), which handles NULL values correctly by default.

In test automation, NOT IN helps you find data gaps – users without profiles, orders without tracking numbers, or any expected relationship that's missing.

Scalar Subqueries: Single Value Results

A scalar subquery returns exactly one value (one row, one column). You can use scalar subqueries anywhere you'd use a single value – in WHERE clauses, comparisons, or even in calculations.

The products-above-average example from earlier is a scalar subquery:

SELECT ProductId, Name, Price
FROM Products
WHERE Price > (SELECT AVG(Price) FROM Products);

The subquery (SELECT AVG(Price) FROM Products) returns a single number – perhaps 65.50. The database then evaluates WHERE Price > 65.50 for each product.

Here's another example that finds orders exceeding the average order value:

SELECT OrderId,
       UserId,
       TotalAmount,
       (SELECT AVG(TotalAmount) FROM Orders) AS AvgOrderValue
FROM Orders
WHERE TotalAmount > (SELECT AVG(TotalAmount) FROM Orders)
ORDER BY TotalAmount DESC;

Notice the subquery appears twice – once in WHERE for filtering, once in SELECT to show the comparison baseline. The database typically optimizes this to calculate the average only once.

Scalar subqueries must return exactly one value. If they return multiple rows or zero rows, you'll get an error. Aggregate functions like AVG(), MAX(), MIN(), and COUNT() are safe because they always return one value.

In testing, scalar subqueries help you create self-referencing validation queries that compare individual records against dataset-wide statistics.

Subqueries in the SELECT Clause

You can embed subqueries directly in the SELECT clause to calculate values for each row. Each subquery runs once per row in the outer query:

SELECT u.UserId,
       u.Email,
       u.FirstName,
       (SELECT COUNT(*)
        FROM Orders o
        WHERE o.UserId = u.UserId) AS OrderCount,
       (SELECT SUM(TotalAmount)
        FROM Orders o
        WHERE o.UserId = u.UserId) AS TotalSpent
FROM Users u;

For each user, this query runs two subqueries: one counting their orders, another summing their spending. The result includes these calculated values as columns:

UserId | Email              | FirstName | OrderCount | TotalSpent
-------|--------------------|-----------|-----------|-----------
101    | [email protected]   | John      | 5         | 450.00
102    | [email protected]   | Jane      | 0         | NULL
103    | [email protected]    | Bob       | 12        | 1250.00

Notice Jane has OrderCount = 0 but TotalSpent = NULL. That's because COUNT(*) returns 0 for no rows, while SUM() returns NULL for no rows. You can fix this with COALESCE:

SELECT u.UserId,
       u.Email,
       (SELECT COALESCE(SUM(TotalAmount), 0)
        FROM Orders o
        WHERE o.UserId = u.UserId) AS TotalSpent
FROM Users u;

Subqueries in SELECT are convenient but can be slower than JOIN with GROUP BY for large datasets, since they execute once per row. Use them when readability matters more than maximum performance, or when you're calculating values that don't easily map to JOIN patterns.

These are called correlated subqueries because they reference columns from the outer query (u.UserId in this case).

Subqueries in the FROM Clause

You can use a subquery in the FROM clause as if it were a table. This is called a derived table or inline view. You must give it an alias:

SELECT *
FROM (
    SELECT UserId, COUNT(*) AS OrderCount
    FROM Orders
    GROUP BY UserId
) AS UserOrders
WHERE OrderCount > 5;

The subquery creates a temporary result set with two columns (UserId and OrderCount), aliased as UserOrders. The outer query treats this exactly like a table, filtering to show only users with more than 5 orders.

This pattern shines when you need to perform operations on aggregated data. For example, finding users whose order count exceeds the average order count across all users:

SELECT UserId, OrderCount
FROM (
    SELECT UserId, COUNT(*) AS OrderCount
    FROM Orders
    GROUP BY UserId
) AS UserOrders
WHERE OrderCount > (
    SELECT AVG(OrderCount)
    FROM (
        SELECT COUNT(*) AS OrderCount
        FROM Orders
        GROUP BY UserId
    ) AS AvgCalc
);

This nests subqueries within subqueries: the innermost calculates order counts per user, the middle subquery averages those counts, and the outer query filters users above that average.

While powerful, deeply nested subqueries can become hard to read. Many developers prefer Common Table Expressions (CTEs) for complex multi-step queries – they're essentially named subqueries that improve readability. We won't cover CTEs in this fundamentals series, but they're worth exploring once you're comfortable with basic subqueries.

Use derived tables when you need to aggregate data first, then perform further calculations or filtering on those aggregates.

Correlated vs Non-Correlated Subqueries

Understanding the difference between these two types helps you predict query performance and choose the right approach.

Non-correlated subqueries are independent – they can run once, produce a result, and the outer query uses that result. Examples:

-- Non-correlated: subquery runs once, returns average
SELECT * FROM Products
WHERE Price > (SELECT AVG(Price) FROM Products);

-- Non-correlated: subquery runs once, returns list of user IDs
SELECT * FROM Users
WHERE UserId IN (SELECT UserId FROM Orders);

The database can execute these subqueries first, store the result, then use it for the outer query. Efficient and straightforward.

Correlated subqueries reference columns from the outer query, so they must run once for each row in the outer query:

-- Correlated: runs once per product
SELECT p1.ProductId, p1.Name, p1.Price
FROM Products p1
WHERE p1.Price > (
    SELECT AVG(p2.Price)
    FROM Products p2
    WHERE p2.CategoryId = p1.CategoryId  -- References outer query
);

This finds products priced above the average for their specific category. For each product in p1, the subquery calculates the average price for that product's category. If you have 1,000 products, the subquery runs 1,000 times.

Correlated subqueries are more flexible but potentially slower. Modern database optimizers are quite clever about minimizing redundant calculations, but for very large datasets, you might need to rewrite correlated subqueries as JOINs with window functions or other techniques.

When to Use Subqueries vs JOINs

Prefer subqueries when: You're checking existence or comparing to aggregates. The logic reads more naturally as "filter where something is true" rather than "combine these tables."

Prefer JOINs when: You need columns from multiple tables in your result, or when performance profiling shows subqueries are slow. JOIN with GROUP BY often performs better than correlated subqueries on large datasets.

There's no absolute rule – both approaches work. Prioritize readability first, then optimize if performance testing reveals issues.

Start with the approach that makes your query logic clearest to read, then profile and optimize if needed.

EXISTS and NOT EXISTS: Efficient Existence Checks

The EXISTS operator is one of SQL's most powerful and efficient tools for checking whether related data exists. Unlike IN, which builds a complete list of values to compare against, EXISTS simply checks: "Are there any rows that match this condition?" The moment it finds one matching row, it stops searching and returns true.

The basic syntax uses a correlated subquery:

SELECT UserId, Email, FirstName
FROM Users u
WHERE EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.UserId = u.UserId
);

This finds users who have placed at least one order. For each user in the outer query, the subquery checks if any orders exist for that user. The subquery doesn't need to return actual data – the convention is SELECT 1 because EXISTS only cares about whether rows exist, not what those rows contain.

You could write SELECT * or SELECT NULL instead of SELECT 1 – all work identically. The database never looks at the selected columns; it just checks if the WHERE condition produces any matching rows.

EXISTS vs IN: Performance Differences

Compare EXISTS to an equivalent query using IN:

-- Using EXISTS
SELECT UserId, Email, FirstName
FROM Users u
WHERE EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.UserId = u.UserId
);

-- Using IN (equivalent logic)
SELECT UserId, Email, FirstName
FROM Users
WHERE UserId IN (
    SELECT DISTINCT UserId
    FROM Orders
);

Both return the same results, but EXISTS often performs better, especially with large datasets. Here's why:

  • Short-circuit evaluation: EXISTS stops as soon as it finds one matching row. If a user has 100 orders, EXISTS finds the first one and moves on – it doesn't count all 100.
  • No list building: IN must build a complete list of all matching values (potentially thousands of user IDs), then check each user against that list. EXISTS checks each user individually without building an intermediate list.
  • Index optimization: Database query optimizers often handle EXISTS more efficiently, using indexes to quickly verify existence without scanning entire tables.

Adding Conditions to EXISTS

You can add complex conditions inside the EXISTS subquery to check for specific types of related data:

SELECT UserId, Email, FirstName
FROM Users u
WHERE EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.UserId = u.UserId
      AND o.TotalAmount > 1000
      AND o.Status = 'Completed'
);

This finds users who have placed at least one completed order over $1,000. The subquery can be as complex as needed – joins, aggregates, multiple conditions – EXISTS still just checks: "Do any rows match these criteria?"

NOT EXISTS: Finding Missing Relationships

NOT EXISTS is the inverse – it returns true when no matching rows are found. This is perfect for identifying gaps in your data:

SELECT UserId, Email, FirstName
FROM Users u
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.UserId = u.UserId
);

This finds users who have never placed an order. For each user, the database checks if any orders exist – when the answer is "no orders found," that user appears in the results.

NOT EXISTS vs NOT IN: Critical Differences

NOT EXISTS has a crucial advantage over NOT IN: it handles NULL values correctly. Consider this scenario:

-- This might return ZERO rows unexpectedly due to NULL
SELECT ProductId, Name
FROM Products
WHERE ProductId NOT IN (SELECT ProductId FROM OrderItems);

-- This works correctly even if OrderItems.ProductId contains NULL
SELECT ProductId, Name
FROM Products p
WHERE NOT EXISTS (
    SELECT 1
    FROM OrderItems oi
    WHERE oi.ProductId = p.ProductId
);

If OrderItems.ProductId contains any NULL values, the NOT IN query fails silently – it returns zero rows because SQL's three-valued logic (TRUE, FALSE, NULL) makes value NOT IN (list_with_null) evaluate to NULL (not TRUE or FALSE), causing no rows to match.

NOT EXISTS doesn't have this problem. It checks for the existence of matching rows, and NULL comparisons simply don't match – which is exactly what you want.

When to Use EXISTS

Use EXISTS when: You're checking whether related data exists, especially when verifying relationships or filtering based on the presence/absence of child records. It's perfect for questions like "Show me users who have orders" or "Find products that have never sold."

Use NOT EXISTS when: You need to find missing relationships. It's safer than NOT IN because it handles NULL values correctly, and it's often clearer in intent – you're explicitly checking "Does this relationship not exist?"

Prefer EXISTS over IN when: The subquery might return many values, or when you're dealing with potentially nullable columns. The performance difference can be significant on large datasets.

EXISTS and NOT EXISTS are essential tools in your SQL toolkit – they're fast, safe with NULL values, and express intent clearly when checking for the presence or absence of related data.