SQL Fundamentals – Set Operations

Combining Query Results with Set Operations

Sometimes you need to merge, compare, or contrast results from completely different queries. That's where set operations come in – they treat query results as mathematical sets and perform operations like union (combine everything), intersection (find common elements), and difference (find what's in one set but not the other).

These operations work on entire result sets, not individual rows. They stack results vertically, creating one combined dataset from multiple queries. Think of them like venn diagram operations – combining circles, finding overlaps, or identifying unique portions.

SQL provides three primary set operations:

UNION / UNION ALL: Combine results from multiple queries
INTERSECT: Find rows that appear in both queries
EXCEPT (or MINUS): Find rows in the first query that don't appear in the second

All set operations follow the same fundamental rules: queries must return the same number of columns, and corresponding columns must have compatible data types. The result uses column names from the first query.

In test automation, these operations help you compare datasets, merge results from different environments, or identify discrepancies between expected and actual data.

UNION: Combining Results with Deduplication

The UNION operator stacks results from two or more queries vertically, automatically removing duplicate rows. It's perfect when you want to create a unified list from multiple sources without repetition.

Here's a scenario: you want to create a report showing both active users and recently deactivated users, but marked differently:

SELECT UserId, Email, FirstName, LastName, 'Active' AS Status
FROM Users
WHERE IsActive = 1

UNION

SELECT UserId, Email, FirstName, LastName, 'Recently Deactivated' AS Status
FROM Users
WHERE IsActive = 0
  AND DeactivatedDate > CURRENT_DATE - INTERVAL '7 days';

The result combines both groups into one table:

UserId | Email              | FirstName | LastName | Status
-------|--------------------|-----------|----------|----------------------
101    | [email protected]   | John      | Smith    | Active
102    | [email protected]   | Jane      | Doe      | Active
103    | [email protected]    | Bob       | Wilson   | Recently Deactivated
104    | [email protected]  | Alice     | Johnson  | Active

If a row appears in both result sets (which shouldn't happen in this example since a user can't be both active and deactivated), UNION includes it only once. The deduplication process compares all columns – two rows are considered duplicates only if every column value matches.

Rules for UNION

Column count must match: Each SELECT must return the same number of columns
Data types must be compatible: The first column of each query must have compatible types, the second column must match, and so on
Column names come from the first query: The result set uses column names from the first SELECT
Automatic sorting for deduplication: UNION typically sorts results to identify duplicates, which adds processing overhead

Use UNION when you need to ensure each unique row appears exactly once in the combined result set.

UNION ALL: Keep All Rows Including Duplicates

If you want to keep all rows from both queries, including duplicates, use UNION ALL. It simply appends the second result set to the first without any deduplication:

SELECT CategoryId, 'Low Price' AS PriceRange, Price
FROM Products
WHERE Price < 50

UNION ALL

SELECT CategoryId, 'High Stock' AS StockRange, StockQuantity
FROM Products
WHERE StockQuantity > 100;

If a product costs less than $50 and has over 100 units in stock, it appears twice in the results – once from each query. UNION ALL doesn't perform deduplication, making it significantly faster than plain UNION.

When to Choose UNION vs UNION ALL

Use UNION when:

You need to eliminate duplicates and ensure each row appears only once
The queries might produce overlapping results
You're creating a master list where uniqueness matters

Use UNION ALL when:

You know there won't be duplicates (the queries are logically exclusive)
You specifically want to preserve all rows, including duplicates
Performance matters and the deduplication overhead is unnecessary

Here's a practical example where UNION ALL is clearly the right choice:

-- Combine orders from different years (no duplicates possible)
SELECT OrderId, UserId, OrderDate, TotalAmount, 'Q1 2024' AS Period
FROM Orders
WHERE OrderDate BETWEEN '2024-01-01' AND '2024-03-31'

UNION ALL

SELECT OrderId, UserId, OrderDate, TotalAmount, 'Q2 2024' AS Period
FROM Orders
WHERE OrderDate BETWEEN '2024-04-01' AND '2024-06-30'

-- ... Q3 and Q4

Since the date ranges are mutually exclusive, no order can appear in multiple quarters. Using UNION ALL avoids unnecessary deduplication overhead.

Performance Tip: Default to UNION ALL

Unless you specifically need deduplication, use UNION ALL. It's faster because it skips the sorting and comparison steps required to eliminate duplicates. If your queries naturally produce distinct results (different date ranges, different categories, different statuses), UNION ALL gives you the same logical result with better performance.

In test automation, UNION ALL helps you combine test results from different environments or merge log entries from multiple sources without losing any data.

INTERSECT: Finding Common Rows

The INTERSECT operator returns only rows that appear in both query results. It's the SQL equivalent of finding the overlap in a venn diagram.

-- Find users who have both placed orders AND written product reviews
SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM Orders)

INTERSECT

SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM ProductReviews);

This returns only users who appear in both result sets – those who have placed orders and written reviews. Users who only ordered (but never reviewed) or only reviewed (but never ordered) don't appear.

You can achieve the same result with joins or EXISTS, but INTERSECT expresses the intent more clearly when you're explicitly looking for the overlap between two sets:

-- Alternative using EXISTS (more verbose)
SELECT UserId, Email
FROM Users u
WHERE EXISTS (SELECT 1 FROM Orders WHERE UserId = u.UserId)
  AND EXISTS (SELECT 1 FROM ProductReviews WHERE UserId = u.UserId);

EXCEPT: Finding Differences Between Sets

The EXCEPT operator (called MINUS in Oracle) returns rows from the first query that don't appear in the second query. It's perfect for finding what's missing or identifying differences between datasets.

-- Find users who have placed orders but never written reviews
SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM Orders)

EXCEPT

SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM ProductReviews);

This returns users who appear in the first result set (users with orders) but not in the second result set (users with reviews). Users who have both ordered and reviewed are excluded from the results.

Order matters with EXCEPT. Switching the queries reverses the logic:

-- Find users who have written reviews but never placed orders
SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM ProductReviews)

EXCEPT

SELECT UserId, Email
FROM Users
WHERE UserId IN (SELECT UserId FROM Orders);

This finds the opposite set – reviewers who haven't purchased anything.