Tag CTE

Counting Children with CTEs

Have you ever wanted to get a running total of all of the descendants of each tree node? This sort of thing is useful, especially if you don’t want to pull back an entire object graph just to compute the count of a child collection.

IF OBJECT_ID('tempdb..#categories') IS NOT NULL
  DROP TABLE #categories;

IF OBJECT_ID('tempdb..#product_categories') IS NOT NULL
  DROP TABLE #product_categories;

CREATE TABLE #categories (CategoryId INT, ParentCategoryId INT);
CREATE TABLE #product_categories (CategoryId INT, ProductId INT);

INSERT INTO #categories VALUES (1, NULL);
INSERT INTO #categories VALUES (2, 1);
INSERT INTO #categories VALUES (3, 1);
INSERT INTO #categories VALUES (4, 2);
INSERT INTO #categories VALUES (5, 3);
INSERT INTO #categories VALUES (6, 2);

INSERT INTO #product_categories VALUES (1, 1);
INSERT INTO #product_categories VALUES (2, 15);
INSERT INTO #product_categories VALUES (2, 18);
INSERT INTO #product_categories VALUES (3, 12);
INSERT INTO #product_categories VALUES (6, 34);
INSERT INTO #product_categories VALUES (4, 35);
INSERT INTO #product_categories VALUES (5, 99);
INSERT INTO #product_categories VALUES (3, 43);
INSERT INTO #product_categories VALUES (6, 54);
INSERT INTO #product_categories VALUES (3, 92);
INSERT INTO #product_categories VALUES (2, 77);
INSERT INTO #product_categories VALUES (5, 62);
INSERT INTO #product_categories VALUES (4, 42);
INSERT INTO #product_categories VALUES (1, 11);

;
WITH  cte(CategoryId, ParentCategoryId, c)
        AS (SELECT  c1.CategoryId,
                    c1.ParentCategoryId,
                    (SELECT COUNT(*)
                     FROM   (SELECT DISTINCT
                                    ProductId
                             FROM   #product_categories AS pc
                             WHERE  pc.CategoryId = c1.CategoryId
                            ) AS t1
                    ) AS c
            FROM    #categories AS c1
            UNION ALL
            SELECT  c2.CategoryId,
                    c2.ParentCategoryID,
                    d.c
            FROM    #categories c2
                    INNER JOIN cte d ON c2.CategoryId = d.ParentCategoryId
           )
  SELECT  cte.CategoryId,
          cte.ParentCategoryId,
          SUM(c) AS ProductCount
  FROM    cte
  GROUP BY cte.CategoryId,
          cte.ParentCategoryId;

So, how does mess work? Carefully.

The innermost select, t1, builds a virtual table that is nothing more than the product count under a single Category. There is a DISTINCT on here just to make sure that there are no duplicates from the #product_categories table. Theoretically this should be a group by, but this was a super fast hotfix that needed to go into production ASAP. Mistakes happen.

Moving outwards we encounter a SELECT COUNT on t1. This is where we get the actual count of products under the current category. Pretty standard.

What’s interesting, to me at least, is that in the recursive portion of the CTE, we’re working our way UP the hierarchy, from the bottom. This makes sure that when we do a sum of c, grouped by category and parent category, we’ll have the proper count in our CTE results.

What’s the advantage of this approach? You don’t have nasty deeply nested sub-selects. You can use this to populate a ChildCount column in your table as a result of a trigger. This is also, usually, fast enough to run live or else to use as a materialized view.

Solving Business Problems with SQL

Businesses have all kind of interesting rules for working with data. Sometimes these rules are incredibly easy to implement in the database. Sometimes these rules are incredibly to implement in the application layer. Sometimes, it’s difficult to construct these rules no matter where you are in the entire application stack.

I recently came across a situation that required a bit of head scratching before I got things working correctly. What made this incredibly interesting to me was the apparent simplicity of the business rules.

The stored procedure must return a set of users, given an administrator’s user id, that the administrator is able to edit. An administrator is able to edit a user if the following criteria are met:

  1. There is at least one client in common between the administrator and the user.
  2. A user’s set of clients must all be members of the admin’s set of clients.
  3. If conditions 1 & 2 are not met, the user is excluded from the set and is effectively invisible to the administrator.

Before getting started working on real data, I created a sample set of data that I can share here:

IF OBJECT_ID('tempdb..#users') IS NOT NULL
  DROP TABLE #users;
GO

CREATE TABLE #users
(
  UserId INT,
  ClientId INT
);
GO

-- Admin
INSERT INTO #users VALUES (1, 1);
INSERT INTO #users VALUES (1, 2);
INSERT INTO #users VALUES (1, 3);

-- User with two clients
INSERT INTO #users VALUES (10, 1);
INSERT INTO #users VALUES (10, 2);

-- Another user with two clients
INSERT INTO #users VALUES (11, 2);
INSERT INTO #users VALUES (11, 3);

-- User with same clients as admin
INSERT INTO #users VALUES (12, 1);
INSERT INTO #users VALUES (12, 2);
INSERT INTO #users VALUES (12, 3);

-- User with overlapping set of clients
INSERT INTO #users VALUES (20, 2);
INSERT INTO #users VALUES (20, 3);
INSERT INTO #users VALUES (20, 4);

-- User with no matching clients
INSERT INTO #users VALUES (21, 4);
INSERT INTO #users VALUES (21, 5);
INSERT INTO #users VALUES (21, 6);

So, let’s start with an initial query:

SELECT *
  FROM #users AS a
       INNER JOIN #users AS b
          ON a.ClientId = b.ClientId

This really doesn’t tell us much, but it does give us a list of all users and any users who share the same client.

By adding a predicate to make sure that the user from the ‘a’ table is our administrator, like so:

SELECT *
  FROM #users AS a
       INNER JOIN #users AS b
          ON a.ClientId = b.ClientId
 WHERE a.UserId = 1

we’ve effectively fulfilled the first requirement:

There is at least one client in common between the administrator and the user.

Unfortunately, that’s the easy part. Determining if the user’s set of clients are a subset of the administrator’s set of clients is a little bit trickier. This is, actually, where a full outer join becomes incredibly helpful.

We’re going to change the query around considerably in order to get the desired results:

;WITH cte AS (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS row_num,
       a.UserId AS a_UserId,
       a.ClientId AS a_ClientId,
       b.UserId AS b_UserId,
       b.ClientId AS b_ClientId,
       COUNT(*) OVER (PARTITION BY a.UserId, b.UserId) AS admin_count,
       COUNT(*) OVER (PARTITION BY b.UserId) AS user_count
  FROM (SELECT UserId,
               ClientId
          FROM #users
         WHERE UserId = 1) AS a
       FULL OUTER JOIN (SELECT UserId,
                               ClientID
                          FROM #users) AS b ON a.ClientId = b.ClientId
)
SELECT c.*
  FROM cte AS c
 WHERE admin_count = user_count
   AND a_UserId  b_UserId
 ORDER BY b_UserId

So, what did we change? Well, rather than directly joining from #users to #users, we’re using two sub-selects and have changed from an inner join to a full outer join. Changing to a full outer join will, as everyone knows, give us all rows from both queries: even if the user has access to a client but the administrator does not, that user’s information will still be returned.

Second, we’ve changed the core query to only return the administrator, rather than joining on all rows where the Client Ids are the same. Why? Well, we don’t care if other users have access to the same set or subset of clients, we only care if the users and the administrator share a common set of clients.

What’s with all of the COUNT(*) OVER (…) nonsense? Well, by using COUNT(*) and PARTITION BY (as I talked about last week in Using Partitioning Functions to Find Duplicate Rows ) you can detect duplicate rows. What you can also do is determine the count of specific criteria by using the PARTITION BY clause to do some implicit grouping in your query.

The first partition by on both a.UserId and b.UserId provides the row count for the number of times the unique combination of administrator and user occur in the overall result set.

The second partition by function provides the row count of the number rows that contain the user id in the total result set.

If the administrator row count does not equal the user row count we know that there are more user rows than there are administrator rows.

Why did I include the ROW_NUMBER() in the query? When I’m writing these kinds of queries I will typically include the ROW_NUMBER() windowing function so I can apply different criteria in the OVER clause to help keep track of the query. This is especially important when you’re dealing with a query that contains a large number of surrogate keys.

Finally, I wrapped everything in a CTE. This makes it possible to actually apply the count comparisons that I mentioned above. It also makes it easier to perform any additional filtering on the column names that I might need in the future for additional business rules.

Related Reading

A Quick Introduction to Common Table Expressions

Before diving into how to go about using a common table expression, let’s take a look at what a common table expression is and why you would want to use one.

The What and Why of Common Table Expressions

Essentially, a common table expression (CTE) is a temporary result set. In this regard, a CTE is similar to a view, temporary table, or derived table. There are some important ways that a CTE is different from a view, temporary table, or a dervied table:

CTEs are not persisted, views are. If you only need a particular result set in a single query/stored procedure, creating a view will only cloud up the metadata that is being stored in the database. Using a CTE encapsulates this logic and stores it with the relevant query. If you don’t have the ability to create views, a CTE is also a great way around the lack of permissions.

A CTE can be referenced multiple times in the same statement. How is this better than referring to the same view/result set multiple times in a single query? For starters, every time you want to refer to the same result set from a view or query it’s necessary to repeat the query. This isn’t so bad if your query is as simple as ‘SELECT x, y, z FROM small_view’ but when you are creating a complicated result set that contains multiple aggregates, joins, and groupings your query will become cluttered very quickly. Not to mention the maintenance nightmare that this will cause when you have to change your query due to changes in the underlying table structure.

Every time you make use of a derived table, that query is going to be executed. When using a CTE, that result set is pulled back once and only once within a single query. Here’s a recent example I wrote: data is pulled back for multiple bills within a set of accounts. Within each account the total cost is aggregated by bill. In addition, four separate aggregations are made based on a subset of the result set for each bill. If I had done this with derived tables, that would come out to 5 hits to the underlying database. By using a CTE, only one main hit to disk is made, the result set is held in memory, the query is processed and returned to the user, and then the result set is dropped.

CTEs have limited scope. A CTE is only resident within a single query. Unlike temporary tables which will persist until the user disconnects from the server, a CTE disappears once the query has completed. This makes memory and table management a lot easier on both developers and DBAs.

CTEs can be recursive. A recursive CTE is a very powerful piece of functionality and can be used to retrieve complex hierarchies from a single table that might otherwise require multiple queries to retrieve.

CTEs offer better aggregation possibilities. Within a single query, it normally isn’t possible to produce aggregations on the output of non-deterministic functions. When the output of a non-deterministic function is included in a CTE, you can group by the output. Of course, you could also do this by including the function in a derived table, however the T-SQL to create the CTE ends up being much cleaner to read than using derived tables.

How to use a Common Table Expression

The basic structure of a CTE is very simple:

  WITH cte_name (col_a, col_b, ..., col_z)
  AS
  (
    -- query definition
  )
  SELECT col_a, col_b, ..., col_z
  FROM cte_name

First, it’s important to say that a CTE can be used with any type of query: INSERT, SELECT, UPDATE, DELETE. They are not limited to only SELECT statements.

Second, and very important, the query that immediately precedes a CTE definition has to be terminated with a semi-colon. For this reason you will often see CTEs written as:

;WITH cte_name

Third, the query that uses the CTE has to immediately follow the CTE definition. You can’t write a CTE at the top of a query batch and save it for later.

A quick example

This example requires that you have the AdventureWorks database installed. If you don’t, you can download it from codeplex. Make sure you’re using AdventureWorks and not AdventureWorks2008.

  /* non-CTE query */
  SELECT
  	AVG(OrdersPlaced)
  FROM (
  	SELECT
  		v.VendorID,
  		v.[Name] AS VendorName,
  		COUNT(*) AS OrdersPlaced
  	FROM Purchasing.PurchaseOrderHeader AS poh
  	INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  	GROUP BY v.VendorID, v.[Name]
  ) AS x

  /* CTE query */
  WITH cte (VendorId, VendorName, OrdersPlaced)
  AS (
  	SELECT
  		v.VendorID,
  		v.[Name] AS VendorName,
  		COUNT(*) AS OrdersPlaced
  	FROM Purchasing.PurchaseOrderHeader AS poh
  	INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  	GROUP BY v.VendorID, v.[Name]
  )
  SELECT
  	AVG(OrdersPlaced) AS AvgOrdersPlaced
  FROM cte;

The CTE definition retrieves the vendor id, name, and count of all orders grouped by the vendor id and name. That is to say it selects the number of orders for each vendor. Once this is collected and aggregated in the CTE definition, the average number of orders is computed.

If you were to take a look at the execution plans of both of the previous queries, there would be no difference. That’s because in this simple case there is no difference. This is a fairly contrived example to demonstrate the difference in syntax before jumping in. Let’s take a look at a less contrived example.

  /* non-CTE query */
  SELECT
  	(SELECT AVG(OrdersPlaced)
  	FROM (SELECT v.VendorID, v.[Name] AS VendorName, COUNT(*) AS OrdersPlaced
  			FROM Purchasing.PurchaseOrderHeader AS poh
  			INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  			GROUP BY v.VendorID, v.[Name]) AS x) AS AvgOrdersPlaced,
  	(SELECT MAX(OrdersPlaced)
  	FROM (SELECT v.VendorID, v.[Name] AS VendorName, COUNT(*) AS OrdersPlaced
  			FROM Purchasing.PurchaseOrderHeader AS poh
  			INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  			GROUP BY v.VendorID, v.[Name]) AS x) AS MaxOrdersPlaced,
  	(SELECT MIN(OrdersPlaced)
  	FROM (SELECT v.VendorID, v.[Name] AS VendorName, COUNT(*) AS OrdersPlaced
  			FROM Purchasing.PurchaseOrderHeader AS poh
  			INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  			GROUP BY v.VendorID, v.[Name]) AS x) AS MinOrdersPlaced

  /* CTE query */
  WITH cte (VendorId, VendorName, OrdersPlaced)
  AS (
  	SELECT
  		v.VendorID,
  		v.[Name] AS VendorName,
  		COUNT(*) AS OrdersPlaced
  	FROM Purchasing.PurchaseOrderHeader AS poh
  	INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
  	GROUP BY v.VendorID, v.[Name]
  )
  SELECT
  	AVG(OrdersPlaced) AS AvgOrdersPlaced,
  	MAX(OrdersPlaced) AS MaxOrdersPlaced,
  	MIN(OrdersPlaced) AS MinOrdersPlaced
  FROM cte;
Multiple sub-queries means multiple steps to aggregate

Multiple sub-queries means multiple steps to aggregate

Already the non-CTE based query is starting to get ugly and complicated. There are three nested sub-selects. Each one is used to compute a separate aggregation of the data in the original query, which must be repeated in each query. Unfortunately this means that the query optimizer is going to be busy building multiple result sets and merging them together. This can rapidly become very complicated and produce poorly performing queries, not to mention become a maintenance nightmare.

A CTE uses only one step to perform multiple aggregates

A CTE uses only one step to perform multiple aggregates

The CTE query is still just as simple as the first query. The two new aggregations have been added and it only took two additional lines of T-SQL. If you take a look at the execution plan, this happens in the same step during query processing, vs the separate steps that are executed in the non-CTE query. While it would be nice for the query engine to notice what we’re doing in the first query, it doesn’t make any sense to force the optimizer to make up for poor coding standards.

Recursive Common Table Expressions

There’s one other great feature of CTEs: recursion. By referencing the CTE within the definition it’s possible to create a recursive query. A recursive CTE has two parts: the anchor member and the recursive member(s). The anchor must come before the recursive members.

Execution of a recursive CTE happens in three stages:

  1. The anchor member is evaluated
  2. The recursive member(s) are evaluated
  3. Rinse and repeat until a termination condition is met

There are, of course, a few more rules than this:

  • The anchor and recursive members have to be joined by UNION, UNION ALL, EXCEPT, or INTERSECT operators.
  • The FROM of each recursive member can only refer to the CTE expression.
  • You can’t use any of the following in a recursive CTE:
    • SELECT DISTINCT
    • GROUP BY
    • HAVING
    • TOP
    • LEFT, RIGHT, or OUTER JOIN (you can use an INNER JOIN)
    • A subquery
    • Scalar aggregation (AVG, MIN, MAX, COUNT, COUNT_BIG)
    • A query hint applied to a recursive reference to the CTE
    • Data types and columns must match

With all of those rules out of the way, let’s take a look at a recursive CTE. This one is straight out of Books Online:

  WITH DirectReports(ManagerID, EmployeeID, EmployeeLevel) AS
  (
    /* anchor */
    SELECT ManagerID, EmployeeID, 0 AS EmployeeLevel
    FROM HumanResources.Employee
    WHERE ManagerID IS NULL

    UNION ALL

    /* recursion */
    SELECT e.ManagerID, e.EmployeeID, EmployeeLevel + 1
    FROM HumanResources.Employee e
        INNER JOIN DirectReports d
        ON e.ManagerID = d.EmployeeID
  )
  SELECT ManagerID, EmployeeID, EmployeeLevel
  FROM DirectReports;

This query will retrieve a recursive list of all employees of AdventureWorks, starting at the CEO and going all the way down to the bottom of the org chart. It’s a relatively simple example, but it shows how to reference the anchor member within the recursive member.

It’s important to remember that CTEs can contain CTEs and that a recursive CTE can contain multiple references to the anchor member.

For more information on CTEs, please refer to Books Online:

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.