Using Partitioning Functions to Find Duplicate Rows
Did you know that you can use the OVER clause, in conjunction with PARTITION BY to find duplicate rows in the database? Here’s a business case for you: There is a table consisting of UserId, SiteId, and ClientId:``` IF OBJECT_ID(‘window_test’) IS NOT NULL DROP TABLE window_test;
CREATE TABLE window_test ( UserId INT, SiteId INT, ClientId INT );
Over time, data has been inserted into this table both through an application and through SSMS:
INSERT INTO window_test
SELECT 1, 1, 1
UNION ALL
SELECT 1, 2, 2
UNION ALL
SELECT 2, 1, 1
UNION ALL
SELECT 3, 1, 1
UNION ALL
SELECT 1, 3, 1;
SELECT * FROM window_test;
Unfortunately, what happened here is that the second row for user 1 (1, 2, 2) should not have been entered. There’s a business rule in place that a user should only belong to one client. We know that this user exists, but how many more users exist? Well, the first way to test for this that most people would immediately reach for is to use a distinct count:
SELECT UserId,
COUNT(DISTINCT ClientId) AS user_count
FROM window_test
GROUP BY UserId;
And this works perfectly. But the point of this is to show you a different way to do things. You can also accomplish the same thing using OVER and PARTITION BY:
SELECT UserId,
COUNT(*) OVER (PARTITION BY UserId) AS user_count
FROM window_test
GROUP BY UserId, ClientId;
Unfortunately, this gives two results for the same user with the same count! Well, the easy way around that is to use a GROUP BY in an outer query, like so:
SELECT *
FROM (SELECT UserId,
COUNT(*) OVER (PARTITION BY UserId) AS user_count
FROM window_test
GROUP BY UserId, ClientId) AS x
GROUP BY UserId, user_count;