Using Partitioning Functions to Find Duplicate Rows

April 21, 2009

Did you know that you can use the OVER clause, in conjunction with PARTITION BY to find duplicate rows in the database? Here’s a business case for you: There is a table consisting of UserId, SiteId, and ClientId:``` IF OBJECT_ID(‘window_test’) IS NOT NULL DROP TABLE window_test;

CREATE TABLE window_test ( UserId INT, SiteId INT, ClientId INT );

Over time, data has been inserted into this table both through an application and through SSMS: INSERT INTO window_test SELECT 1, 1, 1 UNION ALL SELECT 1, 2, 2 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 1, 3, 1;

SELECT * FROM window_test;

Unfortunately, what happened here is that the second row for user 1 (1, 2, 2) should not have been entered. There’s a business rule in place that a user should only belong to one client. We know that this user exists, but how many more users exist? Well, the first way to test for this that most people would immediately reach for is to use a distinct count: SELECT UserId, COUNT(DISTINCT ClientId) AS user_count FROM window_test GROUP BY UserId;

And this works perfectly. But the point of this is to show you a different way to do things. You can also accomplish the same thing using OVER and PARTITION BY: SELECT UserId, COUNT(*) OVER (PARTITION BY UserId) AS user_count FROM window_test GROUP BY UserId, ClientId;

Unfortunately, this gives two results for the same user with the same count! Well, the easy way around that is to use a GROUP BY in an outer query, like so: SELECT * FROM (SELECT UserId, COUNT(*) OVER (PARTITION BY UserId) AS user_count FROM window_test GROUP BY UserId, ClientId) AS x GROUP BY UserId, user_count;