Using Partitioning Functions to Find Duplicate Rows
Did you know that you can use the OVER clause, in conjunction with PARTITION BY to find duplicate rows in the database?
Here’s a business case for you:
There is a table consisting of UserId, SiteId, and ClientId:
IF OBJECT_ID('window_test') IS NOT NULL DROP TABLE window_test; CREATE TABLE window_test ( UserId INT, SiteId INT, ClientId INT );
Over time, data has been inserted into this table both through an application and through SSMS:
INSERT INTO window_test SELECT 1, 1, 1 UNION ALL SELECT 1, 2, 2 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 1, 3, 1; SELECT * FROM window_test;
Unfortunately, what happened here is that the second row for user 1 (1, 2, 2) should not have been entered. There’s a business rule in place that a user should only belong to one client. We know that this user exists, but how many more users exist?
Well, the first way to test for this that most people would immediately reach for is to use a distinct count:
SELECT UserId, COUNT(DISTINCT ClientId) AS user_count FROM window_test GROUP BY UserId;
And this works perfectly. But the point of this is to show you a different way to do things. You can also accomplish the same thing using OVER and PARTITION BY:
SELECT UserId, COUNT(*) OVER (PARTITION BY UserId) AS user_count FROM window_test GROUP BY UserId, ClientId;
Unfortunately, this gives two results for the same user with the same count! Well, the easy way around that is to use a GROUP BY in an outer query, like so:
SELECT * FROM (SELECT UserId, COUNT(*) OVER (PARTITION BY UserId) AS user_count FROM window_test GROUP BY UserId, ClientId) AS x GROUP BY UserId, user_count;
Yes, I know I shouldn’t use a SELECT * in a query. I’m just trying to illustrate a point. Do as I say, not as I do.
Admittedly, for this situation, using a distinct count makes much more sense and is far less convoluted. However, there are many other situations where it’s preferable (if not necessary) to write a query using COUNT(*) OVER (…) instead of using convoluted logic or subqueries to return a row count.
Update: In the original comments, Mladen Prajdic suggests using checksum for duplicate detection.
This is Jeremiah
I live in Portland, OR. I have two dogs.
I recently received a Master's of Science in Computer Science from Portland State University.
I'm was Microsoft MVP from 2009 - 2018 with a pile of certifications. Somewhere along the way, I wrote a database client for Riak and then handed it off to the community.