Using Partitioning Functions to Find Duplicate Rows

Did you know that you can use the OVER clause, in conjunction with PARTITION BY to find duplicate rows in the database?

Here’s a business case for you:

There is a table consisting of UserId, SiteId, and ClientId:

IF OBJECT_ID('window_test') IS NOT NULL
  DROP TABLE window_test;

CREATE TABLE window_test
(
  UserId INT,
  SiteId INT,
  ClientId INT
);

Over time, data has been inserted into this table both through an application and through SSMS:

INSERT INTO window_test
SELECT 1, 1, 1
UNION ALL
SELECT 1, 2, 2
UNION ALL
SELECT 2, 1, 1
UNION ALL
SELECT 3, 1, 1
UNION ALL
SELECT 1, 3, 1;

SELECT * FROM window_test;

Unfortunately, what happened here is that the second row for user 1 (1, 2, 2) should not have been entered. There’s a business rule in place that a user should only belong to one client. We know that this user exists, but how many more users exist?

Well, the first way to test for this that most people would immediately reach for is to use a distinct count:

SELECT UserId, 
       COUNT(DISTINCT ClientId) AS user_count
  FROM window_test
 GROUP BY UserId;

And this works perfectly. But the point of this is to show you a different way to do things. You can also accomplish the same thing using OVER and PARTITION BY:

SELECT UserId, 
       COUNT(*) OVER (PARTITION BY UserId) AS user_count
  FROM window_test
 GROUP BY UserId, ClientId;

Unfortunately, this gives two results for the same user with the same count! Well, the easy way around that is to use a GROUP BY in an outer query, like so:

SELECT *
  FROM (SELECT UserId, 
               COUNT(*) OVER (PARTITION BY UserId) AS user_count
          FROM window_test
         GROUP BY UserId, ClientId) AS x
 GROUP BY UserId, user_count;

Yes, I know I shouldn’t use a SELECT * in a query. I’m just trying to illustrate a point. Do as I say, not as I do.

Admittedly, for this situation, using a distinct count makes much more sense and is far less convoluted. However, there are many other situations where it’s preferable (if not necessary) to write a query using COUNT(*) OVER (…) instead of using convoluted logic or subqueries to return a row count.

Update: In the original comments, Mladen Prajdic suggests using checksum for duplicate detection.

Menu