A Simple Refactoring – Avoiding Table Scans

Refactoring SQL code can be pretty easy. Just like refactoring any other programming language, sometimes you have to look around to find the culprit.

We have a slow running query that frequently times out when run by the users. A quick look at the query told me what was wrong. I found this clause in the middle of an otherwise simple query:

        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUES'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUES'
                  ) AS t ON other.key = t.key
        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUESXY'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUESXY'
                  ) AS t2 ON other.key = t2.key

The problem with this query is that we’re reading each table twice for the same data. I checked in the execution plan and, sure enough, SQL Server was performing two scans on the underlying data. After some careful thought I realized that I could accomplish the same thing with a single query:

        LEFT JOIN (
                   SELECT key,
                          VALUES AS column_a,
                          VALUESXY AS column_b
                   FROM   (
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                           UNION ALL
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData_History
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                          ) AS source PIVOT( MAX(CONTENT) FOR FieldName IN ([VALUES],
                                                              [VALUESXY]) ) AS pvt
                  ) AS t ON other.key= t.key

The upside to this unreadable pile of junk is that it scans each table once and only once. This reduces the load on disk, the amount of data that needs to be read into memory, and it most likely reduces the amount of data stored in memory. And, while this only reduced the estimated query cost by 40 (580 down to 540), we all know that the query cost is only one factor in what actually affects the overall run time of a query.

Many readers are probably mortified that there’s a SELECT DISTINCT in this query. I am too. That being said, I plucked the low-hanging fruit and managed to get a huge performance boost as a result – query execution time went from over 1 minute to 14 seconds. With an improvement like that, removing that distinct isn’t on the radar any more.

Menu