A Simple Refactoring – Avoiding Table Scans
Refactoring SQL code can be pretty easy. Just like refactoring any other programming language, sometimes you have to look around to find the culprit.
We have a slow running query that frequently times out when run by the users. A quick look at the query told me what was wrong. I found this clause in the middle of an otherwise simple query:
LEFT JOIN ( SELECT DISTINCT key, column_a FROM ImportantData WHERE FieldName = 'VALUES' UNION ALL SELECT DISTINCT key, column_a FROM ImportantData_History WHERE FieldName = 'VALUES' ) AS t ON other.key = t.key LEFT JOIN ( SELECT DISTINCT key, column_b FROM ImportantData WHERE FieldName = 'VALUESXY' UNION ALL SELECT DISTINCT key, column_b FROM ImportantData_History WHERE FieldName = 'VALUESXY' ) AS t2 ON other.key = t2.key
The problem with this query is that we’re reading each table twice for the same data. I checked in the execution plan and, sure enough, SQL Server was performing two scans on the underlying data. After some careful thought I realized that I could accomplish the same thing with a single query:
LEFT JOIN ( SELECT key, VALUES AS column_a, VALUESXY AS column_b FROM ( SELECT DISTINCT key, FieldName, CASE WHEN FieldName = 'VALUES' THEN column_a WHEN FieldName = 'VALUESXY' THEN column_b END AS content FROM ImportantData WHERE FieldName = 'VALUES' OR FieldName = 'VALUESXY' UNION ALL SELECT DISTINCT key, FieldName, CASE WHEN FieldName = 'VALUES' THEN column_a WHEN FieldName = 'VALUESXY' THEN column_b END AS content FROM ImportantData_History WHERE FieldName = 'VALUES' OR FieldName = 'VALUESXY' ) AS source PIVOT( MAX(CONTENT) FOR FieldName IN ([VALUES], [VALUESXY]) ) AS pvt ) AS t ON other.key= t.key
The upside to this unreadable pile of junk is that it scans each table once and only once. This reduces the load on disk, the amount of data that needs to be read into memory, and it most likely reduces the amount of data stored in memory. And, while this only reduced the estimated query cost by 40 (580 down to 540), we all know that the query cost is only one factor in what actually affects the overall run time of a query.
Many readers are probably mortified that there’s a SELECT DISTINCT in this query. I am too. That being said, I plucked the low-hanging fruit and managed to get a huge performance boost as a result – query execution time went from over 1 minute to 14 seconds. With an improvement like that, removing that distinct isn’t on the radar any more.
This is Jeremiah
I live in Portland, OR. I have two dogs.
I recently received a Master's of Science in Computer Science from Portland State University.
I'm was Microsoft MVP from 2009 - 2018 with a pile of certifications. Somewhere along the way, I wrote a database client for Riak and then handed it off to the community.