Tag sql refactoring

I’m Presenting at SQL Saturday 67

No, this isn’t a re-run! I’ll be presenting about Refacatoring SQL at SQL Saturday 67 in Chicago this coming Saturday.

I’m really excited about this opportunity. I had a blast presenting in Chicago last year and I’m looking forward to doing it again this year. There’s a great line up of speakers. If you’re in the Chicago area and want to get your learn on, I suggest you swing on by the DeVry Addison campus and check it out.

Here’s the title and abstract:

Refactoring SQL

Refactoring SQL is not like refactoring application code. This talk will demonstrate proven SQL refactoring techniques that will help you identify where performance gains can be made, apply quick fixes, improve readability, and help you quickly locate places to make sweeping performance improvements. Jeremiah Peschka has years of hands on experience tuning SQL applications for performance, throughput, and concurrency.

Rounding to the Nearest X Minutes, the Lazy Way

A lot of people use calendar tables. I’ve blogged about it before. They’re incredibly helpful. Now, have you ever needed a table of minutes?

You’re probably asking, “Jeremiah, why the heck would I ever need a table of minutes?” Well, dear reader, I’m going to tell you that. Please stop interrupting.

Let’s say you have a report. This report shows the sum of sales per 5 minute increment. You could do a lot of trickery with math to make this report happen, doing things like this to get the 5 minute interval:

SELECT
    (DATEPART(mi, CAST('2009-01-01 00:01:00' as datetime))+4) / 5 * 5 AS [05]

It works, but it’s ugly as sin. I said to myself, “Self, there has to be a better way.” Turns out that there is a better way: table valued functions!

I created a table valued function to return 60 rows, one for each minute. It also rounds to the nearest 5, 10, 15, and 30 minute intervals. This makes it possible to change the reporting interval very easily by using a join. Check it out:

SELECT DATEPART(hh, s.SaleTime) AS TheHour,
  mt.Five AS NearestFive,
  SUM(s.SalesAmount) AS AmountSold
FROM dbo.Sales AS s
INNER JOIN dbo.MinutesTable() AS mt ON DATEPART(mi, s.SaleTime) = mt.[Minute]
WHERE -- something is true
GROUP BY DATEPART(hh, s.SaleTime), mt.Five

Why do I call this the lazy way to do this? Because now that I’ve written it once I never have to do it again. I can add new columns to dbo.MinutesTable() without having to worry about breaking anything or copying code incorrectly from one query to another.

Here’s the code to create the function:

CREATE FUNCTION dbo.MinutesTable ()
RETURNS @minutes TABLE (
  [Minute] TINYINT,
  Five TINYINT,
  Ten TINYINT,
  Fifteen TINYINT,
  Thirty TINYINT
)
AS BEGIN

INSERT INTO @minutes VALUES (0,60,60,60,60)
                            (1,5,10,15,30),
                            (2,5,10,15,30),
                            /* you get the gist of it */
                            (59,60,60,60,60);

RETURN ;

END

Update: Thanks to Brad Schulz this has gone for a single post about my laziness to an example of a refactoring you can make. If you ever see a table being generated like this, you can take a look at it and determine how you can change it into an inline select. The query optimizer is going to do something completely different for each plan. With the inline select, SQL Server is able to determine that there are 60 rows in our table and can build a much more efficient execution plan.

CREATE FUNCTION [dbo].[MinutesTable] ()
RETURNS TABLE
AS 

RETURN
  SELECT  0 AS [Minute], 60 AS Five, 50 AS Ten, 60 AS Fifteen, 60 AS Thirty
  UNION ALL
  SELECT  1, 5, 10, 15, 30
  UNION ALL
  /* ... this still makes sense, right? ... */
  UNION ALL
  SELECT  59, 60, 60, 60, 60 ;
GO

A Simple Refactoring – Functions in the WHERE Clause

Putting functions in the where clause of a SQL Statement can cause performance problems. If you aren’t careful, these problems can add up and bring a powerful production system to its knees.

For this example, I’m using the Northwind database, but you could do this on any database.

The first thing to do is put an index on the OrderDate column:

CREATE INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate) ;

Take a look at this first query:

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   DATEADD(MM, 2, o.OrderDate) < '1997-05-04' ;

Scanning the stars... or something.

This returns very quickly on the Northwind database because we have such a small volume of data, but a query like this could cause a lot of problems in production. Why? Well, it’s doing a full table scan. Don’t believe me? Take a look a the execution plan over there.

The database is forcing a table scan because of the use of the DATEADD function on an indexed column. SQL Server will have to compute the value of every OrderDate plus 2 months in the database and then compare it to May 4th, 1997.

What happens if we flip this around? What if we change our query so that we’re doing the math on our input value of May 4th, 1997?

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   o.OrderDate < DATEADD(MM, -2, '1997-05-04') ;

Seek and ye shall find

It turns out that if we move the function from the table to the variable (although it’s really inlined in this case) we can get SQL Server to perform an index seek. This is going to be orders of magnitude faster than scanning every row in the table and comparing the values.

So, there you have it: another easy fix that can save you a lot of headache when you’re trying to tune queries.

A Simple Refactoring – Avoiding Table Scans

Refactoring SQL code can be pretty easy. Just like refactoring any other programming language, sometimes you have to look around to find the culprit.

We have a slow running query that frequently times out when run by the users. A quick look at the query told me what was wrong. I found this clause in the middle of an otherwise simple query:

        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUES'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUES'
                  ) AS t ON other.key = t.key
        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUESXY'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUESXY'
                  ) AS t2 ON other.key = t2.key

The problem with this query is that we’re reading each table twice for the same data. I checked in the execution plan and, sure enough, SQL Server was performing two scans on the underlying data. After some careful thought I realized that I could accomplish the same thing with a single query:

        LEFT JOIN (
                   SELECT key,
                          VALUES AS column_a,
                          VALUESXY AS column_b
                   FROM   (
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                           UNION ALL
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData_History
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                          ) AS source PIVOT( MAX(CONTENT) FOR FieldName IN ([VALUES],
                                                              [VALUESXY]) ) AS pvt
                  ) AS t ON other.key= t.key

The upside to this unreadable pile of junk is that it scans each table once and only once. This reduces the load on disk, the amount of data that needs to be read into memory, and it most likely reduces the amount of data stored in memory. And, while this only reduced the estimated query cost by 40 (580 down to 540), we all know that the query cost is only one factor in what actually affects the overall run time of a query.

Many readers are probably mortified that there’s a SELECT DISTINCT in this query. I am too. That being said, I plucked the low-hanging fruit and managed to get a huge performance boost as a result – query execution time went from over 1 minute to 14 seconds. With an improvement like that, removing that distinct isn’t on the radar any more.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 219 items have been purified.