Category SQL

Using Indexed Views to Replace Triggers

Let’s get this out of the way: I think triggers are cool. I also think triggers have their place, they just need to be used with care. One reason for this is that triggers are synchronous. That is – trigger actions will block a command from returning to the client until the the trigger’s actions have completed.

What’s the solution? Do we run summary jobs on a regular basis? Maybe.

Sometimes summary jobs aren’t enough. Sometimes we have to update summarized data in real time. Let’s break out AdventureWorks and look at an example.

Let’s say we have a report that shows us the total line item sales per customer broken out by year and month.

That’s pretty simple, right?

SELECT  p.BusinessEntityID ,
        p.FirstName ,
        p.LastName ,
        DATEPART(yyyy, soh.OrderDate) AS [Year] ,
        DATEPART(mm, soh.OrderDate) AS [Month] ,
        SUM(sod.LineTotal) AS LineItemTotal
FROM    Person.Person AS p
        INNER JOIN Sales.Customer AS c ON c.PersonID = p.BusinessEntityID
        INNER JOIN Sales.SalesOrderHeader AS soh ON c.CustomerID = soh.CustomerID
        INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
GROUP BY p.BusinessEntityID ,
        p.FirstName ,
        p.LastName ,
        DATEPART(yyyy, soh.OrderDate) ,
        DATEPART(mm, soh.OrderDate)
ORDER BY p.BusinessEntityID ,
        [Year] ,
        [Month] ;

The problem is that this query may use a lot of CPU or disk resources. How can we avoid that? Earlier I mentioned using summary jobs. The problem is that the summary jobs will only run on a schedule and require that we maintain some kind of maintenance history of each row – even if that maintenance history is just a modification timestamp. Wouldn’t it be easier if our summary report just updated itself automagically?

It can! Enter the indexed view.

An indexed view is a regular view that has a clustered index on it. By adding the clustered index the view is persisted to disk. The upside of persisting the indexed view to disk is that there’s an automatic mechanism in place that will update the indexed view. An indexed view must have a unique clustered index so that SQL Server can identify the materialized row that must be updated in response to changes in the underlying data.

CREATE VIEW dbo.my_indexed_view
WITH SCHEMABINDING
AS
  SELECT  p.BusinessEntityID,
          p.FirstName,
          p.LastName,
          DATEPART(yyyy, soh.OrderDate) AS [Year],
          DATEPART(mm, soh.OrderDate) AS [Month],
          SUM(sod.LineTotal) AS LineItemTotal,
          COUNT_BIG(*) AS NumberOfRecords
  FROM    Person.Person AS p
          INNER JOIN Sales.Customer AS c ON c.PersonID = p.BusinessEntityID
          INNER JOIN Sales.SalesOrderHeader AS soh ON c.CustomerID = soh.CustomerID
          INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
  GROUP BY p.BusinessEntityID,
          p.FirstName,
          p.LastName,
          DATEPART(yyyy, soh.OrderDate),
          DATEPART(mm, soh.OrderDate) ;
GO

CREATE UNIQUE CLUSTERED INDEX CX_my_indexed_view ON dbo.my_indexed_view (BusinessEntityID, [Year], [Month]) ;

We could have accomplished this using a summary table and triggers. Triggers are synchronous and updating the summary table will block our data changes from completing. In addition, if we ever want to bypass the trigger from updating (say during a huge data load) we need to disable and then enable the trigger again and then remember to update the summary table. With an indexed view we can just drop the clustered index and then recreate it when we’re done with our load processes.

Note: I can’t take full credit for this question and explanation. A SQL Saturday #42 attendee sent me an email with a question about using indexed views instead of triggers.

Free Training – SQL Saturday 42

Good news! I’m speaking at SQL Saturday 42 this Saturday. Got nothing to do? Head on down to Goodwill Columbus at 1331 Edgehill Rd, Columbus, OH. Got something to do? Cancel it.

I’m excited about the presentations I’m giving – I haven’t given the indexing presentation in a long time and it should be a lot of fun. And the Dynamic SQL presentation is one of the first in the day. It’s a nice easy way (I think) to get your day started.

A Dynamic World Demands Dynamic SQL

Dynamic SQL is a misunderstood and much maligned part of a DBA’s tool kit – it can be used to solve difficult business problems, respond to diverse data needs, and alleviate performance problems. Many DBAs reject dynamic SQL outright as a potential source of SQL injections, being poorly performing, or just for being a hacky solution in general. Not so! Jeremiah Peschka has been making extensive use of dynamic SQL throughout his career to solve a variety of problems. He’ll set about dispelling these misconceptions and demonstrate how dynamic SQL can become a part of every DBA’s tool kit.

Indexes And Other Free Performance Boosts

The database is often viewed as a major performance bottleneck. There are a number of quick, easy, painless techniques that can increase the performance of an application not just by a small amount, but by orders of magnitude. These techniques includes simple indexing techniques, T-SQL techniques, and general database application design patterns that give great gains in performance. In this session, you will learn how to look at a database to identify these problem areas and how to resolve common issues that you will encounter.

What else?

Let’s say you’re interested in something else. What should you go see? Well, Michael Swart (blog | twitter) put together a nice little blog post about How I plan to spend my weekend.

If you like business intelligence, I suggest you hit up Dave Rodabaugh’s presentations. I cannot speak highly enough of Dave’s work. Not only is he one of the brightest BI people I know, he’s also been a teacher, friend, and mentor to me for a long time.

There will be some kind of dinner/drinks/whatever going on afterwards at Barley’s Smokehouse (map). I plan on being there for a little bit. Even if you can’t make it to the event, head on over there around 6:00PM. I’ll be there.

Shrink, Damn’d Log! Shrink, I Say!

Ever have a database log file grow to epic proportions on the weekend? I have. Ever forget to set up job monitoring on that server? I have. Ever get so pissed off that you decided to write a job to automatically shrink the log files whenever they grow? I have.

How I Analyzed the PASS 2010 Summit Submissions

Want to know how I analyzed the Summit session data? I exported the data from the abstract selection system into a tab delimited file. Since I use a Mac at home, I used the tools available to me: PostgreSQL.

I loaded the data using the PostgreSQL COPY command to bulk load the report output and then did some data mojo using PostgreSQL. Most of it was exactly the same as it would be for SQL Server with a few differences. Here’s the code that I used:

SELECT  COUNT(*) AS total_sessions
FROM    abstracts;
GO
-- Yup, I used the GO statement separator. This is a byproduct
-- of the client that I use, not because of PostgreSQL.

SELECT  COUNT(DISTINCT first_name || ' ' || last_name) AS name_count
FROM    abstracts ;
GO

SELECT  job_title,
        COUNT(job_title) AS jt_count
FROM    ( SELECT  job_title,
                  first_name,
                  last_name
          FROM    abstracts
          GROUP BY job_title,
                  first_name,
                  last_name
) AS x
GROUP BY job_title
ORDER BY jt_count DESC;
GO

WITH c AS (
  SELECT  job_title,
          first_name,
          last_name
  FROM    abstracts
  GROUP BY job_title,
          first_name,
          last_name
)
SELECT  SUM(CASE WHEN LOWER(job_title) LIKE '%engineer%' THEN 1 ELSE 0 END) AS engineer_count,
        SUM(CASE WHEN LOWER(job_title) LIKE '%manager%' THEN 1 ELSE 0 END) AS manager_count,
        SUM(CASE WHEN LOWER(job_title) LIKE '%developer%' THEN 1 ELSE 0 END) AS developer_count,
        SUM(CASE WHEN LOWER(job_title) LIKE '%consultant%' THEN 1 ELSE 0 END) AS consultant_count,
        SUM(CASE WHEN LOWER(job_title) LIKE '%business intelligence%' THEN 1
                 WHEN LOWER(job_title) LIKE '%b.i.%' THEN 1
                 WHEN LOWER(job_title) LIKE '%BI%' THEN 1
                 ELSE 0 END) AS bi_count,
        SUM(CASE WHEN LOWER(job_title) LIKE '%architect%' THEN 1 ELSE 0 END) AS architect_count
FROM    c;
GO

SELECT  COUNT(DISTINCT company) AS company_count
FROM    abstracts;
GO

SELECT  DISTINCT
        category,
        track,
        COUNT(track) OVER (PARTITION BY category) AS the_count_by_category,
        COUNT(track) OVER (PARTITION BY category, track) AS the_count_by_track
FROM    abstracts
ORDER BY category, track;
GO

SELECT  AVG(author_count) AS author_average
FROM    ( SELECT  DISTINCT
                  first_name,
                  last_name,
                  COUNT(session_title) OVER (PARTITION BY first_name, last_name) AS author_count
          FROM   abstracts
        ) AS x;
GO

-- You might be weirded out by this string_to_array and array_upper.
-- PostgreSQL has an array data type. We're using an array of strings
-- to get an accurate word count
SELECT  AVG(CHAR_LENGTH(abstract)) AS avg_char_count,
        STDDEV(CHAR_LENGTH(abstract)) AS char_count_stddev,
        AVG(array_upper(string_to_array(abstract, ' '), 1)) AS avg_word_count,
        STDDEV(array_upper(string_to_array(abstract, ' '), 1)) AS word_count_stddev
FROM    abstracts;
GO

-- Hey, look at that, we can use a single TRIM function to do all work
SELECT  SUM(CASE WHEN TRIM(BOTH FROM presented_before) = 'YES' THEN 1 ELSE 0 END) AS presented_before,
        SUM(CASE WHEN TRIM(BOTH FROM presented_before) = 'NO' THEN 1 ELSE 0 END) AS never_presented
FROM    abstracts;
GO

SELECT  level,
        COUNT(*)
FROM    abstracts
GROUP BY level
ORDER BY level;
GO

Push Ups and String Concatenation

SQL sucks at string manipulation. It’s not just SQL Server, it’s the SQL language in general. I’ve mentioned this before on Stack Overflow in answer to the question “What are five things you hate about your favorite language?” It’s almost easier to return the raw rows to application code than it is to work with them inside of a database. There’s one place where SQL Server has made it easier to work with string data – string building.

Encrypted Stored Procedures and Their Effect on my Rug

This is a letter, or a rant, to any ISVs in the world who encrypt their stored procedures. It is, by no means, a condemnation of this horrible feature of SQL Server. The feature condemns itself.

I saw something in a blog the other day (that I wish I could find it again) where the author essentially said that encrypting stored procedures is an absolute necessity. I couldn’t help but incredulously think “Really? What part of your code is so cunning that I can’t see it?”

Yes, I understand that someone might comment here and say “Well, certain encryption algorithms might need to be encrypted in stored procedures so that people don’t discover and break them.” To which I must respond: grow up. Nobody is writing encryption code in T-SQL. If they are, I don’t want to read it.

The fact is, Mr. ISV, your ideas are not so precious or original. I don’t have a degree in Computer Science, but I do know the basics of algorithms and design patterns. There are only so many ways to skin a particular cat. I can watch the behavior of my SQL Server and figure out just how craptacular your code is and make a pretty educated guess that you’re using cursors written by someone whose understanding of programming concepts stopped when VB was a threat to PowerBuilder.

Please, for all of us, just stop. Your insistence that your secret sauce is important is laughable. Nobody cares. Your competitors don’t care. They’re either too busy catching up or else too busy lighting cigars with hundred dollar bills to care. Nobody that I work with is going to spend their free time reverse engineering your software so we can stop paying maintenance fees – we pay you because paying you is cheaper than doing this work in house.

When you encrypt your stored procedures, I can’t help you make it better. It’s like you only come home drunk, throw microwave burritos at the cat, crap on the rug, and then leave. There’s no possibility for discourse about what you could do better. I don’t want to talk about what you’re doing wrong, I want to make this deal sweet for both of us – I want to make my system faster and help you make your other customers’ systems faster.

Please, for the sake of me and your other customers, stop shitting on my rug.

Rounding to the Nearest X Minutes, the Lazy Way

A lot of people use calendar tables. I’ve blogged about it before. They’re incredibly helpful. Now, have you ever needed a table of minutes?

You’re probably asking, “Jeremiah, why the heck would I ever need a table of minutes?” Well, dear reader, I’m going to tell you that. Please stop interrupting.

Let’s say you have a report. This report shows the sum of sales per 5 minute increment. You could do a lot of trickery with math to make this report happen, doing things like this to get the 5 minute interval:

SELECT
    (DATEPART(mi, CAST('2009-01-01 00:01:00' as datetime))+4) / 5 * 5 AS [05]

It works, but it’s ugly as sin. I said to myself, “Self, there has to be a better way.” Turns out that there is a better way: table valued functions!

I created a table valued function to return 60 rows, one for each minute. It also rounds to the nearest 5, 10, 15, and 30 minute intervals. This makes it possible to change the reporting interval very easily by using a join. Check it out:

SELECT DATEPART(hh, s.SaleTime) AS TheHour,
  mt.Five AS NearestFive,
  SUM(s.SalesAmount) AS AmountSold
FROM dbo.Sales AS s
INNER JOIN dbo.MinutesTable() AS mt ON DATEPART(mi, s.SaleTime) = mt.[Minute]
WHERE -- something is true
GROUP BY DATEPART(hh, s.SaleTime), mt.Five

Why do I call this the lazy way to do this? Because now that I’ve written it once I never have to do it again. I can add new columns to dbo.MinutesTable() without having to worry about breaking anything or copying code incorrectly from one query to another.

Here’s the code to create the function:

CREATE FUNCTION dbo.MinutesTable ()
RETURNS @minutes TABLE (
  [Minute] TINYINT,
  Five TINYINT,
  Ten TINYINT,
  Fifteen TINYINT,
  Thirty TINYINT
)
AS BEGIN

INSERT INTO @minutes VALUES (0,60,60,60,60)
                            (1,5,10,15,30),
                            (2,5,10,15,30),
                            /* you get the gist of it */
                            (59,60,60,60,60);

RETURN ;

END

Update: Thanks to Brad Schulz this has gone for a single post about my laziness to an example of a refactoring you can make. If you ever see a table being generated like this, you can take a look at it and determine how you can change it into an inline select. The query optimizer is going to do something completely different for each plan. With the inline select, SQL Server is able to determine that there are 60 rows in our table and can build a much more efficient execution plan.

CREATE FUNCTION [dbo].[MinutesTable] ()
RETURNS TABLE
AS 

RETURN
  SELECT  0 AS [Minute], 60 AS Five, 50 AS Ten, 60 AS Fifteen, 60 AS Thirty
  UNION ALL
  SELECT  1, 5, 10, 15, 30
  UNION ALL
  /* ... this still makes sense, right? ... */
  UNION ALL
  SELECT  59, 60, 60, 60, 60 ;
GO

A Simple Refactoring – Functions in the WHERE Clause

Putting functions in the where clause of a SQL Statement can cause performance problems. If you aren’t careful, these problems can add up and bring a powerful production system to its knees.

For this example, I’m using the Northwind database, but you could do this on any database.

The first thing to do is put an index on the OrderDate column:

CREATE INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate) ;

Take a look at this first query:

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   DATEADD(MM, 2, o.OrderDate) < '1997-05-04' ;

Scanning the stars... or something.

This returns very quickly on the Northwind database because we have such a small volume of data, but a query like this could cause a lot of problems in production. Why? Well, it’s doing a full table scan. Don’t believe me? Take a look a the execution plan over there.

The database is forcing a table scan because of the use of the DATEADD function on an indexed column. SQL Server will have to compute the value of every OrderDate plus 2 months in the database and then compare it to May 4th, 1997.

What happens if we flip this around? What if we change our query so that we’re doing the math on our input value of May 4th, 1997?

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   o.OrderDate < DATEADD(MM, -2, '1997-05-04') ;

Seek and ye shall find

It turns out that if we move the function from the table to the variable (although it’s really inlined in this case) we can get SQL Server to perform an index seek. This is going to be orders of magnitude faster than scanning every row in the table and comparing the values.

So, there you have it: another easy fix that can save you a lot of headache when you’re trying to tune queries.

Installing PostgreSQL on Mac OS X

This is a pretty simple process, but one that I thought I would document because I ran into a few gotchas along the way.

I originally installed OS X using the one click installer from EnterpriseDB. Unfortunately, the installer hung while attempting to finish the installation process and they only thing to do was to roll back the install. I attempted to build from macports, but that proved to be a huge pain in my ass and reminded me a little bit too much of using Linux, so I scrapped that idea as well. I started browsing the EnterpriseDB installation notes and came across an unattended install. Since my installation failed trying to launch the final interactive stage of the installation, I said to myself “Self, I bet we should try this.”

  1. Download PostgreSQL from EnterpriseDB http://www.enterprisedb.com/products/pgdownload.do#osx
  2. Mount the OS X disk image file
  3. In the Terminal, change to the directory where the image file was mounted:
    cd /Volumes/PostgreSQL\ 8.4.2-1/
  4. Start the unattended installation
    sudo ./postgresql-8.4.2-1-osx.app/Contents/MacOS/installbuilder.sh
    --mode unattended
  5. Once the installation has finished, add the following line to your .profile:
    source /Library/PostgreSQL/8.4/pg_env.sh
    If you’ve never created your .profile you can do the following in a terminal window:

    1. touch ~/.profile
      This will create the file if it doesn’t exist and will only change the last modification date if it does.
    2. open /Applications/TextEdit.app .profile
      This opens your .profile in the TextEdit application
  6. Save and close your .profile
  7. Create a new database user.
    createuser [YOUR NAME HERE] --pwprompt --username=postgres
    Alt text for the image, e.g. “The Mona Lisa”

    This could be anyone...

  8. When you’re prompted for the “Password:”, use “postgres”. This is the default password for the postgres superuser. Because you’ve just created another superuser, you’ll never have to touch this account again, but you should probably look up some fantastic documentation and learn how to change the password.

    I'm a real live boy!

There you have it: how to set up and install PostgreSQL on OS X. If you want to continue forward and install a Ruby library to interact with PostgreSQL, just type
sudo gem install postgres-pr to install the pure Ruby version PostgreSQL driver. If you want to install one of the C-based drivers, you’ll need to mess around with macports.

Note: EnterpriseDB make and distribute a binary version of PostgreSQL for a variety of platforms. In addition, they provide paid support and also offer a variety of enterprise features. They have in no way paid me for this post.


Related Posts:

PostgreSQL Tutorial – Creating Your First Database
PostgreSQL Tutorial – Creating the Pagila Schema
PostgreSQL Tutorial – Referring to Other Tables
PostgreSQL Tutorial – Inserting Data

A Simple Refactoring – Avoiding Table Scans

Refactoring SQL code can be pretty easy. Just like refactoring any other programming language, sometimes you have to look around to find the culprit.

We have a slow running query that frequently times out when run by the users. A quick look at the query told me what was wrong. I found this clause in the middle of an otherwise simple query:

        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUES'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUES'
                  ) AS t ON other.key = t.key
        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUESXY'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUESXY'
                  ) AS t2 ON other.key = t2.key

The problem with this query is that we’re reading each table twice for the same data. I checked in the execution plan and, sure enough, SQL Server was performing two scans on the underlying data. After some careful thought I realized that I could accomplish the same thing with a single query:

        LEFT JOIN (
                   SELECT key,
                          VALUES AS column_a,
                          VALUESXY AS column_b
                   FROM   (
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                           UNION ALL
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData_History
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                          ) AS source PIVOT( MAX(CONTENT) FOR FieldName IN ([VALUES],
                                                              [VALUESXY]) ) AS pvt
                  ) AS t ON other.key= t.key

The upside to this unreadable pile of junk is that it scans each table once and only once. This reduces the load on disk, the amount of data that needs to be read into memory, and it most likely reduces the amount of data stored in memory. And, while this only reduced the estimated query cost by 40 (580 down to 540), we all know that the query cost is only one factor in what actually affects the overall run time of a query.

Many readers are probably mortified that there’s a SELECT DISTINCT in this query. I am too. That being said, I plucked the low-hanging fruit and managed to get a huge performance boost as a result – query execution time went from over 1 minute to 14 seconds. With an improvement like that, removing that distinct isn’t on the radar any more.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 219 items have been purified.