February 2010
Mon Tue Wed Thu Fri Sat Sun
« Jan   Mar »
1234567
891011121314
15161718192021
22232425262728

Month February 2010

Rounding to the Nearest X Minutes, the Lazy Way

A lot of people use calendar tables. I’ve blogged about it before. They’re incredibly helpful. Now, have you ever needed a table of minutes?

You’re probably asking, “Jeremiah, why the heck would I ever need a table of minutes?” Well, dear reader, I’m going to tell you that. Please stop interrupting.

Let’s say you have a report. This report shows the sum of sales per 5 minute increment. You could do a lot of trickery with math to make this report happen, doing things like this to get the 5 minute interval:

SELECT
    (DATEPART(mi, CAST('2009-01-01 00:01:00' as datetime))+4) / 5 * 5 AS [05]

It works, but it’s ugly as sin. I said to myself, “Self, there has to be a better way.” Turns out that there is a better way: table valued functions!

I created a table valued function to return 60 rows, one for each minute. It also rounds to the nearest 5, 10, 15, and 30 minute intervals. This makes it possible to change the reporting interval very easily by using a join. Check it out:

SELECT DATEPART(hh, s.SaleTime) AS TheHour,
  mt.Five AS NearestFive,
  SUM(s.SalesAmount) AS AmountSold
FROM dbo.Sales AS s
INNER JOIN dbo.MinutesTable() AS mt ON DATEPART(mi, s.SaleTime) = mt.[Minute]
WHERE -- something is true
GROUP BY DATEPART(hh, s.SaleTime), mt.Five

Why do I call this the lazy way to do this? Because now that I’ve written it once I never have to do it again. I can add new columns to dbo.MinutesTable() without having to worry about breaking anything or copying code incorrectly from one query to another.

Here’s the code to create the function:

CREATE FUNCTION dbo.MinutesTable ()
RETURNS @minutes TABLE (
  [Minute] TINYINT,
  Five TINYINT,
  Ten TINYINT,
  Fifteen TINYINT,
  Thirty TINYINT
)
AS BEGIN

INSERT INTO @minutes VALUES (0,60,60,60,60)
                            (1,5,10,15,30),
                            (2,5,10,15,30),
                            /* you get the gist of it */
                            (59,60,60,60,60);

RETURN ;

END

Update: Thanks to Brad Schulz this has gone for a single post about my laziness to an example of a refactoring you can make. If you ever see a table being generated like this, you can take a look at it and determine how you can change it into an inline select. The query optimizer is going to do something completely different for each plan. With the inline select, SQL Server is able to determine that there are 60 rows in our table and can build a much more efficient execution plan.

CREATE FUNCTION [dbo].[MinutesTable] ()
RETURNS TABLE
AS 

RETURN
  SELECT  0 AS [Minute], 60 AS Five, 50 AS Ten, 60 AS Fifteen, 60 AS Thirty
  UNION ALL
  SELECT  1, 5, 10, 15, 30
  UNION ALL
  /* ... this still makes sense, right? ... */
  UNION ALL
  SELECT  59, 60, 60, 60, 60 ;
GO

A Simple Refactoring – Functions in the WHERE Clause

Putting functions in the where clause of a SQL Statement can cause performance problems. If you aren’t careful, these problems can add up and bring a powerful production system to its knees.

For this example, I’m using the Northwind database, but you could do this on any database.

The first thing to do is put an index on the OrderDate column:

CREATE INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate) ;

Take a look at this first query:

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   DATEADD(MM, 2, o.OrderDate) < '1997-05-04' ;

Scanning the stars... or something.

This returns very quickly on the Northwind database because we have such a small volume of data, but a query like this could cause a lot of problems in production. Why? Well, it’s doing a full table scan. Don’t believe me? Take a look a the execution plan over there.

The database is forcing a table scan because of the use of the DATEADD function on an indexed column. SQL Server will have to compute the value of every OrderDate plus 2 months in the database and then compare it to May 4th, 1997.

What happens if we flip this around? What if we change our query so that we’re doing the math on our input value of May 4th, 1997?

SELECT  COUNT(*)
FROM    dbo.Orders AS o
WHERE   o.OrderDate < DATEADD(MM, -2, '1997-05-04') ;

Seek and ye shall find

It turns out that if we move the function from the table to the variable (although it’s really inlined in this case) we can get SQL Server to perform an index seek. This is going to be orders of magnitude faster than scanning every row in the table and comparing the values.

So, there you have it: another easy fix that can save you a lot of headache when you’re trying to tune queries.

Installing PostgreSQL on Mac OS X

This is a pretty simple process, but one that I thought I would document because I ran into a few gotchas along the way.

I originally installed OS X using the one click installer from EnterpriseDB. Unfortunately, the installer hung while attempting to finish the installation process and they only thing to do was to roll back the install. I attempted to build from macports, but that proved to be a huge pain in my ass and reminded me a little bit too much of using Linux, so I scrapped that idea as well. I started browsing the EnterpriseDB installation notes and came across an unattended install. Since my installation failed trying to launch the final interactive stage of the installation, I said to myself “Self, I bet we should try this.”

  1. Download PostgreSQL from EnterpriseDB http://www.enterprisedb.com/products/pgdownload.do#osx
  2. Mount the OS X disk image file
  3. In the Terminal, change to the directory where the image file was mounted:
    cd /Volumes/PostgreSQL\ 8.4.2-1/
  4. Start the unattended installation
    sudo ./postgresql-8.4.2-1-osx.app/Contents/MacOS/installbuilder.sh
    --mode unattended
  5. Once the installation has finished, add the following line to your .profile:
    source /Library/PostgreSQL/8.4/pg_env.sh
    If you’ve never created your .profile you can do the following in a terminal window:

    1. touch ~/.profile
      This will create the file if it doesn’t exist and will only change the last modification date if it does.
    2. open /Applications/TextEdit.app .profile
      This opens your .profile in the TextEdit application
  6. Save and close your .profile
  7. Create a new database user.
    createuser [YOUR NAME HERE] --pwprompt --username=postgres
    Alt text for the image, e.g. “The Mona Lisa”

    This could be anyone...

  8. When you’re prompted for the “Password:”, use “postgres”. This is the default password for the postgres superuser. Because you’ve just created another superuser, you’ll never have to touch this account again, but you should probably look up some fantastic documentation and learn how to change the password.

    I'm a real live boy!

There you have it: how to set up and install PostgreSQL on OS X. If you want to continue forward and install a Ruby library to interact with PostgreSQL, just type
sudo gem install postgres-pr to install the pure Ruby version PostgreSQL driver. If you want to install one of the C-based drivers, you’ll need to mess around with macports.

Note: EnterpriseDB make and distribute a binary version of PostgreSQL for a variety of platforms. In addition, they provide paid support and also offer a variety of enterprise features. They have in no way paid me for this post.


Related Posts:

PostgreSQL Tutorial – Creating Your First Database
PostgreSQL Tutorial – Creating the Pagila Schema
PostgreSQL Tutorial – Referring to Other Tables
PostgreSQL Tutorial – Inserting Data

Links for the Week – 2010.02.12

This is more of a “what I’ve been reading” rather than a link dump from previous week’s RSS feed.

A Plea for Plain English – Tony Davis’s “A Plea for Plain English” rings home with me. Far too much writing is full of heavy, pompous words used purely to make the author feel smarter. Joseph Conrad – one of the greatest writers of the English language – was not a native speaker. Yet he wrote with a simplicity, grace, and style that is still unequaled. While we all can’t be Joseph Conrad, we can all strive to write in clear, concise, readable prose. Technical writing doesn’t need to be dry, writing effective prose can be just as much an art form as creating a brilliant short story, novel, painting, photograph, or song.

I’ve been getting a lot more interested in mathematics. Not just how they related to computers, but also how mathematics relate to design. Design, art, and aesthetics are partially governed by universal principles. There are ratios that are more visually pleasing simply because we find them in nature. Being aware of these ratios helps us create effective designs that draw in the reader and hold them to the content.

B-movies. I watch a lot of old slasher, exploitation, b-movies.

Flash, iPad, Standards – Jeffrey Zeldman talks about why the total lack of Flash on the iPad is a good thing: it provides an incredibly compelling, public, reason for designers and developers to abandon proprietary formats like Flash and Silverlight and focus on open standards. Some people would argue that a lack of Flash would kill the device but on the flip side what can you do in Flash that people need to do that you can’t do with JavaScript, HTML, and CSS? Combine that with the in browser relational data storage that HTML5 provides and there is no reason to use a proprietary graphics engine apart from vector graphics bullshittery/professional masturbation.

A Simple Refactoring – Avoiding Table Scans

Refactoring SQL code can be pretty easy. Just like refactoring any other programming language, sometimes you have to look around to find the culprit.

We have a slow running query that frequently times out when run by the users. A quick look at the query told me what was wrong. I found this clause in the middle of an otherwise simple query:

        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUES'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_a
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUES'
                  ) AS t ON other.key = t.key
        LEFT JOIN (
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData
                   WHERE  FieldName = 'VALUESXY'
                   UNION ALL
                   SELECT DISTINCT
                          key,
                          column_b
                   FROM   ImportantData_History
                   WHERE  FieldName = 'VALUESXY'
                  ) AS t2 ON other.key = t2.key

The problem with this query is that we’re reading each table twice for the same data. I checked in the execution plan and, sure enough, SQL Server was performing two scans on the underlying data. After some careful thought I realized that I could accomplish the same thing with a single query:

        LEFT JOIN (
                   SELECT key,
                          VALUES AS column_a,
                          VALUESXY AS column_b
                   FROM   (
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                           UNION ALL
                           SELECT DISTINCT
                                  key,
                                  FieldName,
                                  CASE WHEN FieldName = 'VALUES'
                                       THEN column_a
                                       WHEN FieldName = 'VALUESXY'
                                       THEN column_b
                                  END AS content
                           FROM   ImportantData_History
                           WHERE  FieldName = 'VALUES'
                                  OR FieldName = 'VALUESXY'
                          ) AS source PIVOT( MAX(CONTENT) FOR FieldName IN ([VALUES],
                                                              [VALUESXY]) ) AS pvt
                  ) AS t ON other.key= t.key

The upside to this unreadable pile of junk is that it scans each table once and only once. This reduces the load on disk, the amount of data that needs to be read into memory, and it most likely reduces the amount of data stored in memory. And, while this only reduced the estimated query cost by 40 (580 down to 540), we all know that the query cost is only one factor in what actually affects the overall run time of a query.

Many readers are probably mortified that there’s a SELECT DISTINCT in this query. I am too. That being said, I plucked the low-hanging fruit and managed to get a huge performance boost as a result – query execution time went from over 1 minute to 14 seconds. With an improvement like that, removing that distinct isn’t on the radar any more.

When is a Lookup not a Lookup?

Execution plans are great things. They give us an insight into how SQL Server is putting together queries and why they run slowly.

One immediate thing I look for in an execution plan is a Key Lookup. In a Key Lookup operation, SQL Server has to reference the clustered index on the table because a value it’s looking for is not present in the index that was used to find the row.

Normally, when you look at the Key Lookup, you will see a list of output columns. This is a great way to help you modify your indexes and make sure that you can avoid these extra disk hits by sacrificing a tiny bit of storage space.

A Normal Output List from a Query Plan

Today, while troubleshooting a query that is never finishing, I ran across this gem.

Empty. Bereft of Meaning. Nothing.

There’s no output list in the Key Lookup. I, wrongly, thought that Key Lookups would always include an output list. What’s going on here?

Well, here’s the scoop. I have the following, sample, table:

CREATE TABLE dbo.ServiceList (
  ServiceListId BIGINT NOT NULL PRIMARY KEY,
  BillId BIGINT NOT NULL,
  ServiceMonth DATETIME
);

In this query, ServiceList is referenced through a CROSS APPLY’d inline function. The inline function references the BillId and ServiceListId. There’s an index on ServiceList that contains both of these columns:

CREATE INDEX IX_ServiceList_BillId ON dbo.ServiceList
(
  BillId
)
INCLUDE (ServiceListId, ServiceMonth);

This index is correctly being used by the query optimizer but in the past it’s only been used to perform lookups and push data to screen (or somewhere). So, what we’re seeing here makes sense. The CROSS APPLY needs to reference the BillId so the compiler checks the situation out and decides to use IX_ServiceList_BillId in order get the BillId. Since the index only includes ServiceListId, it has to perform a Key Lookup.

Protip: included data is only included in the index; it is not indexed.

To avoid this kind of behavior, I need to change this index to actually index on the combination of BillId and ServiceListId like so:

CREATE INDEX IX_ServiceList_BillId ON dbo.ServiceList
(
  BillId,
  ServiceListId
)
INCLUDE (ServiceMonth);

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 230 items have been purified.