Category SQL Server

T-SQL Tuesday – Indexes

arbitrary logo because a world without rules is a world with fun

When Michael J Swart asked me to take part in T-SQL Tuesday #10 – Indexes, I was incredibly flattered. Nobody’s ever asked me to do anything since a cop asked me to stop doing that one thing (speeding). I had to say yes. Here’s my contribution to T-SQL Tuesday #10.

An index is a horological lever-like something or other.

WTF is an index?

My 1967 Children’s World Book Encyclopedia doesn’t provide a definition for an index. That’s probably because it’s an encyclopedia and not a dictionary Thanks to the local gypsy garage sale going on in my neighborhood, I was able to acquire a dictionary from 1934 which defines an index as “a leverlike regulator for a hairspring.” Apparently this has something to do with clocks. If you try to verify this, you’ll notice that it’s a horological term. Apparently that has something to do with clocks and not adult entertainment.

Obviously neither my Children’s World Book Encyclopedia nor my Book of Learnin’ and Such was going to help me define an index.

In the spirit of true science, I realized that I would have to make something up. Or do my homework. One of the two.

SQL Server and a Drought of Options

SQL Server doesn’t have a wealth of indexing options available. In fact, there are fundamentally only two options – indexed and STFU. Sure, there are multiple indexing techniques, but this is my story.

At the core of SQL Server’s indexing strategy is the self-balancing binary search tree or, for those who don’t like to sound like pompous assfaces, the b-tree. B-trees are full of special magic. Well, not really. A b-tree is based on an algorithmic model that strives to keep the height of the search tree (the number of nodes between the root and the leaf nodes) as small as possible. This means that finding any single piece of information should be equally as cheap/expensive as finding any other piece of information.

That sounds great, right? Well, one of the fundamental indexing problems is that the data in the index has to be kept up to date as the data in the table changes. You can’t have data flying around willy nilly.

This is where we get to start sub-typing our b-trees.

Oh, So There ARE Options

Yes, I lied to you. Sort of. SQL Server has multiple types of indexes, just not multiple… types of index. There’s no way to control the indexing mechanism that SQL Server uses, you’re stuck with the pompous assfacery of self-balancing binary search trees (it’s not really a bad thing).

So, getting back to the topic at hand, in SQL Server we do have options. Two of them: clustered and non-clustered indexes.

Cluster This!

A clustered index is an ordered collection of nuts stored for winter and categorized appropriately… or data. One of the two. When we create a clustered index, all of the nuts/data is sorted and written to disk in the appropriate logical sort order. From this point forward, no physical order is guaranteed. Once the clustered index is created and the data is placed in physical order all bets are off. The data will be stored in logical order, but there is a chance that the data may not be physically in order due to the vagaries of disk access, deleting rows and/or tables and other assorted whatnot. (I think the horologists might have something to do with it.)

What goes in a clustered index? Everything. I told you the data was copied in. Didn’t I? Well, it is. Our clustered index is all of the data in the table and it’s in order based on the clustering key value.

The choice of the b-tree algorithm actually has some implications for how we choose the index columns for the clustered index. (I’m going to call these columns the clustering key. That’s what they are. Deal with it.) Because we’re using a B+ tree, instead of another data structure, to store and sort indexed data we have to take a few things into consideration. B+ trees do really well when every indexed value is unique. This is by design – it’s easier to look up any given row if you know that there is only one row that corresponds to that key. If there’s more than one, things get tricky and I’m not going to talk about that.

B+ trees have other characteristics that make them optimal for use in databases – it’s really easy to find any single element in the tree (even compared to other binary tree mechanisms) and they’re wonderfully optimized for read access when data is being read in sequence across ranges of data.

Searching Is Good

Sometimes you won’t know the key to reference a particular piece of data. Sometimes you’ll want to pull back records for everyone who lives in Tuscaloosa, AL. If you haven’t ordered your data by city and state, this is going to be really difficult to find without scanning the entire table. How do we get that data back?

More indexes!

In addition to clustered indexes, which define how the data should be ordered on disk, we also have non-clustered indexes. We can use these guys to help speed up our search. These are really just supplementary b-trees that point back to the clustered b-tree. You do have a clustered index, right?

That’s it?

No. Of course it can get more complicated than all of this, you can have composite indexes with multiple columns, non-unique indexes, indexes with computed columns, and even full-text indexes. Unfortunately the hookers and their lever actuated hairsprings are not going to let me keep talking.

In case you’re wondering what would happen to your database if you didn’t have any indexes, take a look at this:

i can't find any indexes!

Things I Read This Week – 2010.08.03

Just in case you want to live vicariously through my reading choices, here are some of the things that I found interesting this week when I should have been doing my job.

Data

An Illustrated Guide to the PostgreSQL Buffer Cache – Just in case you were wondering how a buffer cache works in a database. Because, you know… it’s cool. For you SQL Server people out there, this is conceptually identical to how buffer pool scans work in SQL Server. I suspect this is the same everywhere and probably based on an obscure academic paper.

Inside the Optimizer: Plan Costing – The SQL Server optimizer is an interesting creature. You can’t see what it’s doing directly because it’s full of patents and secrets (and probably dragons). Paul White (blog | twitter) has been doing a phenomenal job recently of taking apart the optimizer step by step and showing how it works.

The problems with ACID and how to fix them without going NoSQL – You know all of those reasons your developers are giving you to abandon SQL Server for a NoSQL database? This guy claims to have the answer. It boils down to using stronger ACID compliance and sounds suspiciously like Postgre-XC. If you don’t want to read several lengthy academic papers (really? you don’t want to?), the basic premise is that we should use stronger ACID controls by strengthening isolation levels and making the database deterministic such that the outcome of three transactions (A, B, and C) on all nodes will be the same regardless of execution order.

Using Riak’s map/reduce for sorting – Many NoSQL databases use something similar to MapReduce instead of SQL. The magic of map/reduce isn’t always apparent (it isn’t always apparent to me, that’s for sure). This article gives a well-commented example of the code you would use to write an ORDER BY date DESC query.

Configuring Mongo Replica Sets – Kristina Chodorow covers how you would set up Replica Sets to eliminate any single point of failure in MongoDB. Good to know ;)

10 things you should know about about NoSQL databases

Code

RubyDoc.info – I haven’t really been reading this one so much as I am amazed by it. RubyDoc.info combs the Ruby source code in a few locations and dynamically generates up to the second documentation based on the current stable versions of libraries. Go go magic clouds!

TextMate’s Missing Drawer – Mac fiends, take note! I’ve always thought TextMate’s project drawer was a bit lacking. This really improves on the built in drawer and integrates it a lot better into the application. (Found via The Hidden Magic of TextMate.)

Other Stuff

The Importance of a Mentor – My good friend, and long time unknowing mentor, Jonathan Kehayias talks about the importance of having a mentor. Take some time and thank the people who have mentored you throughout your career.

Finding SQL Agent Jobs Owned By The Wrong Person

Ever have someone leave the company only to find out that they own critical database processes… because that process failed? It hasn’t happened to me, yet, but a recent Active Directory outage this weekend got me thinking about it.

Kendra Little at CBusPASS

Hey, good news! Kendra Little (blog | twitter) will be presenting for us, remotely, on August 12th at 6:30 PM. The meeting is at the usual place, Battelle for Kids – 1160 Dublin Rd Suite 500, Columbus, OH 43215.

What is Kendra going to be talking about?

Stay Agile, Stay Sane

Agile software development emphasizes continuous depolyment and its methods do not directly include long term planning. DBAs must ensure data integrity and have a long term view for application scale, so Agile methods present challenges. Come learn about successful real-world practices iteratively developed in a high transaction internet service environment over the last five years. We have create a flourishing Agile shop while meeting high requirements for uptime, customer response, and data consistency. In this session we’ll cover key habits for success, practices to avoid, how and when to get started, and why Agile development can be a great thing for DBAs. Topics will also include how “Testing in Production” can be a huge benefit.

About Kendra

Kendra Little is a Senior DBA in the online advertising industry who has spent ten years nerding out on SQL Server. Kendra works closely with an Agile development team to deploy frequent incremental changes to scale and improve a busy production environment. She likes tuning production servers, developing tools to automate tasks, building SQL Reports for trending, and secretly enjoys writing troubleshooting guides and documentation. Kendra has a Masters degree in Philosophy and a salt shaker full of certifications, but all the best stuff she’s ever learned has come from her smarty-pants colleagues and the SQL Server community. Read her blog at http://littlekendra.com.

Finding Cross-Database Dependencies

Ever want to know how many queries are referring to other databases on a server? How about a different server?

Worry no more! I have a query that will help you answer these pesky questions:

SELECT  DB_NAME() AS current_db_name,
        OBJECT_NAME(referencing_id) AS o_name ,
        UPPER(COALESCE(sed.referenced_server_name, '')) AS referenced_server_name,
        sed.referenced_database_name,
        sed.referenced_schema_name,
        sed.referenced_entity_name
FROM    sys.sql_expression_dependencies AS sed
WHERE   referenced_database_name <> DB_NAME()
        AND referenced_database_name <> 'msdb'
ORDER BY UPPER(referenced_server_name) ;

Want to run it across every database on your server? We got that, too.

DECLARE @command AS NVARCHAR(MAX);

SET @command = 'USE ?;

SELECT  DB_NAME() AS current_db_name,
        OBJECT_NAME(referencing_id) AS o_name ,
        UPPER(COALESCE(sed.referenced_server_name, '''')) AS referenced_server_name,
        sed.referenced_database_name,
        sed.referenced_schema_name,
        sed.referenced_entity_name
FROM    sys.sql_expression_dependencies AS sed
WHERE   referenced_database_name <> DB_NAME()
        AND referenced_database_name <> ''msdb''
ORDER BY UPPER(referenced_server_name) ;'

EXEC sys.sp_MSforeachdb @command1 = @command

Update: This only applies to SQL Server 2008. Nothing else. Thanks to an astute reader I have been corrected. Now we’re all smarter. Thanks, John. :)

Revisiting Lady MacBeth and Her Torturous Lies

A while back, I wrote up a genius piece of code that would automatically shrink my log files whenever they grew.

Kendra Little (blog | twitter) completely called me out for my horrible, sneaky, developer ways. Ostensibly, I had found a solution for my rampant log growth problem. Unfortunately, I had cured the symptom and not the underlying issue. After growing tired of her savage abuse and criticism via gtalk, I looked for the source of the problem. No, not me. The other source of the problem.

I set up monitoring on the server in question, waited for the appropriate log death window, and then read my report. Before you think I’m using fancy tools that nobody can afford, I set up profiler and perfmon and then merged the results together.

The reports from the single server showed me… nothing, really. There was a lot of I/O and a backup job overlapped with a re-index by about 2 minutes. The logs also didn’t fill up. To be on the safe side, I adjusted the jobs and then sat around making frowny faces for a few minutes. Then I remembered that all of the servers are connected to the same SAN, so set up monitoring on the remaining production servers. An I/O issue on one server could start causing problems on all of the other servers.

This time around, the logs filled up, I received a ton of emails, and I also found out something important: all of my backups and re-indexing operations were running at the same time. My SAN was saturated on I/O throughput which was causing a the backup and re-indexing jobs to run slowly.

To solve the problem I looked at the average job run times and arranged the jobs so that they had much more downtime between them (to account for other issues that could slow down the jobs). This took a bit more effort than I thought just because of SLAs within the company. I also re-wrote the jobs so that the backups and re-indexes could never run at the same time and would, instead, occur in series. Once I had this change in place I waited and watched.

Sure enough, the incredible ever growing log file problem stopped happening (unless I do something dumb like move 30,000,000 rows of data). Moral of the story: make sure that you’re addressing the cause of the problem and not the symptoms.

TFS and You

How many of you are using Visual Studio 2010? How many of you are writing SSIS packages? How many of you are pissed off, irritated, or confused about how you’ll be able to write and version control your SSIS packages (VS 2008) once you’ve upgraded to TFS 2010?

Confused yet? Yeah, so am I.

Let’s backtrack: at work we recently upgraded to VS 2010 and TFS 2010 – this lets everyone on my team take advantage of the database developer edition as well as add a whole slew of automated testing features and doodads. Plus our management can tie in to our tasks

One of the problems that we ran into is connecting to TFS2010 from VS2008. Why would I need to do that? Visual Studio 2010 doesn’t have support for the SQL Server 2008 BI project types yet (SSRS, SSIS, etc). So, in order to work on SSIS packages, I need to keep VS 2008 around.

It’s pretty easy to connect Visual Studio 2008 to TFS 2010. Download this hotfix, install it, and then you’re up and running. That’s really all there is to it.

Using Indexed Views to Replace Triggers

Let’s get this out of the way: I think triggers are cool. I also think triggers have their place, they just need to be used with care. One reason for this is that triggers are synchronous. That is – trigger actions will block a command from returning to the client until the the trigger’s actions have completed.

What’s the solution? Do we run summary jobs on a regular basis? Maybe.

Sometimes summary jobs aren’t enough. Sometimes we have to update summarized data in real time. Let’s break out AdventureWorks and look at an example.

Let’s say we have a report that shows us the total line item sales per customer broken out by year and month.

That’s pretty simple, right?

SELECT  p.BusinessEntityID ,
        p.FirstName ,
        p.LastName ,
        DATEPART(yyyy, soh.OrderDate) AS [Year] ,
        DATEPART(mm, soh.OrderDate) AS [Month] ,
        SUM(sod.LineTotal) AS LineItemTotal
FROM    Person.Person AS p
        INNER JOIN Sales.Customer AS c ON c.PersonID = p.BusinessEntityID
        INNER JOIN Sales.SalesOrderHeader AS soh ON c.CustomerID = soh.CustomerID
        INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
GROUP BY p.BusinessEntityID ,
        p.FirstName ,
        p.LastName ,
        DATEPART(yyyy, soh.OrderDate) ,
        DATEPART(mm, soh.OrderDate)
ORDER BY p.BusinessEntityID ,
        [Year] ,
        [Month] ;

The problem is that this query may use a lot of CPU or disk resources. How can we avoid that? Earlier I mentioned using summary jobs. The problem is that the summary jobs will only run on a schedule and require that we maintain some kind of maintenance history of each row – even if that maintenance history is just a modification timestamp. Wouldn’t it be easier if our summary report just updated itself automagically?

It can! Enter the indexed view.

An indexed view is a regular view that has a clustered index on it. By adding the clustered index the view is persisted to disk. The upside of persisting the indexed view to disk is that there’s an automatic mechanism in place that will update the indexed view. An indexed view must have a unique clustered index so that SQL Server can identify the materialized row that must be updated in response to changes in the underlying data.

CREATE VIEW dbo.my_indexed_view
WITH SCHEMABINDING
AS
  SELECT  p.BusinessEntityID,
          p.FirstName,
          p.LastName,
          DATEPART(yyyy, soh.OrderDate) AS [Year],
          DATEPART(mm, soh.OrderDate) AS [Month],
          SUM(sod.LineTotal) AS LineItemTotal,
          COUNT_BIG(*) AS NumberOfRecords
  FROM    Person.Person AS p
          INNER JOIN Sales.Customer AS c ON c.PersonID = p.BusinessEntityID
          INNER JOIN Sales.SalesOrderHeader AS soh ON c.CustomerID = soh.CustomerID
          INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
  GROUP BY p.BusinessEntityID,
          p.FirstName,
          p.LastName,
          DATEPART(yyyy, soh.OrderDate),
          DATEPART(mm, soh.OrderDate) ;
GO

CREATE UNIQUE CLUSTERED INDEX CX_my_indexed_view ON dbo.my_indexed_view (BusinessEntityID, [Year], [Month]) ;

We could have accomplished this using a summary table and triggers. Triggers are synchronous and updating the summary table will block our data changes from completing. In addition, if we ever want to bypass the trigger from updating (say during a huge data load) we need to disable and then enable the trigger again and then remember to update the summary table. With an indexed view we can just drop the clustered index and then recreate it when we’re done with our load processes.

Note: I can’t take full credit for this question and explanation. A SQL Saturday #42 attendee sent me an email with a question about using indexed views instead of triggers.

Free Training – SQL Saturday 42

Good news! I’m speaking at SQL Saturday 42 this Saturday. Got nothing to do? Head on down to Goodwill Columbus at 1331 Edgehill Rd, Columbus, OH. Got something to do? Cancel it.

I’m excited about the presentations I’m giving – I haven’t given the indexing presentation in a long time and it should be a lot of fun. And the Dynamic SQL presentation is one of the first in the day. It’s a nice easy way (I think) to get your day started.

A Dynamic World Demands Dynamic SQL

Dynamic SQL is a misunderstood and much maligned part of a DBA’s tool kit – it can be used to solve difficult business problems, respond to diverse data needs, and alleviate performance problems. Many DBAs reject dynamic SQL outright as a potential source of SQL injections, being poorly performing, or just for being a hacky solution in general. Not so! Jeremiah Peschka has been making extensive use of dynamic SQL throughout his career to solve a variety of problems. He’ll set about dispelling these misconceptions and demonstrate how dynamic SQL can become a part of every DBA’s tool kit.

Indexes And Other Free Performance Boosts

The database is often viewed as a major performance bottleneck. There are a number of quick, easy, painless techniques that can increase the performance of an application not just by a small amount, but by orders of magnitude. These techniques includes simple indexing techniques, T-SQL techniques, and general database application design patterns that give great gains in performance. In this session, you will learn how to look at a database to identify these problem areas and how to resolve common issues that you will encounter.

What else?

Let’s say you’re interested in something else. What should you go see? Well, Michael Swart (blog | twitter) put together a nice little blog post about How I plan to spend my weekend.

If you like business intelligence, I suggest you hit up Dave Rodabaugh’s presentations. I cannot speak highly enough of Dave’s work. Not only is he one of the brightest BI people I know, he’s also been a teacher, friend, and mentor to me for a long time.

There will be some kind of dinner/drinks/whatever going on afterwards at Barley’s Smokehouse (map). I plan on being there for a little bit. Even if you can’t make it to the event, head on over there around 6:00PM. I’ll be there.

Shrink, Damn’d Log! Shrink, I Say!

Ever have a database log file grow to epic proportions on the weekend? I have. Ever forget to set up job monitoring on that server? I have. Ever get so pissed off that you decided to write a job to automatically shrink the log files whenever they grow? I have.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.