When Michael J Swart asked me to take part in T-SQL Tuesday #10 – Indexes, I was incredibly flattered. Nobody’s ever asked me to do anything since a cop asked me to stop doing that one thing (speeding). I had to say yes. Here’s my contribution to T-SQL Tuesday #10.
WTF is an index?
My 1967 Children’s World Book Encyclopedia doesn’t provide a definition for an index. That’s probably because it’s an encyclopedia and not a dictionary Thanks to the local gypsy garage sale going on in my neighborhood, I was able to acquire a dictionary from 1934 which defines an index as “a leverlike regulator for a hairspring.” Apparently this has something to do with clocks. If you try to verify this, you’ll notice that it’s a horological term. Apparently that has something to do with clocks and not adult entertainment.
Obviously neither my Children’s World Book Encyclopedia nor my Book of Learnin’ and Such was going to help me define an index.
In the spirit of true science, I realized that I would have to make something up. Or do my homework. One of the two.
SQL Server doesn’t have a wealth of indexing options available. In fact, there are fundamentally only two options – indexed and STFU. Sure, there are multiple indexing techniques, but this is my story.
At the core of SQL Server’s indexing strategy is the self-balancing binary search tree or, for those who don’t like to sound like pompous assfaces, the b-tree. B-trees are full of special magic. Well, not really. A b-tree is based on an algorithmic model that strives to keep the height of the search tree (the number of nodes between the root and the leaf nodes) as small as possible. This means that finding any single piece of information should be equally as cheap/expensive as finding any other piece of information.
That sounds great, right? Well, one of the fundamental indexing problems is that the data in the index has to be kept up to date as the data in the table changes. You can’t have data flying around willy nilly.
This is where we get to start sub-typing our b-trees.
Yes, I lied to you. Sort of. SQL Server has multiple types of indexes, just not multiple… types of index. There’s no way to control the indexing mechanism that SQL Server uses, you’re stuck with the pompous assfacery of self-balancing binary search trees (it’s not really a bad thing).
So, getting back to the topic at hand, in SQL Server we do have options. Two of them: clustered and non-clustered indexes.
A clustered index is an ordered collection of nuts stored for winter and categorized appropriately… or data. One of the two. When we create a clustered index, all of the nuts/data is sorted and written to disk in the appropriate logical sort order. From this point forward, no physical order is guaranteed. Once the clustered index is created and the data is placed in physical order all bets are off. The data will be stored in logical order, but there is a chance that the data may not be physically in order due to the vagaries of disk access, deleting rows and/or tables and other assorted whatnot. (I think the horologists might have something to do with it.)
What goes in a clustered index? Everything. I told you the data was copied in. Didn’t I? Well, it is. Our clustered index is all of the data in the table and it’s in order based on the clustering key value.
The choice of the b-tree algorithm actually has some implications for how we choose the index columns for the clustered index. (I’m going to call these columns the clustering key. That’s what they are. Deal with it.) Because we’re using a B+ tree, instead of another data structure, to store and sort indexed data we have to take a few things into consideration. B+ trees do really well when every indexed value is unique. This is by design – it’s easier to look up any given row if you know that there is only one row that corresponds to that key. If there’s more than one, things get tricky and I’m not going to talk about that.
B+ trees have other characteristics that make them optimal for use in databases – it’s really easy to find any single element in the tree (even compared to other binary tree mechanisms) and they’re wonderfully optimized for read access when data is being read in sequence across ranges of data.
Sometimes you won’t know the key to reference a particular piece of data. Sometimes you’ll want to pull back records for everyone who lives in Tuscaloosa, AL. If you haven’t ordered your data by city and state, this is going to be really difficult to find without scanning the entire table. How do we get that data back?
In addition to clustered indexes, which define how the data should be ordered on disk, we also have non-clustered indexes. We can use these guys to help speed up our search. These are really just supplementary b-trees that point back to the clustered b-tree. You do have a clustered index, right?
No. Of course it can get more complicated than all of this, you can have composite indexes with multiple columns, non-unique indexes, indexes with computed columns, and even full-text indexes. Unfortunately the hookers and their lever actuated hairsprings are not going to let me keep talking.
In case you’re wondering what would happen to your database if you didn’t have any indexes, take a look at this: