Why use a GUID for a clustering key?
Conventional wisdom in the database world is that nobody should ever use a GUID for a clustering key. And the people who do are clearly awful people who hate kittens and everything good in the world. I’ve probably even gotten up in somebody’s business about this before. The problem with common knowledge is that it may not always be right. After all, common knowledge used to be that the moon made you crazy.
Think like an architect
Before discounting GUIDs as a clustered index, think about what your application requires.
Software and database architects take into account an application’s requirements as they go through the design process. It’s necessary to think about how an application needs to identify data as it’s generated – you might have to generate values for the clustering key in the application tier long before an object graph is persisted to the database. Or you might want to make sure that you have more than 2,147,483,647 possible keys (you could double that if you start at -2,147,483,648).
“But what about
BIGINT“, you shout. With modern use cases (like the Internet of Things) it’s possible to generate tremendous volumes of data. Long ago, I worked at a utility company. Every meter could generate readings at whatever frequency the business wanted. More frequent readings allow better long term trending, forecasting, and pricing but at a cost. With storage now being stupendously cheap (I have around 6 TB of raw SSD capacity in my desktop), it’s possible to store huge amounts of data.
Let’s think through the requirements:
- A single meter could generate 1 reading per minute which is 1,440 readings per day.
- Over 1 year that’s 525,600 readings per power meter.
- For a 32-bit integer, that’s 8,171 meter years of readings. We could exhaust that in a few years with any reasonable number of homes using our utility.
- For a 64-bit integer, that’s 35,096,545,041,304 meter years of readings.
- There are 3 million households in New York City.
- Let’s say we get half of them on our service.
- That’s still 23 million meter years.
So, what would we need GUIDs for?
We might need GUIDs if pre-processing occurs in the application to associate multiple data points with one another (foreign keys). Or if we anticipate having even more data coming into the system. With increases in metering and monitoring technology, it’s possible that we could be collect additional sensor data to go along with all of those meter readings (think HVAC and lighting information combined with weather station data). GUIDs can also help avoid intermediate page contention in write heavy workloads.
But the randoms!
GUIDs are random, at least as generated by SQL Server’s
NEWID function. And, as we all know, randomness is bad because it causes fragmentation in our data structures. Here’s the deal – unless you are only appending data to the end of a table with no indexes, all of your inserts will cause fragmentation.
Sure, GUIDs may cause fragmentation in the clustered index on your base table. Can you conclusively prove, outside of a contrived example, that the primary performance problem in your application is fragmentation caused by GUIDs? And that this isn’t just free space on disk? Remember – if you have indexes, you’re going to have out of order inserts into those indexes. Are you positive the poor performance isn’t just your awful code or bad indexing or maybe just out of date stats?
Before blaming the randomness of GUIDs, do your homework and prove where the bottlenecks exist in your application. Fix those. Find the next bottleneck. Rinse. Repeat.
What’s the takeaway?
It’s important to understand the reasons that you’re using one datatype over another. Understand the application’s patterns before you make a decision – database decisions last forever. OK, maybe they don’t last forever, but bad database decisions become difficult to change once you have any significant data volume present. Do your homework, understand how your choices affect you, and don’t be afraid to go against conventional wisdom when you know what’s right for your application.
This is Jeremiah
I live in Portland, OR. I have two dogs.
I recently received a Master's of Science in Computer Science from Portland State University.
I'm was Microsoft MVP from 2009 - 2018 with a pile of certifications. Somewhere along the way, I wrote a database client for Riak and then handed it off to the community.