April 2011
Mon Tue Wed Thu Fri Sat Sun
« Mar   May »
 123
45678910
11121314151617
18192021222324
252627282930  

Month April 2011

In the Event That Everything Should Go Terribly Right

Astute readers and internet stalkers will have noticed that I left my job at Quest Software back in March. I wasn’t unhappy, I just had the opportunity to take my show on the road and go solo. I’ve had the idea of being my own boss in the back of my head for along time. Suddenly I was confronted with a situation where a former pipe dream was all too real. I talked it over with a few friends and made the plunge.

Right around the same time, I started talking to Brent about his plans. This turned into talking to Brent and Tim about their plans. Then we looped Kendra in. It turns out that we all have similar goals and dreams. It only made sense to join forces and fight crime together! After evaluating the insurance costs of fighting crime we decided to become consultants. And thus Brent Ozar PLF was born.

I’ve never been more excited to work with a group of people. Brent, Tim, and Kendra have always gotten along. I’ve never felt more supported and challenged by a group of people. My business partners are three friends who have always encouraged me to excel. Whether it’s been learning about SQL Server, Ruby, or non-relational databases, these three have been there supporting me every step of the way, even when we’ve disagreed.

I could make a list of all of the other reasons that I’m looking forward to building this business, but it all boils down to the way we interact. Brent, Tim, and Kendra challenge me to be better at everything I do. Whether it’s my SQL Server skills, writing, or presenting, they’re always there helping me get better. I couldn’t ask for a better core group of friends to join me on this new endeavor.

What does the future hold? In terms of business, I’m excited to be building a business with Brent, Tim, and Kendra. Our interests are similar enough that we complement each other but they’re diverse enough that I know we’re going to educate and challenge each other.

You can learn more about our services at http://brentozar.com. If you want to get in touch, you can do that too.

PostgreSQL Update Internals

I recently covered the internals of a row in PostgreSQL, but that was just the storage piece. I got more curious and decided that I would look into what happens when a row gets updated. There are a lot of complexities to data, after all, and it’s nice to know how our database is going to be affected by updates.

Getting Set Up

I started by using the customer table from the pagila sample database. Rather than come up with a set of sample data, I figured it would be easy to work within an existing set of data.

The first trick was to find a customer to update. Since the goal is to look at an existing row, update it, and then see what happens to the row, we’ll need to be able to locate the row again. This is actually pretty easy to do. The first thing I did was retrieve the ctid along with the rest of the data in the row. I did this by running:

SELECT  ctid, *
FROM    customer
ORDER BY ctid
LIMIT 10 ;

The first ten rows

This gives us the primary key of a customer to mess with as well as the location of the row on disk. We’re going to be looking at the customer with a customer_id of 1: Mary Smith. Using that select statement, we can see that Mary Smith’s data lives on page 0 and in row 1

Updating a Row

Now that we know who we’re going to update, we can go ahead and mess around with the data. We can take a look at the row on the disk using the get_raw_page function to examine page 0 of the customer table. Mary Smith’s data is at the end of the page.

Why is Mary’s data the first row in the table but the last entry on the page? PostgreSQL starts writing data from the end of the page but writes item identifiers from the beginning of the page.

We already know that Mary’s row is in page 0, position 1 because of the ctid we retrieved in our first query. Let’s see what happens when we update some of Mary’s data. Open up a connection to PostgreSQL using your favorite interactive querying tool. I use psql on the command prompt, but there are plenty of great tools out there.

BEGIN TRANSACTION

UPDATE  customer
SET     email = 'mary.smith@gmail.com'
WHERE   customer_id = 1 ;

Don’t commit the transaction yet!

When we go to look for Mary’s data using the first select ordered by ctid, we won’t see her data anywhere.

The first ten rows, after the update

Where did her data go? Interestingly enough, it’s in two places right now because we haven’t committed the transaction. In the current query window, run the following command:

 SELECT ctid, xmin, xmax, * FROM customer WHERE customer_id = 1;

The data has moved to a new page

After running this, we can see that the customer’s row has moved off of page 0 and is now on page 8 in slot 2. The other interesting thing to note is that the xmin value has changed. Transactions with a transaction id lower than xmin won’t be able to see the row.

In another query window, run the previous select again. You’ll see that the row is still there with all of the original data present; the email address hasn’t changed. We can also see that both the xmin and xmax columns now have values. This shows us the range of transactions where this row is valid.

The row after an update, from a different transaction

Astute readers will have noticed that the row is on disk in two places at the same time. We’re going to dig into this in a minute, but for now go ahead and commit that first transaction. This is important because we want to look at what’s going on with the row after the update is complete. Looking at rows during the update process is interesting, but the after effects are much more interesting.

Customer Page 0 - After the Update

Looking at page 0 of the customer table, we can see that the original row is still present. It hasn’t been deleted yet. However, PostgreSQL has marked the row as being “old” by setting the xmax value as well as setting the t_ctid value to 00 00 00 08 00 02. This tells us that if we look on page 8 in position 2 we’ll find the newest version of the data that corresponds to Mary Smith. Eventually this old row (commonly called a dead tuple) will be cleaned up by the vacuum process.

Customer Page 8 - After the Update

If we update the row again, we’ll see that it moves to a new position on the page, from (8,2) to (8,3). If we did back in and look at the row, we’ll see that the t_ctid value in Mary Smith’s record at page 8, slot 2 is updated from 00 00 00 00 00 00 to 00 00 00 08 00 03. We can even see the original row in the hex dump from page 8. We can see the same information much more legibly by using the heap_page_items function:

select * from heap_page_items(get_raw_page('customer', 8));

There are three rows listed on the page. The row with lp 1 is the row that was originally on this page before we started messing around with Mary Smith’s email address. lp 2 is the first update to Mary’s email address.

Looking at t_infomask2 on row 2 we can immediately see two things… I lied, I can’t immediately see anything apart from some large number. But, once I applied the bitmap deciphering technology that I call “swear repeatedly”, I was able to determine that this row was HEAP_HOT_UPDATED and contains 10 attributes. Refer to htup.h for more info about the bitmap meanings.

The HOTness

PostgreSQL has a unique feature called heap only tuples (HOT for short). The HOT mechanism is designed to minimize the load on the database server in certain update conditions:

  1. A tuple is repeatedly updated
  2. The updates do not change indexed columns

For definition purposes, an “indexed column” includes any columns in the index definition, whether they are directly indexes or are used in a partial-index predicate. If your index definition mentions it, it counts.

In our case, there are no indexes on the email column of the customer table. The updates we’ve done are going to be HOT updates since they don’t touch any indexed columns. Whenever we update a new row, PostgreSQL is going to write a new version of the row and update the t_ctid column in the most current row.

When we read from an index, PostgreSQL is going to read from the index and then follow the t_ctid chains to find the current version of the row. This lets us avoid additional hits to disk when we’re updating rows. PostgreSQL just updates the row pointers. The indexes will still point to the original row, which points to the most current version of the row. We potentially take an extra hit on read, but we save on write.

To verify this, we can look at the index page contents using the bt_page_items function:

SELECT  *
FROM    bt_page_items('idx_last_name', 2)
ORDER BY ctid DESC;

We can find our record by moving through the different pages of the index. I found the row on page 2. We can locate our index row by matching up the ctid from earlier runs. Looking at that row, we can see that it points to the ctid of a row with a forwarding ctid. PostgreSQL hasn’t changed the index at all. Instead, when we do a look up based on idx_last_name, we’ll read from the index, locate any tuples with a last name of ‘SMITH’, and then look for those rows in the heap page. When we get to the heap page, we’ll find that the tuple has been updated. We’ll follow the update chain until we get to the most recent tuple and return that data.

If you want to find out more about the workings of the Heap Only Tuples feature of PostgreSQL, check out the README.

SQLPeople the First

This past weekend I had the honor and pleasure of speaking at the first SQLPeople event in Richmond, Virginia.

The Back Story

Back in February, Andy Leonard asked me to speak at a new event. The idea behind his new event was inspiration. Instead of focusing on educating others, Andy asked us to share our own inspiration. Instead of presenting a seminar or training course, the idea was to talk about my work, my vision, and my passion for database technology (to steal Andy’s own words).

It didn’t take me long to say “Yes!” It’s not every day that I get asked to talk about myself for more than 30 seconds, much less be openly invited to talk about myself with a group of people, slides, a projector, and a microphone. Can you imagine that? 40 minutes of nothing but me? Already, I could see that Andy was a man of vision.

My Original Idea

My original idea was to talk about federated databases, SQL/MED, and hybrid data. I’m really glad that Andy asked me for my abstract a second time; once he did, I couldn’t work those ideas together into anything vaguely resembling a coherent narrative.

The funny thing about inspiration is that it’s different from interest. I think federated databases are interesting; I’ve written about them enough. I think that SQL/MED is really cool; that’s how I dug into federated databases. I certainly think that hybrid data/polyglot persistence/buzzword du jour is an interesting idea; I’m talking about it at Stir Trek.

Ultimately none of these things get me excited at a base level: they’re interesting but not intriguing.

The Evolution

I sat down in front of the computer and brain stormed. I don’t think the idea really gelled until I was furiously re-typing my abstract over and over again. As I wrote, the core idea changed a number of times. Free-writing is a wonderful tool, but it’s incredibly painful when you’re trying to write something that you promised to a dear friend. Especially when you realize how late you are.

As I kept iterating over my original ideas I realized that I wasn’t excited and moved by them. I just kept typing, though. I know from experience that I can iterate through ideas over and over again and eventually get to something great. Turns out that’s exactly what happened. After slapping at the keyboard for a while, I cranked out this beauty:

If you asked Jeremiah Peschka to pick three things he’s interested in about computers, he’d say “data” and then look at you funny. If you asked him again, he’d saying “data design, database design, and designing around the limits of the first two.” This is a rapid tour of the building blocks of databases, how those choices affect what we do with data, and why we have to break the rules from time to time to get things done.

The Talk

I had a lot of fun talking about how different aspects of software design, database design, and hardware design influence the design of databases. The talk covered some of the theoretical underpinnings of databases at a very high level, it only scratched the surface of some of the things that I wanted to discuss. Computer science is such a broad field and even a small piece of it like database design can contain a huge amount of information that it’s difficult to pick and choose the relevant parts.

The twenty minute format was a welcome limitation; it forced me to focus on what was most important in my topic. I had to focus on what inspires me to keep learning and what inspires me to share part of my journey with the other people.

Thank you to everyone who put on the event and everyone who attended. It was a great opportunity to share the things I enjoy about this field.

The Future

You’d think things would end here, after all I gave a talk about my inspiration once, right?

I had so much fun putting the talk together and got so much great feedback from the attendees that I thought I would refine the short talk into a longer version. There’s a lot of stuff that I left out of the presentation. It’s not that it wasn’t relevant, but I had to trim material so I could finish the talk in 20 minutes. I want to revisit the talk an add more material; there’s so much interesting information out there that it seems a shame to not share it.

Meme Monday: Eleven Words or Less

For best performance, use RAID 10. Always.

Tagged: Erin, Jes.


Tom LaRock challenged me to write an 11 word or less blog post. I decided to offer some succinct advice as well as tag two people, all in under 11 words.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 214 items have been purified.