PostgreSQL Row Storage Fundamentals

I answered a question a few days ago on Stack Overflow about PostgreSQL & SQL Server btree storage fundamentals. After I answered the question, I started thinking more and more about how PostgreSQL stores data on the row. Rather than be content to live in ignorance, I went digging through PostgreSQL’s source code and some hex dumps to find out what was going on inside a row.

Getting Started

To get started, we’ll create a separate test database that we’ll cunningly call test.

createdb test --username=jeremiah
psql test -f pageinspect.sql

The next step is to set up our sample data. Here we create a small sample table with four rows:

CREATE TABLE tree (key int NOT NULL, id int NOT NULL);
ALTER TABLE tree ADD CONSTRAINT pk_tree PRIMARY KEY (key, id);
INSERT INTO TREE (Key, ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);

You can copy this and run it in psql or any other tool that lets you connect to PostgreSQL and run commands. When we run the ALTER TABLE to create the primary key PostgreSQL will go ahead and create an index for us: NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pk_tree" for table "tree". SQL Server will do something similar. The implementation is different, but we’ll get to that in a bit.

Read page items

In order to view data on the page we have two options. Option 1: shutdown PostgreSQL, open our copy of the source, fire up our trusty hex editor, and then hate life for a long time. Option 2: use the pageinspect contrib module to give you some extra functionality that will help us read page data.

heap_page_items will show us the data in the table. PostgreSQL stores table data in unordered heaps, hence the name heap_page_items. The heap_page_items function takes one input: a byte array representing the page. Thankfully, we can get this from using get_raw_page. This is what the first row of output will look like (magically formatted for your display):

SELECT * FROM heap_page_items(get_raw_page('tree', 0));
-[ RECORD 1 ]------
lp          | 1
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 1019
t_xmax      | 0
t_field3    | 0
t_ctid      | (0,1)
t_infomask2 | 2
t_infomask  | 2048
t_hoff      | 24
t_bits      | (NULL)
t_oid       | (NULL)

That’s a lot of gibberish, but it’s going to help us find our row data when we look at the raw page in a hex editor.

Reading the Raw Page

While we’re on the subject, let’s look at the raw page:

SELECT  get_raw_page::text
FROM    get_raw_page('tree', 0);

This will look like pure gibberish (unless you like reading hex dumps_. I took the output and pasted it into a hex editor. It’s still gibberish, but at least we can hope to read this with the hex editor making life easier.

To find the first row that we inserted (1,1) we’re going to use the lp_off from heap_page_items to get the offset from the start of the database page This is where we are going to start staring blankly. In order to figure out just how much data we need to look at, the lp_len column helpfully tells us the length in bytes of the row (including header). To look at the first row that we inserted, we want to start at byte 8160 and look at the next 32 bytes; we’ll be reading the last 32 bytes in the page.

Looking at the Row

Now that we know where to look, we can look at the row structure of the row. What we’re really interested in is the data on the page, but we’re going to take a look at the row header first.

PostgreSQL’s row structure is covered in detail in the Database Page Layout documentation, but let’s have some fun and break down the technical jargon.

What's in a page - the first part

The first 4 bytes are a TransactionId: t_xmin. This is the first transaction where our row is visible. Since PostgreSQL uses MVCC for data management in transactions, t_xmin will refer to the transaction when this particular row was inserted. In my case, this row was inserted during transaction id 1019.

You might notice that a lot of these columns start with t_ That’s because they’re columns that refer to the current tuple. lp_ refers to a line pointer. This is the pointer to the location on the page.

The next 4 bytes are another TransactionId: t_xmax. Logically, if t_xmin is the first time a row was visible, t_xmax would refer to the last time a row is visible. This is the transaction ID that identifies when this row was deleted. One of the interesting things about PostgreSQL’s implementation is that there may be many copies of a row at the same time in the database. These rows will all have different t_xmin and t_xmax values. Over time PostgreSQL’s vacuum process will remove the deleted rows from the table.

After the t_xmin and t_xmax headers, we’ve got another 4 bytes that heap_page_bytes refers to as t_field3. Thankfully, we can take a look at the source code, which has far more information for us. The developers have provided a lot more detail to make it easier to understand what’s going on . t_field3 can actually contain two pieces of information: t_cid or t_xvac. This field will contain either a pointer to the current version row (it could even refer to itself) or else the transaction ID that deleted this row. In my case, this field is 0x00000000. Initially I was confused by this and couldn’t figure out what it meant. Then I started digging around in the code again (see how this helps?) and I found another gem of a comment referring to what is really going on. The t_cid is a composite field that can be used to backtrack to a min and max command but is typically used when inserting or deleting so we can backtrack to the original version of the row. The min and max command are only interesting within context of that transaction. The other information that can be stored in t_field3 is an Xvac identifier. This is an older feature that tracks the transaction id of the older VACUUM FULL command. t_field3 is blank in this case because there is not a transaction in progress.

The next field seems somewhat similar: t_ctid. This stores a pointer to the current version of this row. When a row is updated, PostgreSQL will write a new version of the row and mark the old version as deleted (a new version of the row is written and the original row’s t_xmax field will gets a value). To make minimal changes on disk PostgreSQL will write the new row, update the t_xmax of the original row, and write to the t_ctid field of the original. This has the effect of creating a short lived row pointer. Eventually this will be cleaned up by the VACUUM process, but it’s worth noting. The t_ctid is 6 bytes – 4 bytes to store the BlockIdData (the physical location of the page in the database file) and the OffsetNumber (a 1-based index to a row on a disk page). Combining these two gives us the coordinates of the row on a page. Pretty nifty, eh? Here, the t_ctid is the same as the current row because no updates have occurred.

What's in a page - the second part

Both t_infomask2 and t_infomask are flags and attributes. There is a lot of information that can be contained about NULLs, variable width columns, the current state of the row (is it in a transaction, the state of any transaction, how the row arrived at this location). This is documented very well in the source code. Normally I wouldn’t say that, but the source is fantastic and I’d just be repeating it here.

The last two items before the row data are t_hoff and t_bits. The t_hoff field is the header offset – this tells us where we can find the beginning of data inside the row. This is important because t_bits is a variable length bitmap of null values. The value stored in t_hoff is the size of the header including the size of the t_bits bitmap plus any associated padding that may be lurking about in the row.

We finally get to the data. There are two 4 byte integers. For this first row (id: 1, key: 1), both integers are stored as 0x00000001 (they’re both 1). If you look in the background of either picture, you’ll see how the values increase with each row.

What the Hell Did I Just Say?

The original question on Stack Overflow was asking “if data is a b-tree, how the devil is it stored on disk?” The poster was a bit confused and thought that composite keys would be stored as a tree and might look something like this (ASCII art is great):

Key:           1
                    |
         +----+-----+----+
         |    |     |    |
Value: 1    2     3    4

We just went and looked at the table structure in PostgreSQL and we know that’s not the case for the heap structure. Every value in every column is stored in the heap.

What About the Index?

Yes, what about that index?

If I had never created that primary key, we would have nothing to worry about – the table would be a heap and that’s it. Since I created the primary key (and PostgreSQL created a unique index to support that primary key), we have to worry about what the index looks like. Instead, we’re stuck with this index that we have to figure out.

Thankfully we have some functions that make it easy to see how the index is put together. We’re only going to look at bt_page_items. The other functions are interesting and let you see how the index page is put together but, frankly, we’re more worried about the actual index data itself.

SELECT * FROM bt_page_items('pk_tree', 1);

  itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
              1 | (0,1) |      16 | f     | f    | 01 00 00 00 01 00 00 00
              2 | (0,2) |      16 | f     | f    | 01 00 00 00 02 00 00 00
              3 | (0,3) |      16 | f     | f    | 01 00 00 00 03 00 00 00
              4 | (0,4) |      16 | f     | f    | 01 00 00 00 04 00 00 00

The itemoffset is similar to the lp_off that we used earlier – it tells us the offset of this particular b-tree record within the page. We can use a combination of itemoffset and itemlen to calculate the location of the row in the page. Pretty slick, eh?

The next column, ctid helps PostgreSQL identify the current version of the row in the heap to read from once all of the lookups are done. ctid is stored on the b-tree page as two 2 byte integers. The first integer is the page the row exists on and the second integer is the row’s offset on the page. The t_ctid field in the heap row’s header may point to a newer version of the row.

Whenever possible, PostgreSQL will perform a HOT update (Heap-Only Tuple update) and only update the heap record. Why? It’s a heck of a lot cheaper than updating rows inside indexes. The MVCC model doesn’t extend to indexes, so modifying data on a b-tree page is more expensive than modifying data in a heap.

There is also a bitmask that lets us know if there are any nulls or variable length columns in the b-tree (just like in the heap).

Finally, we come to the data. If you look back at the data in the heap pages and compare it to the data in the b-tree page, it’s exactly the same. PostgreSQL isn’t doing anything fancy at the leaf-level of the index to store lookup data in any kind of tree structure. Our data is being written in the same format we’d expect at the page level.

Multi-Level B-Trees

Things are a bit different when you look at a b-tree with multiple levels. WTF is a multi-level b-tree?

B-trees are a data structure that makes it easy to find the particular piece of data that you’re looking for. To make these lookups easier, b-trees are structured into levels. The top level is like the markings in a dictionary pointing to a letter: they make it easy to know roughly where word might be located. Then there is an intermediate level that’s like the guide at the top of the page telling you that this page contains ‘aardvark – anteater’.

PostgreSQL (as with most RDBMSes) uses a variant of the b-tree called a B+ tree. B+ trees are optimized for block operations and contain additional optimizations that improve performance. As a result, they’re wonderful for databases, filesystems, and anything else that has to find and read data in large chunks.

What’s the B+ Tree Look Like?

Enough about the theory, let’s see some bytes!

Since the first table wasn’t large enough, I created a second table with much more data:

CREATE TABLE tree2 (key int NOT NULL, id int NOT NULL);
ALTER TABLE tree2 ADD CONSTRAINT pk_tree2 PRIMARY KEY (key, id);

DO $$
BEGIN
  FOR i IN 1..10000 LOOP

    FOR j in 1..100 LOOP
      INSERT INTO public.tree2 VALUES (i,j);
    END LOOP;

  END LOOP;
END
$$ LANGUAGE 'plpgsql';

Once we have 1,000,000 rows in the table, we can take a look at the index meta-data to see what we’re dealing with:

SELECT * FROM bt_metap('pk_tree2');

 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2

From the bt_metap function, we can tell that there are 2 levels in the table and the root of the tree is on page 412. Let’s look at what’s on this root page:

SELECT * FROM bt_page_items('pk_tree2', 412);

 itemoffset |   ctid   | itemlen | nulls | vars |          data           
------------+----------+---------+-------+------+-------------------------
          1 | (3,1)    |       8 | f     | f    | 
          2 | (411,1)  |      16 | f     | f    | 14 04 00 00 0b 00 00 00
          3 | (698,1)  |      16 | f     | f    | 27 08 00 00 15 00 00 00
          4 | (984,1)  |      16 | f     | f    | 3a 0c 00 00 1f 00 00 00
          5 | (1270,1) |      16 | f     | f    | 4d 10 00 00 29 00 00 00
          6 | (1556,1) |      16 | f     | f    | 60 14 00 00 33 00 00 00
          7 | (1842,1) |      16 | f     | f    | 73 18 00 00 3d 00 00 00
          8 | (2128,1) |      16 | f     | f    | 86 1c 00 00 47 00 00 00
          9 | (2414,1) |      16 | f     | f    | 99 20 00 00 51 00 00 00
         10 | (2700,1) |      16 | f     | f    | ac 24 00 00 5b 00 00 00

This is showing us how data is laid out throughout the index. We can look on pages 3, 411, 698, 984, 1270, 1556, 1842, 2128, 2414, and 2700 and see what data is on the intermediate (non-root, non-leaf) pages of the index. It’s similar to the data in the root of the index: it tells PostgreSQL how to find rows that fall within a certain range in the index. Whenever we finally drill down to the leaf level itself, we’ll see something that looks a lot like what we saw when we looked at the index pk_tree.

Summary

Long story short, PostgreSQL does not perform any shenanigans or trickery to compact the data on disk. If we have what seems like a hierarchical index on two columns (parent, child), all values will still be stored on disk. Data is stored in a heap structure that is supported by additional indexes that speed up lookups.

Comments

13 Comments so far. Comments are closed.
  1. Thom Brown,

    Following your exact example, I looked at the raw page data from offset 8160 for 32 bytes, and it consisted entirely of zeros. Am I missing something here?

    postgres=# select substring(get_raw_page::text,8160,32) from get_raw_page('tree', 0);
    substring
    ----------------------------------
    00000000000000000000000000000000
    (1 row)

    • Ah, you ran into this, too. I had to take the output of that function and paste it into my hex editor. Once I did that the offsets were correct. The text output is 16k characters of hex, which corresponds to 8k bytes. If that doesn’t work for you, let me know and I’ll put together a screencast of how I got there.

  2. I found this very informative. I was surprised how easily you could dig into the code.

  3. I was not sure you were aware but I have diagrams for many of these structures in my presentation “PostgreSQL Internals Through Pictures” at http://momjian.us/main/presentations/internals.html#internal_pics .

    • Thanks for sharing the links. I’ve run across the Interals through Pictures presentation before, but I couldn’t remember what it was called and couldn’t find it again. I’ve also been using How PostgreSQL Processes A Query to help me understand what PostgreSQL is doing under the covers (as well as get a crash course in C).

      The documentation around PostgreSQL is nothing short of impressive, especially where the database internals are concerned.

  4. Oh, and this talk has more diagrams:

    PostgreSQL Internals Through Pictures – Presentation
    http://momjian.us/main/presentations/internals.html#shared_memory

  5. srinivas,

    thanq very much Jeremiah Peschka, Bruce Momjian for sharing your knowledge. thanks alot.

  6. dezso,

    Great stuff!
    Here: “The t_ctid is a composite field” did you mean t_cid?

  7. dezso,

    This one keeps exciting me :) I tried to figure out how to make the connection between index and table rows (say, a ctid to ctid relation). Is there any other way than comparing data?

    • Sadly, the obvious and seemingly laborious solution is what we have to do to find table rows back from the index pages. You can also make use of bt_page_items to locate the page and offset of a particular index row in the underlying heap. That’s about the best that I can think of off the top of my head. You can check out my brief dive into PostgreSQL Update Internals. If the lack of pictures poses a problem, let me know and I’ll track them down. This blog has fallen into disarray as of late.

      Ultimately, this is one of the dangers of heaps as an underlying data structure – they’re incredibly fast to write into, but finding a row in a heap can be a chore when there are multiple updates.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 531 items have been purified.