I answered a question a few days ago on Stack Overflow about PostgreSQL & SQL Server btree storage fundamentals. After I answered the question, I started thinking more and more about how PostgreSQL stores data on the row. Rather than be content to live in ignorance, I went digging through PostgreSQL’s source code and some hex dumps to find out what was going on inside a row.
To get started, we’ll create a separate test database that we’ll cunningly call
createdb test --username=jeremiah psql test -f pageinspect.sql
The next step is to set up our sample data. Here we create a small sample table with four rows:
CREATE TABLE tree (key int NOT NULL, id int NOT NULL); ALTER TABLE tree ADD CONSTRAINT pk_tree PRIMARY KEY (key, id); INSERT INTO TREE (Key, ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);
You can copy this and run it in psql or any other tool that lets you connect to PostgreSQL and run commands. When we run the
ALTER TABLE to create the primary key PostgreSQL will go ahead and create an index for us:
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pk_tree" for table "tree". SQL Server will do something similar. The implementation is different, but we’ll get to that in a bit.
Read page items
In order to view data on the page we have two options. Option 1: shutdown PostgreSQL, open our copy of the source, fire up our trusty hex editor, and then hate life for a long time. Option 2: use the
pageinspect contrib module to give you some extra functionality that will help us read page data.
heap_page_items will show us the data in the table. PostgreSQL stores table data in unordered heaps, hence the name
heap_page_items function takes one input: a byte array representing the page. Thankfully, we can get this from using
get_raw_page. This is what the first row of output will look like (magically formatted for your display):
SELECT * FROM heap_page_items(get_raw_page('tree', 0));
-[ RECORD 1 ]------ lp | 1 lp_off | 8160 lp_flags | 1 lp_len | 32 t_xmin | 1019 t_xmax | 0 t_field3 | 0 t_ctid | (0,1) t_infomask2 | 2 t_infomask | 2048 t_hoff | 24 t_bits | (NULL) t_oid | (NULL)
That’s a lot of gibberish, but it’s going to help us find our row data when we look at the raw page in a hex editor.
Reading the Raw Page
While we’re on the subject, let’s look at the raw page:
SELECT get_raw_page::text FROM get_raw_page('tree', 0);
This will look like pure gibberish (unless you like reading hex dumps_. I took the output and pasted it into a hex editor. It’s still gibberish, but at least we can hope to read this with the hex editor making life easier.
To find the first row that we inserted (1,1) we’re going to use the
heap_page_items to get the offset from the start of the database page This is where we are going to start staring blankly. In order to figure out just how much data we need to look at, the
lp_len column helpfully tells us the length in bytes of the row (including header). To look at the first row that we inserted, we want to start at byte 8160 and look at the next 32 bytes; we’ll be reading the last 32 bytes in the page.
Looking at the Row
Now that we know where to look, we can look at the row structure of the row. What we’re really interested in is the data on the page, but we’re going to take a look at the row header first.
PostgreSQL’s row structure is covered in detail in the Database Page Layoutdocumentation, but let’s have some fun and break down the technical jargon.
The first 4 bytes are a TransactionId:
t_xmin. This is the first transaction where our row is visible. Since PostgreSQL uses MVCC for data management in transactions,
t_xminwill refer to the transaction when this particular row was inserted. In my case, this row was inserted during transaction id 1019.
You might notice that a lot of these columns start with
t_That’s because they’re columns that refer to the current tuple.
lp_refers to a line pointer. This is the pointer to the location on the page.
The next 4 bytes are another TransactionId:
t_xmax. Logically, if
t_xmin is the first time a row was visible,
t_xmax would refer to the last time a row is visible. This is the transaction ID that identifies when this row was deleted. One of the interesting things about PostgreSQL’s implementation is that there may be many copies of a row at the same time in the database. These rows will all have different
t_xmax values. Over time PostgreSQL’s vacuum process will remove the deleted rows from the table.
t_xmax headers, we’ve got another 4 bytes that
heap_page_bytes refers to as
t_field3. Thankfully, we can take a look at the source code, which has far more information for us. The developers have provided a lot more detail to make it easier to understand what’s going on .
t_field3 can actually contain two pieces of information:
t_xvac. This field will contain either a pointer to the current version row (it could even refer to itself) or else the transaction ID that deleted this row. In my case, this field is
0x00000000. Initially I was confused by this and couldn’t figure out what it meant. Then I started digging around in the code again (see how this helps?) and I found another gem of a comment referring to what is really going on. The
t_cid is a composite field that can be used to backtrack to a min and max command but is typically used when inserting or deleting so we can backtrack to the original version of the row. The min and max command are only interesting within context of that transaction. The other information that can be stored in
t_field3 is an Xvac identifier. This is an older feature that tracks the transaction id of the older
VACUUM FULL command.
t_field3 is blank in this case because there is not a transaction in progress.
The next field seems somewhat similar:
t_ctid. This stores a pointer to the current version of this row. When a row is updated, PostgreSQL will write a new version of the row and mark the old version as deleted (a new version of the row is written and the original row’s
t_xmax field will gets a value). To make minimal changes on disk PostgreSQL will write the new row, update the
t_xmax of the original row, and write to the
t_ctid field of the original. This has the effect of creating a short lived row pointer. Eventually this will be cleaned up by the
VACUUM process, but it’s worth noting. The
t_ctid is 6 bytes – 4 bytes to store the
BlockIdData (the physical location of the page in the database file) and the
OffsetNumber (a 1-based index to a row on a disk page). Combining these two gives us the coordinates of the row on a page. Pretty nifty, eh? Here, the
t_ctid is the same as the current row because no updates have occurred.
t_infomask are flags and attributes. There is a lot of information that can be contained about NULLs, variable width columns, the current state of the row (is it in a transaction, the state of any transaction, how the row arrived at this location). This is documented very well in the source code. Normally I wouldn’t say that, but the source is fantastic and I’d just be repeating it here.
The last two items before the row data are
t_hoff field is the header offset – this tells us where we can find the beginning of data inside the row. This is important because
t_bits is a variable length bitmap of null values. The value stored in
t_hoff is the size of the header including the size of the
t_bits bitmap plus any associated padding that may be lurking about in the row.
We finally get to the data. There are two 4 byte integers. For this first row (id: 1, key: 1), both integers are stored as
0x00000001 (they’re both 1). If you look in the background of either picture, you’ll see how the values increase with each row.
What the Hell Did I Just Say?
The original question on Stack Overflow was asking “if data is a b-tree, how the devil is it stored on disk?” The poster was a bit confused and thought that composite keys would be stored as a tree and might look something like this (ASCII art is great):
Key: 1 | +----+-----+----+ | | | | Value: 1 2 3 4
We just went and looked at the table structure in PostgreSQL and we know that’s not the case for the heap structure. Every value in every column is stored in the heap.
What About the Index?
Yes, what about that index?
If I had never created that primary key, we would have nothing to worry about – the table would be a heap and that’s it. Since I created the primary key (and PostgreSQL created a unique index to support that primary key), we have to worry about what the index looks like. Instead, we’re stuck with this index that we have to figure out.
Thankfully we have some functions that make it easy to see how the index is put together. We’re only going to look at
bt_page_items. The other functions are interesting and let you see how the index page is put together but, frankly, we’re more worried about the actual index data itself.
SELECT * FROM bt_page_items('pk_tree', 1);
itemoffset | ctid | itemlen | nulls | vars | data ------------+-------+---------+-------+------+------------------------- 1 | (0,1) | 16 | f | f | 01 00 00 00 01 00 00 00 2 | (0,2) | 16 | f | f | 01 00 00 00 02 00 00 00 3 | (0,3) | 16 | f | f | 01 00 00 00 03 00 00 00 4 | (0,4) | 16 | f | f | 01 00 00 00 04 00 00 00
itemoffset is similar to the
lp_off that we used earlier – it tells us the offset of this particular b-tree record within the page. We can use a combination of
itemlen to calculate the location of the row in the page. Pretty slick, eh?
The next column,
ctid helps PostgreSQL identify the current version of the row in the heap to read from once all of the lookups are done.
ctid is stored on the b-tree page as two 2 byte integers. The first integer is the page the row exists on and the second integer is the row’s offset on the page. The
t_ctidfield in the heap row’s header may point to a newer version of the row.
Whenever possible, PostgreSQL will perform a HOT update (Heap-Only Tuple update) and only update the heap record. Why? It’s a heck of a lot cheaper than updating rows inside indexes. The MVCC model doesn’t extend to indexes, so modifying data on a b-tree page is more expensive than modifying data in a heap.
There is also a bitmask that lets us know if there are any nulls or variable length columns in the b-tree (just like in the heap).
Finally, we come to the data. If you look back at the data in the heap pages and compare it to the data in the b-tree page, it’s exactly the same. PostgreSQL isn’t doing anything fancy at the leaf-level of the index to store lookup data in any kind of tree structure. Our data is being written in the same format we’d expect at the page level.
Things are a bit different when you look at a b-tree with multiple levels. WTF is a multi-level b-tree?
B-trees are a data structure that makes it easy to find the particular piece of data that you’re looking for. To make these lookups easier, b-trees are structured into levels. The top level is like the markings in a dictionary pointing to a letter: they make it easy to know roughly where word might be located. Then there is an intermediate level that’s like the guide at the top of the page telling you that this page contains ‘aardvark – anteater’.
PostgreSQL (as with most RDBMSes) uses a variant of the b-tree called a B+ tree. B+ trees are optimized for block operations and contain additional optimizations that improve performance. As a result, they’re wonderful for databases, filesystems, and anything else that has to find and read data in large chunks.
What’s the B+ Tree Look Like?
Enough about the theory, let’s see some bytes!
Since the first table wasn’t large enough, I created a second table with much more data:
CREATE TABLE tree2 (key int NOT NULL, id int NOT NULL); ALTER TABLE tree2 ADD CONSTRAINT pk_tree2 PRIMARY KEY (key, id); DO $$ BEGIN FOR i IN 1..10000 LOOP FOR j in 1..100 LOOP INSERT INTO public.tree2 VALUES (i,j); END LOOP; END LOOP; END $$ LANGUAGE 'plpgsql';
Once we have 1,000,000 rows in the table, we can take a look at the index meta-data to see what we’re dealing with:
SELECT * FROM bt_metap('pk_tree2');
magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2
bt_metap function, we can tell that there are 2 levels in the table and the root of the tree is on page 412. Let’s look at what’s on this root page:
SELECT * FROM bt_page_items('pk_tree2', 412);
itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (411,1) | 16 | f | f | 14 04 00 00 0b 00 00 00 3 | (698,1) | 16 | f | f | 27 08 00 00 15 00 00 00 4 | (984,1) | 16 | f | f | 3a 0c 00 00 1f 00 00 00 5 | (1270,1) | 16 | f | f | 4d 10 00 00 29 00 00 00 6 | (1556,1) | 16 | f | f | 60 14 00 00 33 00 00 00 7 | (1842,1) | 16 | f | f | 73 18 00 00 3d 00 00 00 8 | (2128,1) | 16 | f | f | 86 1c 00 00 47 00 00 00 9 | (2414,1) | 16 | f | f | 99 20 00 00 51 00 00 00 10 | (2700,1) | 16 | f | f | ac 24 00 00 5b 00 00 00
This is showing us how data is laid out throughout the index. We can look on pages 3, 411, 698, 984, 1270, 1556, 1842, 2128, 2414, and 2700 and see what data is on the intermediate (non-root, non-leaf) pages of the index. It’s similar to the data in the root of the index: it tells PostgreSQL how to find rows that fall within a certain range in the index. Whenever we finally drill down to the leaf level itself, we’ll see something that looks a lot like what we saw when we looked at the index
Long story short, PostgreSQL does not perform any shenanigans or trickery to compact the data on disk. If we have what seems like a hierarchical index on two columns (parent, child), all values will still be stored on disk. Data is stored in a heap structure that is supported by additional indexes that speed up lookups.