Category pg_syndication

PostgreSQL Update Internals

I recently covered the internals of a row in PostgreSQL, but that was just the storage piece. I got more curious and decided that I would look into what happens when a row gets updated. There are a lot of complexities to data, after all, and it’s nice to know how our database is going to be affected by updates.

Getting Set Up

I started by using the customer table from the pagila sample database. Rather than come up with a set of sample data, I figured it would be easy to work within an existing set of data.

The first trick was to find a customer to update. Since the goal is to look at an existing row, update it, and then see what happens to the row, we’ll need to be able to locate the row again. This is actually pretty easy to do. The first thing I did was retrieve the ctid along with the rest of the data in the row. I did this by running:

SELECT  ctid, *
FROM    customer
ORDER BY ctid 
LIMIT 10 ;

The first ten rows

This gives us the primary key of a customer to mess with as well as the location of the row on disk. We’re going to be looking at the customer with a customer_id of 1: Mary Smith. Using that select statement, we can see that Mary Smith’s data lives on page 0 and in row 1

Updating a Row

Now that we know who we’re going to update, we can go ahead and mess around with the data. We can take a look at the row on the disk using the get_raw_page function to examine page 0 of the customer table. Mary Smith’s data is at the end of the page.

Why is Mary’s data the first row in the table but the last entry on the page? PostgreSQL starts writing data from the end of the page but writes item identifiers from the beginning of the page.

We already know that Mary’s row is in page 0, position 1 because of the ctid we retrieved in our first query. Let’s see what happens when we update some of Mary’s data. Open up a connection to PostgreSQL using your favorite interactive querying tool. I use psql on the command prompt, but there are plenty of great tools out there.


UPDATE  customer
SET     email = ''
WHERE   customer_id = 1 ;

Don’t commit the transaction yet!

When we go to look for Mary’s data using the first select ordered by ctid, we won’t see her data anywhere.

The first ten rows, after the update

Where did her data go? Interestingly enough, it’s in two places right now because we haven’t committed the transaction. In the current query window, run the following command:

 SELECT ctid, xmin, xmax, * FROM customer WHERE customer_id = 1;

The data has moved to a new page

After running this, we can see that the customer’s row has moved off of page 0 and is now on page 8 in slot 2. The other interesting thing to note is that the xmin value has changed. Transactions with a transaction id lower than xmin won’t be able to see the row.

In another query window, run the previous select again. You’ll see that the row is still there with all of the original data present; the email address hasn’t changed. We can also see that both the xmin and xmax columns now have values. This shows us the range of transactions where this row is valid.

The row after an update, from a different transaction

Astute readers will have noticed that the row is on disk in two places at the same time. We’re going to dig into this in a minute, but for now go ahead and commit that first transaction. This is important because we want to look at what’s going on with the row after the update is complete. Looking at rows during the update process is interesting, but the after effects are much more interesting.

Customer Page 0 - After the Update

Looking at page 0 of the customer table, we can see that the original row is still present. It hasn’t been deleted yet. However, PostgreSQL has marked the row as being “old” by setting the xmax value as well as setting the t_ctid value to 00 00 00 08 00 02. This tells us that if we look on page 8 in position 2 we’ll find the newest version of the data that corresponds to Mary Smith. Eventually this old row (commonly called a dead tuple) will be cleaned up by the vacuum process.

Customer Page 8 - After the Update

If we update the row again, we’ll see that it moves to a new position on the page, from (8,2) to (8,3). If we did back in and look at the row, we’ll see that the t_ctid value in Mary Smith’s record at page 8, slot 2 is updated from 00 00 00 00 00 00 to 00 00 00 08 00 03. We can even see the original row in the hex dump from page 8. We can see the same information much more legibly by using the heap_page_items function:

select * from heap_page_items(get_raw_page('customer', 8));

There are three rows listed on the page. The row with lp 1 is the row that was originally on this page before we started messing around with Mary Smith’s email address. lp 2 is the first update to Mary’s email address.

Looking at t_infomask2 on row 2 we can immediately see two things… I lied, I can’t immediately see anything apart from some large number. But, once I applied the bitmap deciphering technology that I call “swear repeatedly”, I was able to determine that this row was HEAP_HOT_UPDATED and contains 10 attributes. Refer to htup.h for more info about the bitmap meanings.

The HOTness

PostgreSQL has a unique feature called heap only tuples (HOT for short). The HOT mechanism is designed to minimize the load on the database server in certain update conditions:

  1. A tuple is repeatedly updated
  2. The updates do not change indexed columns

For definition purposes, an “indexed column” includes any columns in the index definition, whether they are directly indexes or are used in a partial-index predicate. If your index definition mentions it, it counts.

In our case, there are no indexes on the email column of the customer table. The updates we’ve done are going to be HOT updates since they don’t touch any indexed columns. Whenever we update a new row, PostgreSQL is going to write a new version of the row and update the t_ctid column in the most current row.

When we read from an index, PostgreSQL is going to read from the index and then follow the t_ctid chains to find the current version of the row. This lets us avoid additional hits to disk when we’re updating rows. PostgreSQL just updates the row pointers. The indexes will still point to the original row, which points to the most current version of the row. We potentially take an extra hit on read, but we save on write.

To verify this, we can look at the index page contents using the bt_page_items function:

FROM    bt_page_items('idx_last_name', 2) 

We can find our record by moving through the different pages of the index. I found the row on page 2. We can locate our index row by matching up the ctid from earlier runs. Looking at that row, we can see that it points to the ctid of a row with a forwarding ctid. PostgreSQL hasn’t changed the index at all. Instead, when we do a look up based on idx_last_name, we’ll read from the index, locate any tuples with a last name of ‘SMITH’, and then look for those rows in the heap page. When we get to the heap page, we’ll find that the tuple has been updated. We’ll follow the update chain until we get to the most recent tuple and return that data.

If you want to find out more about the workings of the Heap Only Tuples feature of PostgreSQL, check out the README.

PostgreSQL Row Storage Fundamentals

I answered a question a few days ago on Stack Overflow about PostgreSQL & SQL Server btree storage fundamentals. After I answered the question, I started thinking more and more about how PostgreSQL stores data on the row. Rather than be content to live in ignorance, I went digging through PostgreSQL’s source code and some hex dumps to find out what was going on inside a row.

Getting Started

To get started, we’ll create a separate test database that we’ll cunningly call test.

createdb test --username=jeremiah
psql test -f pageinspect.sql

The next step is to set up our sample data. Here we create a small sample table with four rows:

CREATE TABLE tree (key int NOT NULL, id int NOT NULL);
INSERT INTO TREE (Key, ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);

You can copy this and run it in psql or any other tool that lets you connect to PostgreSQL and run commands. When we run the ALTER TABLE to create the primary key PostgreSQL will go ahead and create an index for us: NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pk_tree" for table "tree". SQL Server will do something similar. The implementation is different, but we’ll get to that in a bit.

Read page items

In order to view data on the page we have two options. Option 1: shutdown PostgreSQL, open our copy of the source, fire up our trusty hex editor, and then hate life for a long time. Option 2: use the pageinspect contrib module to give you some extra functionality that will help us read page data.

heap_page_items will show us the data in the table. PostgreSQL stores table data in unordered heaps, hence the name heap_page_items. The heap_page_items function takes one input: a byte array representing the page. Thankfully, we can get this from using get_raw_page. This is what the first row of output will look like (magically formatted for your display):

SELECT * FROM heap_page_items(get_raw_page('tree', 0));
-[ RECORD 1 ]------
lp          | 1
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 1019
t_xmax      | 0
t_field3    | 0
t_ctid      | (0,1)
t_infomask2 | 2
t_infomask  | 2048
t_hoff      | 24
t_bits      | (NULL)
t_oid       | (NULL)

That’s a lot of gibberish, but it’s going to help us find our row data when we look at the raw page in a hex editor.

Reading the Raw Page

While we’re on the subject, let’s look at the raw page:

SELECT  get_raw_page::text
FROM    get_raw_page('tree', 0);

This will look like pure gibberish (unless you like reading hex dumps_. I took the output and pasted it into a hex editor. It’s still gibberish, but at least we can hope to read this with the hex editor making life easier.

To find the first row that we inserted (1,1) we’re going to use the lp_off from heap_page_items to get the offset from the start of the database page This is where we are going to start staring blankly. In order to figure out just how much data we need to look at, the lp_len column helpfully tells us the length in bytes of the row (including header). To look at the first row that we inserted, we want to start at byte 8160 and look at the next 32 bytes; we’ll be reading the last 32 bytes in the page.

Looking at the Row

Now that we know where to look, we can look at the row structure of the row. What we’re really interested in is the data on the page, but we’re going to take a look at the row header first.

PostgreSQL’s row structure is covered in detail in the Database Page Layout documentation, but let’s have some fun and break down the technical jargon.

What's in a page - the first part

The first 4 bytes are a TransactionId: t_xmin. This is the first transaction where our row is visible. Since PostgreSQL uses MVCC for data management in transactions, t_xmin will refer to the transaction when this particular row was inserted. In my case, this row was inserted during transaction id 1019.

You might notice that a lot of these columns start with t_ That’s because they’re columns that refer to the current tuple. lp_ refers to a line pointer. This is the pointer to the location on the page.

The next 4 bytes are another TransactionId: t_xmax. Logically, if t_xmin is the first time a row was visible, t_xmax would refer to the last time a row is visible. This is the transaction ID that identifies when this row was deleted. One of the interesting things about PostgreSQL’s implementation is that there may be many copies of a row at the same time in the database. These rows will all have different t_xmin and t_xmax values. Over time PostgreSQL’s vacuum process will remove the deleted rows from the table.

After the t_xmin and t_xmax headers, we’ve got another 4 bytes that heap_page_bytes refers to as t_field3. Thankfully, we can take a look at the source code, which has far more information for us. The developers have provided a lot more detail to make it easier to understand what’s going on . t_field3 can actually contain two pieces of information: t_cid or t_xvac. This field will contain either a pointer to the current version row (it could even refer to itself) or else the transaction ID that deleted this row. In my case, this field is 0x00000000. Initially I was confused by this and couldn’t figure out what it meant. Then I started digging around in the code again (see how this helps?) and I found another gem of a comment referring to what is really going on. The t_cid is a composite field that can be used to backtrack to a min and max command but is typically used when inserting or deleting so we can backtrack to the original version of the row. The min and max command are only interesting within context of that transaction. The other information that can be stored in t_field3 is an Xvac identifier. This is an older feature that tracks the transaction id of the older VACUUM FULL command. t_field3 is blank in this case because there is not a transaction in progress.

The next field seems somewhat similar: t_ctid. This stores a pointer to the current version of this row. When a row is updated, PostgreSQL will write a new version of the row and mark the old version as deleted (a new version of the row is written and the original row’s t_xmax field will gets a value). To make minimal changes on disk PostgreSQL will write the new row, update the t_xmax of the original row, and write to the t_ctid field of the original. This has the effect of creating a short lived row pointer. Eventually this will be cleaned up by the VACUUM process, but it’s worth noting. The t_ctid is 6 bytes – 4 bytes to store the BlockIdData (the physical location of the page in the database file) and the OffsetNumber (a 1-based index to a row on a disk page). Combining these two gives us the coordinates of the row on a page. Pretty nifty, eh? Here, the t_ctid is the same as the current row because no updates have occurred.

What's in a page - the second part

Both t_infomask2 and t_infomask are flags and attributes. There is a lot of information that can be contained about NULLs, variable width columns, the current state of the row (is it in a transaction, the state of any transaction, how the row arrived at this location). This is documented very well in the source code. Normally I wouldn’t say that, but the source is fantastic and I’d just be repeating it here.

The last two items before the row data are t_hoff and t_bits. The t_hoff field is the header offset – this tells us where we can find the beginning of data inside the row. This is important because t_bits is a variable length bitmap of null values. The value stored in t_hoff is the size of the header including the size of the t_bits bitmap plus any associated padding that may be lurking about in the row.

We finally get to the data. There are two 4 byte integers. For this first row (id: 1, key: 1), both integers are stored as 0x00000001 (they’re both 1). If you look in the background of either picture, you’ll see how the values increase with each row.

What the Hell Did I Just Say?

The original question on Stack Overflow was asking “if data is a b-tree, how the devil is it stored on disk?” The poster was a bit confused and thought that composite keys would be stored as a tree and might look something like this (ASCII art is great):

Key:           1
         |    |     |    |
Value: 1    2     3    4

We just went and looked at the table structure in PostgreSQL and we know that’s not the case for the heap structure. Every value in every column is stored in the heap.

What About the Index?

Yes, what about that index?

If I had never created that primary key, we would have nothing to worry about – the table would be a heap and that’s it. Since I created the primary key (and PostgreSQL created a unique index to support that primary key), we have to worry about what the index looks like. Instead, we’re stuck with this index that we have to figure out.

Thankfully we have some functions that make it easy to see how the index is put together. We’re only going to look at bt_page_items. The other functions are interesting and let you see how the index page is put together but, frankly, we’re more worried about the actual index data itself.

SELECT * FROM bt_page_items('pk_tree', 1);

  itemoffset | ctid  | itemlen | nulls | vars |          data           
              1 | (0,1) |      16 | f     | f    | 01 00 00 00 01 00 00 00
              2 | (0,2) |      16 | f     | f    | 01 00 00 00 02 00 00 00
              3 | (0,3) |      16 | f     | f    | 01 00 00 00 03 00 00 00
              4 | (0,4) |      16 | f     | f    | 01 00 00 00 04 00 00 00

The itemoffset is similar to the lp_off that we used earlier – it tells us the offset of this particular b-tree record within the page. We can use a combination of itemoffset and itemlen to calculate the location of the row in the page. Pretty slick, eh?

The next column, ctid helps PostgreSQL identify the current version of the row in the heap to read from once all of the lookups are done. ctid is stored on the b-tree page as two 2 byte integers. The first integer is the page the row exists on and the second integer is the row’s offset on the page. The t_ctid field in the heap row’s header may point to a newer version of the row.

Whenever possible, PostgreSQL will perform a HOT update (Heap-Only Tuple update) and only update the heap record. Why? It’s a heck of a lot cheaper than updating rows inside indexes. The MVCC model doesn’t extend to indexes, so modifying data on a b-tree page is more expensive than modifying data in a heap.

There is also a bitmask that lets us know if there are any nulls or variable length columns in the b-tree (just like in the heap).

Finally, we come to the data. If you look back at the data in the heap pages and compare it to the data in the b-tree page, it’s exactly the same. PostgreSQL isn’t doing anything fancy at the leaf-level of the index to store lookup data in any kind of tree structure. Our data is being written in the same format we’d expect at the page level.

Multi-Level B-Trees

Things are a bit different when you look at a b-tree with multiple levels. WTF is a multi-level b-tree?

B-trees are a data structure that makes it easy to find the particular piece of data that you’re looking for. To make these lookups easier, b-trees are structured into levels. The top level is like the markings in a dictionary pointing to a letter: they make it easy to know roughly where word might be located. Then there is an intermediate level that’s like the guide at the top of the page telling you that this page contains ‘aardvark – anteater’.

PostgreSQL (as with most RDBMSes) uses a variant of the b-tree called a B+ tree. B+ trees are optimized for block operations and contain additional optimizations that improve performance. As a result, they’re wonderful for databases, filesystems, and anything else that has to find and read data in large chunks.

What’s the B+ Tree Look Like?

Enough about the theory, let’s see some bytes!

Since the first table wasn’t large enough, I created a second table with much more data:

CREATE TABLE tree2 (key int NOT NULL, id int NOT NULL);

DO $$
  FOR i IN 1..10000 LOOP

    FOR j in 1..100 LOOP
      INSERT INTO public.tree2 VALUES (i,j);

$$ LANGUAGE 'plpgsql';

Once we have 1,000,000 rows in the table, we can take a look at the index meta-data to see what we’re dealing with:

SELECT * FROM bt_metap('pk_tree2');

 magic  | version | root | level | fastroot | fastlevel 
 340322 |       2 |  412 |     2 |      412 |         2

From the bt_metap function, we can tell that there are 2 levels in the table and the root of the tree is on page 412. Let’s look at what’s on this root page:

SELECT * FROM bt_page_items('pk_tree2', 412);

 itemoffset |   ctid   | itemlen | nulls | vars |          data           
          1 | (3,1)    |       8 | f     | f    | 
          2 | (411,1)  |      16 | f     | f    | 14 04 00 00 0b 00 00 00
          3 | (698,1)  |      16 | f     | f    | 27 08 00 00 15 00 00 00
          4 | (984,1)  |      16 | f     | f    | 3a 0c 00 00 1f 00 00 00
          5 | (1270,1) |      16 | f     | f    | 4d 10 00 00 29 00 00 00
          6 | (1556,1) |      16 | f     | f    | 60 14 00 00 33 00 00 00
          7 | (1842,1) |      16 | f     | f    | 73 18 00 00 3d 00 00 00
          8 | (2128,1) |      16 | f     | f    | 86 1c 00 00 47 00 00 00
          9 | (2414,1) |      16 | f     | f    | 99 20 00 00 51 00 00 00
         10 | (2700,1) |      16 | f     | f    | ac 24 00 00 5b 00 00 00

This is showing us how data is laid out throughout the index. We can look on pages 3, 411, 698, 984, 1270, 1556, 1842, 2128, 2414, and 2700 and see what data is on the intermediate (non-root, non-leaf) pages of the index. It’s similar to the data in the root of the index: it tells PostgreSQL how to find rows that fall within a certain range in the index. Whenever we finally drill down to the leaf level itself, we’ll see something that looks a lot like what we saw when we looked at the index pk_tree.


Long story short, PostgreSQL does not perform any shenanigans or trickery to compact the data on disk. If we have what seems like a hierarchical index on two columns (parent, child), all values will still be stored on disk. Data is stored in a heap structure that is supported by additional indexes that speed up lookups.

PostgreSQL Extensions

PostgreSQL Extensions

Dimitri Fontaine has been working on adding extension packaging functionality to PostgreSQL for a while. His hard work has paid off and PostgreSQL 9.1 will have extensions.

How will extensions work? At a cursory glance, extensions are similar to libraries or DLLs in the programming world. Extensions are packaged units of code that can be installed using the CREATE EXTENSION command. This command will execute a script file that installs any of the extensions objects, including any C code that needs to be built or loaded. If this was the only thing that extensions provided, it would be a nice feature. Interestingly, there’s more.

PostgreSQL’s extensions may consist of any number of database level objects – tables, functions, views, etc. These objects are versioned together as part of a discrete package. While PostgreSQL won’t let you drop any database object that’s part of an extension, you could go ahead and break an extension using CREATE OR REPLACE FUNCTION. Those changes won’t be tracked as part of the extension.

Once our extension has been installed, it’s going to be managed as a versioned chunk. The best way to make changes it to upgrade the extension using code supplied by the extension developer. The extension code lets developers supply a version number, as well as some other metadata. To be fair, the extension code in PostgreSQL requires that extension developers supply a version number; extensions without a version number will fail on creation with some kind of helpful error message about an invalid version name. But, once we’ve created an extension with a version number we get a few niceties.

  1. We can tell which version of an extension in installed in every database on our server by looking in the pg_extension table of each database.
  2. We can perform updates by comparing the current version to update scripts. Admittedly, ALTER EXTENSION looks like it should handle this for us, but it’s still cool.

What else do we get from the new extensions?

Configuration! Extensions have to ways of setting up their configuration: control files and configuration tables. Extension developers can supply limited configuration through a control file. The control file lets an extension developer list things like the name, a description, and any other extensions that are needed for proper functionality.

In addition to the control file, extension specific configuration information through configuration tables. Not only can we supply additional functionality, but it’s possible to make our functionality flexible. Like all important data, it’s important to back up the configuration. To make sure that users’ configuration information is backed up, extension developers can use the pg_extension_config_dump function in their extension installation code to mark extension tables that should be a part of back ups; PostgreSQL will ignore extension tables when pg_dump is run. Interestingly, pg_extension_config_dump takes a second parameter that will be used as a search predicate when pg_dump is run. This can be used to only backup user supplied parameters or work through whatever convoluted scheme you’d like (parameters are only backed up on bank holidays).

We also get extra control about where extensions live. Have you ever installed software from a vendor and been a bit annoyed about where it ends up in your database? Both the CREATE EXTENSION and ALTER EXTENSION commands make it easy to relocate an extension to a new schema. DBAs can opt to install extensions into one schema and move them into another at a later date. We could install a vendor library and move it to a safe schema at another time, for example ALTER EXTENSION magic_vendor_security_audit_tools SET SCHEMA vendor_audit;. ALTER EXTENSION will even update the definition of tables that already exist.

I suspect that the extension mechanism in PostgreSQL 9.1 is going to have far reaching effects for PostgreSQL. The PostgreSQL development team are already moving procedural languages into extensions and bundling additional functionality (formerly living in the contrib folder) into extensions. This addition to PostgreSQL should also make it much easier for software vendors to supply additional functionality for PostgreSQL.

PostgreSQL Tutorial – Inserting Data

In the last two articles, we talked about creating your first PostgreSQL database, the syntax of createdb and psql, and creating tables in the database. Next up, we’re going to look at filling those tables with data.

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila 
psql -U jeremiah -f pagila-insert-data.sql pagila

Loading Data From A File

We’re going to be looking at the pagila-insert-data.sql file. There is a separate file, pagila-data.sql, but we will not be using it. Both files accomplish the same purpose, but pagila-insert-data.sql uses standard SQL syntax whereas pagila-data.sql uses the COPY command do the bulk of the work. Since our goal is to familiarize ourselves with basic SQL, we’re going to use pagila-insert-data.sql.

Preparation is Key

The first thing you will notice is that there are close to 100 lines of options and comments (comments are start with -- and end at the end of the current line) before we get to the actual meat of the file. This is because we need to set up common settings before we start working. Let’s take a look at these options line by line.

SET client_encoding = 'UTF8';

Strings will be treated as Unicode strings. This makes sure that characters from non-Latin alphabets can easily be inserted into the database.

SET standard_conforming_strings = off;

The standard_conforming_strings setting ensures some backwards compatibility with PostgreSQL behavior – we can use ‘' as an escape character in strings. When this is set to off, PostgreSQL will use the ‘' as an escape character. The goal is to, some day, default this setting to on.

SET check_function_bodies = false;

We are, in effect, recovering from a database dump. By setting check_function_bodies to false we tell PostgreSQL that we want to put off a lot of checks until later because things might not be completely set up at the moment and we’ll accept the risks of doing this, thanks. I would never recommend using this setting in a production environment unless you’re recovering from a data dump. If that’s the case, knock yourself out.

SET client_min_messages = warning;

This controls the amount of messages that we are going to receive from PostgreSQL. Leaving this alone is a good thing but it’s important to know that you can mute all messages from the server if you so desire.

SET escape_string_warning = off;

The escape_string_warning setting actually goes along with the standard_conforming_strings setting and has to do with how PostgreSQL will behave when it encounters a ‘' character rampaging about in our code.

SET search_path = public, pg_catalog;

There’s a history in the PostgreSQL world of omitting the schema qualification from table names in queries. So, what we’re doing here, is telling PostgreSQL that we would like to look for table first in the public database schema and then, second, in the pg_catalog schema. Don’t worry if you don’t know what schemas are all about, I’ll explain them later. Right now you can think of schemas as if they were folders for holding tables, functions, and the like. The search_path variable works just like the PATH variable on your operating system – it’s where PostgreSQL is going to look for functionality.

The Anatomy of an Insert

I wish that I could tell you that there is something complex going on with the pagila inserts. But, unfortunately, they’re about as boring as an insert could possibly be:

INSERT INTO  -- PostgreSQL needs to listen up, we have 
             -- data coming in.
actor        -- We're adding data to the actor table,
             -- specifically into the following columns
(actor_id, first_name, last_name, last_update) 
VALUES       -- and here is a list of the values that 
             -- we are going to insert.
(1, 'PENELOPE', 'GUINESS', '2006-02-15 09:34:33');

That’s all there is to a basic insert statement – we tell PostgreSQL that we want to add a new row to the database in a specific table and then supply a list of columns and values that we would like to add. Similar code happens about several hundred more times in that file.

This is made slightly more interesting by ALTER TABLE actor DISABLE TRIGGER ALL;. This code ensures that any and all triggers on the table actor will not fire while we’re inserting our rows. Why would we want to bypass all of the triggers? Well, we don’t want any extra validation to run, nor do we want a trigger to run that will populate the primary key column of our tables with the value from a sequence – we already have those values and our insert scripts would be much more complicated if they were filled with lookup queries too.

Adding a Brand New Record

Let’s go ahead and add a new movie. Pagila is a movie rental store and we’re not going to be able to keep up with the competition without adding a new movie. Keeping with Pagila’s standard of using nothing but gibberish movie tiles, we’re going to be adding the movie “Poseidon Sundance.”

Of course, creating the movie is not as easy as simply adding the title and walking away. People need to be able to make sure that the movie is acceptable for their tastes, has their favorite actor, or maybe just find out if the film is kid friendly.

-- Here we specify the table where we're going to add the data,
--  and then list the columns where we're going to actually 
--  add the data. You don't need to list out all of the 
--  columns.
-- P.S. Lines that start with two lines like this are comments.
--  The PostgreSQL command parser will ignore them.
  'A thrilling story of a dog and a moose who have to journey across Finland to be reunited with their owners',
  1, -- language_id, this is English
  6, -- original_language_id, this is German
  3, -- rental_duration
  4.99, -- rental_rate
  114, -- length in minutes
) ;


-- Ee need to put the film into categories so people can find it.
INSERT INTO film_category
SELECT  film_id ,
        9 -- It's a  foreign film
FROM    film
/* This is also a comment. It's a multi-line comment, just like in
   C++, C#, or Java.

   We're combining an INSERT with a SELECT because we need to find out
   the film_id of the movie we just added. There are occasions where 
   we could take a different route, but we'll cover those later.

INSERT INTO film_category
SELECT  film_id ,
        8 -- films about animals are almost always family movies
FROM    film

-- Add the main actors
INSERT INTO film_actor
SELECT  71 ,
FROM    film

INSERT INTO film_actor
SELECT  123 ,
FROM    film

INSERT INTO film_actor
SELECT  97 ,
FROM    film

INSERT INTO film_actor
SELECT  49 ,
FROM    film

INSERT INTO film_actor
FROM    film

Once we’ve created our movie, let’s try to find it in the database. We can do this by writing a simple query. Don’t worry if you don’t understand everything that’s going on, everything is going to be explained in good time.

FROM    film f
        JOIN film_actor fa ON f.film_id = fa.film_id

If you’ve done everything right, this should return five rows – one for each actor in the movie.

FROM    film f
        JOIN film_category fc ON f.film_id = fc.film_id

And this should only return two rows – one for every category the movie is in.

A Flaw in the Plan

When I was originally writing this tutorial, I didn’t pay any attention to some of the views in the database that use the film table (views are magical virtual tables that we’ll talk about later). I was getting ready to test things and I discovered a big flaw in and I thought I would highlight it here.

FROM    film_list fl

This will return two rows for you. Why?

Well, it turns out whoever wrote the film_list view assumed that each movie will only belong to one category even though it can feasibly belong to as many as we want. There are a few places that we could fix this error. Before reading on see if you can think of how we could fix this.

Fixing the Flaw

There are multiple places we could fix this. The first place would be in the definition of the view. We could change the view to display all of the categories in a comma separated list, just like we’re doing for all of the actors in the film.

The next place that we can fix this is in the data itself. Apart from deleting the extra row, we can do one of two things. We can create a uniqueness requirement (called a unique constraint) on the film_category table so that only one film can be in the table at a time. This is a limiting feature of the application. My preference would be to create a unique constraint on both the film_id and category_id columns. That ensures that we will never have a duplicated row for a film and category combination, but we can still add multiple categories. Combining this with changes to the view fixes this flaw in our database design.

We’ll cover this issue later.


So, there you have it – inserts are very simple statements. We can insert entirely new data that comes from somewhere outside of the database, we can insert data that’s a mash up of other information in the database, or we can do something in between.

What’s important to take away from this is that it’s very easy to insert information into PostgreSQL. We tell PostgreSQL that we’re going to insert data into a table, list the columns that we’re going to fill with data, and then define the data that we’re going to insert.

PostgreSQL Tutorial – Referring to Other Tables

Referring to Other Tables

In the previous article you learned how to create a single table. One table does not make for a very interesting database. Since we looked at how the actor table was created, let’s take a look at the most logical follow up: the film table.

The film table

Without going into all of the details of the film table, we do know that the table looks something like this:

    film_id                 int4 NOT NULL DEFAULT nextval('film_film_id_seq'::regclass),
    title                   varchar(255) NOT NULL,
    description             text NULL,
    release_year            year NULL,
    language_id             int2 NOT NULL,
    original_language_id    int2 NULL,
    rental_duration         int2 NOT NULL DEFAULT 3,
    rental_rate             numeric(4,2) NOT NULL DEFAULT 4.99,
    length                  int2 NULL,
    replacement_cost        numeric(5,2) NOT NULL DEFAULT 19.99,
    rating                  mpaa_rating NULL DEFAULT 'G'::mpaa_rating,
    last_update             timestamp NOT NULL DEFAULT now(),
    special_features        text NULL,
    fulltext                tsvector NOT NULL,
    PRIMARY KEY(film_id)

There are a few features that will be covered later – like enumerations and full text searching – but this is a pretty standard table. The one thing that’s missing, though, is a way to reference the actors in each film. We could do this by adding a number of columns (actor1_id, actor2_id, etc.) but that would become cumbersome and is not an effective way to model data.

The alternative is to create what is commonly called a cross reference table.

Cross Reference Tables

A cross reference table is a simple way to create a cross reference between two objects that have a many-to-many relationship. This is typically done using a two column table – one column for the primary key of each table:

CREATE TABLE public.film_actor  ( 
    actor_id    int4 NOT NULL,
    film_id     int4 NOT NULL,
    last_update timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY(actor_id,film_id)

Here’s the problem: as it stands, we could insert any values into this table. The whole point of the film_actor table is to map valid actors with valid films. Without any restrictions in place, we can’t make sure that any actor or film id in our table isn’t changed to an invalid value.

Enter Referential Integrity

Referential integrity is a way of saying that every value of a column in one table must exist as a value in a column in another table. In other words: every value in the referenced column or columns must exist in the referenced table. PostgreSQL makes it easy for us to enforce referential integrity through the use of foreign keys.

In order to make sure that we only have valid actors in the film_actor table, we’ll create a foreign key:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_actor_id_fkey
FOREIGN KEY (actor_id)
REFERENCES actor(actor_id)

That’s a very shorthand way of saying

The table "public.film_actor"
has a foreign key constraint named "film_actor_actor_id"
on the column "actor_id"
which references the column "actor_id" in the table "actor"
  so that every value of "actor_id" in "public.film_actor"
  must also be present in the "actor_id" column 
  of the table ""
if the value of "actor_id" is changed in the "" table
  we should update all values in "public.fim_actor" that are the same
and if the value of "actor_id" is deleted from the "" table
  we should prevent that delete from happening if there are rows
  in the "public.film_actor" table with the same value

We’ll create a similar foreign key to enforce referential integrity to the film table:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_film_id_fkey
FOREIGN KEY (film_id)
REFERENCES film(film_id)

This foreign key is almost identical to the first one. The only difference between the two are the names of the table being referenced. If you’ve taken a look at the other tables in the database, or if you’ve looked back at the previous article about creating the schema, you may have noticed something else: all of our foreign keys reference the primary key of another table.

Foreign keys must refer to a unique value, this prevents us from creating a foreign key that points to any column. However, since a primary key requires every value in that column be unique, it uniquely identifies the row, we can safely create a foreign key that points to the primary key column.

What if we want to create a foreign key that points to a different column? We can do that too. PostgreSQL has a feature called a unique constraint. This is an internal database feature that makes it possible to ensure that all values in a column, or set of columns, are unique. A unique constraint is actually implemented as a unique index in PostgreSQL, just as it is in many databases.

CREATE UNIQUE INDEX idx_unq_manager_staff_id

This index wouldn’t be a good candidate for a foreign key, but it does illustrate how you can create a unique index that you can reference with a foreign key.


Table Design Patterns: Cross-Reference Validation

PostgreSQL Tutorial – Creating the Pagila Schema

In the previous article, we talked about creating your first PostgreSQL database and I went over the syntax of createdb and psql in a bit of detail.

Next up, we’re going to look at creating the structure of the pagila database.

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila 
psql -U jeremiah -f pagila-insert-data.sql pagila

Working With SQL

Even though SQL is the name of the language that we use to work with data, there are two subsets of SQL that you will hear people refer to: DML and DDL.


DDL is the common abbreviation for Data Definition Language. This is the name for the part of SQL that is used for creating and modifying database structures – tables, indexes, views, etc. The pagila-schema.sql file is chock full of DDL and is a great introduction to how to create objects in PostgreSQL.


DML stands for Data Manipulation Language. This is the other subset of SQL. This is how we read, update, insert, and delete data. We’ll be talking about DML in the next article.


Tables are the heart and soul of a relational database – they’re where data is stored. Let’s take a look at creating the actors table:

    actor_id    int4 NOT NULL DEFAULT nextval('actor_actor_id_seq'::regclass),
    first_name  varchar(45) NOT NULL,
    last_name   varchar(45) NOT NULL,
    last_update timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY(actor_id)
CREATE INDEX idx_actor_last_name
CREATE TRIGGER last_updated
     EXECUTE PROCEDURE last_updated()

CREATE FUNCTION last_updated() RETURNS trigger
    AS $$
    NEW.last_update = CURRENT_TIMESTAMP;
END $$
    LANGUAGE plpgsql;

What’s going on here? Well, we’re creating a table named actor, for starters (I’ll come back to the sequence in a minute). The actor table contains four columns: actor_id, first_name, last_name, and last_update. The first_name and last_name columns are the simplest, so I’m going to start with them.

Both the first_name and last_name columns store character data – strings of text. These columns store variable length character data that can hold up to 45 characters. Both of these columns have been created as NOT NULL. This means that the columns require a value, even if that value is an empty string. Ideally you will have additional logic in place to prevent empty strings from being stored in the database as meaningful values.

The last_update column holds a date and time. The first time a row is inserted into the actor table, the last_update column will default to the current date and time. There is a trigger on this table, too, that will update last_update with the time of the update. While last_update does not contain any time zone specific information, it is possible for PostgreSQL to store time zone information by using the timestamp with time zone data type.

Finally we come to the actor_id column. This column is used as the primary key – it uniquely identifies any row in the actor table and requires the the values in the row not be null. I’ve omitted the code to create the actor_actor_id_seq sequence, but I’ll summarize. A sequence is similar an IDENTITY column in SQL Server or an AUTOINCREMENT column in MySQL, it just requires manual intervention. Whenever we want to increment the value of a sequence, we need to use the nextval() function to get the next value from the sequence. In the actor table, our primary key is a 4-byte number that will automatically be incremented by 1 every time a new row is added.

Once we have the four main columns created, we create an index idx_actor_last_name. Indexes make it possible for PostgreSQL to rapidly find data in a large table. It’s possible to create an index on more than one column. In this case, though, we’ll be creating our index on just the one column. This makes it easier to find actors by their last name.

The last thing we do is create a trigger that runs after an update. Triggers let us run pieces of code after an event (such as an insert, update, or delete) happens on a table. The last_updated trigger runs a function named last_updated before any row is updated in the database. The last_updated function modifies the NEW database row (a virtual row used in inserts and updates on row-level triggers). This new row will be inserted into the table with a new value in the last_update column.

You’re probably wondering what’s with the ::regclass and the $$ when we created the function. Be patient, everything will be explained in time. Right now it’s more important to focus on the higher level concepts and slowly move into the details as time goes on.


Creating a basic table is a simple operation, once you know what you want to store in it. You name the table and then supply a list of columns with a data type for each one. This helps PostgreSQL figure out the optimum way to store data on disk.

PostgreSQL Tutorial – Creating Your First Database

Once you’ve installed PostgreSQL (OS X Instructions) (other platforms), you’ll probably want to get started with a new database. Unfortunately PostgreSQL does not include a sample database by default.

You can download the pagila sample database from the pgFoundry (direct download link). Pagila is a simple movie rental database, sort of like what you would get if you ran a video rental store. It is also a port of a sample MySQL database with changes made to take advantage of PostgreSQL’s features – you can read about the differences in the README file. Installing pagila is easy:

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila 
# There will be a great deal of text here about creating tables and other database
#  objects. You can read through the create file if you'd like, but we're going to be
#  taking a look around so you can hold off for a minute
psql -U jeremiah -f pagila-insert-data.sql pagila
# You should see a lot of output from psql again talking about inserting rows
#  and altering seqvals. Once again, we'll be looking at this data over

Before you blindly copy and paste the above and run it, let’s go over everything that we’re looking at.


createdb is a command line tool that is installed with your PostgreSQL installation. This is one way to create a new database using PostgreSQL. If you supply the -e command line option (createdb -U jeremiah -e pagila), you can see the commands that are actually being run against PostgreSQL. In this case it’s as simple as CREATE DATABASE pagila;. We need to tell PostgreSQL which user we’re using to connect because only certain users have the appropriate rights to create new databases. This isn’t something that just anyone can do, after all.


psql is a command line tool that is used to connect to your PostgreSQL databases. psql has a number of options, but I’m only going to cover the options that we’re directly using to set things up for the pagila database.


The -U option is used to specify the user name to use to connect to the database instead of the default user. Unless you’ve changed something during installation, the default user is the postgres admin account.

-f [filename]

psql can operate in two modes: an interactive mode and a file input mode. Passing the -f option to psql will use the commands in the file filename. You could use the input redirection operator (<), but using -f has the advantage of processing the file as a single batch and subsequently providing helpful line numbers in the event of an error.

The database

The last parameter we’re passing is the name of the database. If you want to connect to a database on a different host, you would use the -h or --hostname command line flag to change the database host.

createdb syntax

Some Minutiae About PostgreSQL Databases

PostgreSQL, like many other databases, has a vast set of features. Two of these features, collations and tablespaces, aren’t immediately important for understanding what we’re doing here, but it’s a good thing to know about them for when you get into more advanced scenarios.

Collations and Locales

When you create a new database you can also specify a collation for the database. Collations are used to determine how text is stored and sorted within PostgreSQL. You can set the collation when PostgreSQL is first configured, during database creation, or when connecting to the database. There are a few gotchas that aren’t important here, but you can read about them in the PostgreSQL documentation on Local Support.


You can also specify the default tablespace when you create a new database. A tablespace is the place where database objects are stored on disk. If you’re familiar with SQL Server, you will know this as a filegroup. This is a more advanced feature and can be used to achieve incredible performance benefits by spreading physical database objects across multiple physical disks. Like collation, we’re not going to worry about tablespaces for now, but it is important to know that the feature is available.

T-SQL Tuesday – “Did he just say that?” edition

As in, I didn’t participate in the most recent T-SQL Tuesday about my favorite feature in SQL 2008 R2.

Want to know my favorite 2008R2 features? PostgreSQL 9.0 and MongoDB.

PostgreSQL and MongoDB are rapidly advancing features that solve my daily problems. I’m an OLTP guy. Honestly, I don’t care about the latest reporting ding dongs and doo dads. I already know PowerShell. I manage 6 production servers and we’re unlikely to grow, so these MDM and Utility Control Points don’t make me giddy with excitement.

I solve problems using SQL.

You know what makes me happy? Support for window functions or better yet, improved support for window functions. What about exclusion constraints? Or column level triggers so I don’t have to use branching logic in my triggers? Yes, I use triggers. Or any other of these features.

What about MongoDB? I’ve just started playing with it, but it solves a lot of the problems I face at work. Not just in the day job, but in side projects as well. I’ve bitched about O/R-Ms before, but one of the biggest problems that users of O/R-Ms (developers) face is that the ideal way to model data for object-oriented programming bears no resemblance to the ideal way to store relational data. A recent article about scaling Farmville hints at this – the developers of Farmville managed scale by storing everything in a key-value store (memcached) before persisting to a relational data store later. Digg does something similar with Cassandra. It’s not like these guys are idiots, the blog posts from Digg show that they know their stuff.

MongoDB lets me solve these problems. I can control the frequency of syncs to disk (just as I can in PostgreSQL) to improve raw write performance to memory. I only have to worry about storing data the way my application expects to see the data – arrays and hashes – without worrying about building many-to-many join tables.

What about DMVs and data integrity and a write-ahead log and indexes? MongoDB has instrumentation and indexes. Yeah, you sacrifice some durability but many applications don’t need that. Hell, Amazon has even designed their systems to account for the potential of failure.

When I start my next application, I’m going to look long and hard at the platform I’m building on. There’s a good chance it’s not going to be a relational database and a really good chance it’s not going to be using SQL Server. It’s not because I have a problem with SQL Server or even RDBMSes, but because there are other choices that give me the flexibility I need to solve the problems I’m facing.

This is nothing against SQL Server 2008 R2, it’s a great step forward in the direction that SQL Server seems to be going. Sometimes I wonder if SQL Server and I are on the same road.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 531 items have been purified.