Tag tutorial

Getting Started with Hive

This is a follow up to the last two posts I made about querying HBase and Hive. The set up for this was a bit trickier than I would have liked, so I’m documenting my entire process for three reasons.

  1. To remind myself for the next time I have to do this.
  2. To help someone else get started.
  3. In the hope that someone will know a better way and help me improve this.

Installing Hadoop and Hive

This was the easiest part of the installation process. I used the Cloudera distribution of Hadoop for everything in this stack. There are a few ways to go about getting the Cloudera Distribution of Hadoop (CDH for short) on your machine. The Hadoop Quick Start is a good place to get started.

Java

I’m pretty sure we’ve covered this before, but you need to have Java 1.6 installed in order to use any part of the Hadoop ecosystem.

Windows

If you’re using Windows, I recommend using the Hadoop VMWare image that Cloudera make available. If you want to run Hadoop natively on Windows, you need to set up Cygwin and creating a UNIX-like environment on your local machine. Setting. up Cygwin can be a bit tricky the first time you do it. Luckily, you’re not alone and documentation does exist: Setting up a Single-Node Hadoop “Cluster” on Windows XP.

Linux

Linux users have it the easiest out of any platform, in my opinion. Most of the documentation assumes that you’re running some UNIX-based operating system. If you’re willing to use the CDH version of Hadoop, you can tell your package manager to talk to the Cloudera repository. Cloudera have thoughtfully provided documentation for you in the installation guide.

You issue a few commands and you’ll have a working Hadoop set up on your machine in no time.

Other *NIX

This is the route that I took with OS X. You could also do the same on Linux if you want to use the stock distribution of Hadoop. The Hadoop Quick Start guide covers this in some detail, but here’s what I did:

wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz
wget http://archive.cloudera.com/cdh/3/hive-0.5.0+32.tar.gz
wget http://archive.cloudera.com/cdh/3/pig-0.7.0+9.tar.gz

I copied these three files into my /opt folder and uncompressed them. As you can see in the screenshot, I renamed the hadoop, hive, and pig folders to use a cdh_ prefix. This makes it easier for me to identify where the installation came from, just in case I need to upgrade or raise a stink with someone.

Aliasing Hadoop for Easy Upgrades

Configuration

Nothing is ever quite as easy as it seems at first, right? Now that we have our software installed, we need to supply some configuration information. Hadoop has three modes.

  1. Standalone In standalone mode – nothing runs as a service or daemon. This requires no configuration on our part, but it does make things trickier with Hive because only one Hive database connection can exist at a time in standalone mode.
  2. Pseudo-distributed This is what we’re aiming for. In pseudo-distributed mode everything is running on one machine as a service process. What we’re actually doing (sort of) is configuring our Hadoop installation to a be a one node cluster. It’s not reliable or failure proof (or a good idea), but it does let us develop on our workstations and laptops instead of buying a lot of servers for the house.
  3. Fully distributed This is a real live cluster running on many servers. Books are being written about this.

Hadoop is, by default, set up to run in standalone mode. Having a standalone Hadoop/HDFS installation is nice, but it doesn’t do what we need in the long-term.

There are Linux packages for CDH that will set up your server to run in pseudo-distributed mode. If you used one of those, you can skip ahead. Personally, I’d stick around just so you can read up on what you’re missing and so you make me feel better (I’ll know if you skip ahead).

core-site.xml

The core-site.xml file is used for machine specific configuration details. The default configuration is set up in core-default.xml. Settings in core-default.xml are overridden by core-site.xml, so there’s no reason to go messing around in there. If you want to check out the contents of the file, it’s located in build/classes/core-default.xml.

Once you’ve opened, or created your own core-site.xml, you want to put appropriate information in there. Mine looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>8</value>
    <description>The maximum number of tasks that will be run simultaneously by a
    a task tracker
    </description>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>

I’ll summarize these three settings here. A full explanation of the core-site.xml file can be found in the Hadoop wiki: http://hadoop.apache.org/common/docs/current/cluster_setup.html.

fs.default.name

This setting is telling Hadoop to use HDFS running on the local host for storage. This could also be a local file or something in Amazon S3.

mapred.tasktracker.tasks.maximum

We want to set this to a sane value because this is the maximum number of MapReduce tasks that Hadoop will fire up. If we set this value too high, Hadoop could theoretically start up a large number of tasks and overwhelm our CPU.

dfs.replication

Non-relational databases like Hadoop get their resiliency by replicating data throughout the database cluster. Since we’re running in pseudo-distributed mode (it’s a cluster with only one node) we want to set this to 1.

hadoop-env.sh

I changed a few variables in my hadoop-env.sh file based on settings I found in the Hadoop Wiki article Running Hadoop on OS X 10.5 64-bit). You can set these environment variables anywhere in your operating system, but since they’re Hadoop-specific, it’s probably best to set them in the hadoop-env.sh file. Based on my own experience, I did not change the HADOOP_CLASSPATH variable. Here’s what the first 15 lines of my hadoop-env.sh looks like:

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/

# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000

I’ve also added some additional configuration to my .profile that make working with Hadoop and Hive a lot easier:

export CLASSPATH=$CLASSPATH:/Library/PostgreSQL/9.0/share/java/postgresql-9.0-801.jdbc4.jar

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$HADOOP_HOME/bin:$PATH
export HIVE_HOME=/opt/hive
export PATH=$HIVE_HOME/bin:$PATH
export PIG_HOME=/opt/pig
export PIGDIR=$PIG_HOME
export PATH=$PIG_HOME/bin:$PATH
export HIVE_AUX_JARS_PATH=/Library/PostgreSQL/9.0/share/java/

# CLASSPATH variables don't seem to like symbolic links on OS X
# setting up the CLASSPATH here makes it easy to change things around
# in the future as I upgrade between Hadoop/Hive/HBase/Pig versions
hadoop_dir=$HADOOP_HOME;
hadoop_version=0.20.2+737;
pig_dir=$PIG_HOME;
pig_version=0.7.0+16;

export CLASSPATH=$CLASSPATH:${hadoop_dir}/hadoop-${hadoop_version}-core.jar:${hadoop_dir}/hadoop-${hadoop_version}-tools.jar:${hadoop_dir}/hadoop-${hadoop_version}-ant.jar:${hadoop_dir}/lib/commons-logging-1.0.4.jar:${pig_dir}/pig-${pig_version}.jar:${pig_dir}/pig-${pig_version}-core.jar

One Last Step

Before we can start anything running on Hadoop we have to get our namenode ready. The namenode is part of Hadoop that manages information for the distributed filesystem. A namenode knows where all of the data is stored across our distributed filesystem. Without a running namenode, we can’t find the files we’re storing.

Before we get started, we have to format our filesystem. This process creates empty directories to be used for the namenode’s data. Datanodes aren’t involved because they can be added and removed from the cluster dynamically.

hadoop namenode -format

You’ll see some output from this command and, once it’s all said and done, you’ll have a formatted namenode that’s ready to go.

Starting Hadoop

Once you have everything set up and installed, you should be able to start Hadoop by running start-all.sh from the command prompt. You should see some messages that tell you a bunch of nodes are being started with output going to a specific directory.

Starting Hadoop on the Command Prompt

Verifying That It’s Running

You can verify that everything is configured and running hadoop dfs -ls / on the command line. You should see a basic directory listing that looks something like this image.

Verifying that Hadoop is Running

What About Hive?

Astute readers will notice that we’ve only configured Hadoop and haven’t done anything to get Hive up and running. This is, unfortunately, a separate task. Hadoop and Hive are separate tools, but they’re part of the same tool chain.

Configuring the Metastore

Hive, like Hadoop, uses XML configuration files. But, since Hive is a database, it also needs a place to store metadata about the database – it’s called the metastore. By default, Hive uses what’s called an embedded metastore. The problem with the default configuration is that it only allows for one user or process at a time to connect to a Hive database. This isn’t that good for production databases. It’s also no good if you’re like me and forget which virtual desktop you left that query window open on. We’re going to fix this.

It’s possible to store the metastore’s information in a persistent external database. Hive comes configured to use the Derby database, however any JDBC-compliant database can be used by setting a few options. This is precisely what we’re going to do. There’s a delicious irony in storing one database’s configuration settings in another database. Since I’m a huge fan of PostgreSQL, that’s what we’re going to use to get started.

Creating the Metastore User

I don’t want Hive talking to PostgreSQL as me, or any other random user account, so we’re going to set up a new user.

CREATE USER hadoop WITH PASSWORD 'hadoop';
Creating the Metastore Database

Now that we have our user, let’s create the database:

CREATE DATABASE hadoop WITH OWNER = hadoop;
Database Driver

Of course, we also need a database driver. Up in the changes to my .profile I set up a HIVE_AUX_JARS_PATH variable. This is a : delimited path where we can tell Hive to look for extra functionality to load. This folder specifically contains the PostgreSQL JDBC driver. You can download the appropriate PostgreSQL JDBC driver from the PostgreSQL JDBC driver site. Appropriate documentation is also there. It’s pretty much as simple as copying the JAR file into an appropriate directory.

hive-site.xml

Now that we’ve created our metastore user and database as well as installed a JDBC driver, the last thing to do is to make some changes in hive-site.xml. The hive-site.xml overrides the settings in hive-default.xml – just like the Hadoop configuration we did earlier. One thing to keep in mind is that you can also override Hadoop configuration in this file as well. We’re not going to, but it’s good to know that it’s possible.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>hive.root.logger</name>
        <value>INFO,console</value>
    </property>
    <property>
        <name>org.jpox.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
        <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://localhost:5432/hadoop</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hadoop</value>
    </property>
</configuration>

The javax.dbo.option.* settings are where we set up our metastore database connection. Everything else should be, largely, self-explanatory.

Creating Tables

Let’s go ahead and create a table in Hive:

CREATE TABLE leagues (
  league_id STRING,
  league_name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

This creates a table named leagues with two columns: league_id and league_name. The last two lines might look a bit funny to people familiar with SQL. They are specific to the Hive Query Language (HiveQL). These lines say that each row in the underlying data file will be stored as tab-delimited text.

Getting Data Into Hive

We’ve configured Hadoop and Hive and we’ve created a table. How do we get data into the table? That’s a trickier matter.

We’re going to be using sample data from Retrosheet to populate our Hive database. You should get started downloading and keep reading, it takes a bit of time.

Loading Data With Pig

Pig is a way to execute data flows written in a language called Pig Latin. In this use case, I’ve written very primitive data clean-up scripts using Pig Latin (it would have been just as easy, if not easier, to write them in Ruby or Perl).

The first section of this Pig Latin script registers the PiggyBank contributed functionality and creates convenient aliases for the functions. We use DEFINE to set up the aliases so we don’t have to type the fully qualified function name every time we want to use it. It’s a lot easier to type replace(state, 'Ohio', 'OH') than it is to type org.apache.pig.piggybank.evaluation.string.REPLACE(state, 'Ohio', 'OH').

REGISTER /opt/pig/contrib/piggybank/java/piggybank.jar;
-- Create a function alias because nobody wants to type in the full function
-- name over and over again.
DEFINE replace org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE substring org.apache.pig.piggybank.evaluation.string.SUBSTRING();
DEFINE s_split org.apache.pig.piggybank.evaluation.string.Split();
DEFINE reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Read the first CSV file and dump it as a tab-separated file
league_codes = LOAD 'league-codes.csv'
USING PigStorage(',')
AS (league_id:chararray, league_name);

STORE league_codes
INTO 'league-codes'
USING PigStorage('\t');

names = LOAD 'retrosheet-CurrentNames.csv'
USING PigStorage(',')
AS (franchise_id:chararray,
    historical_franchise_id:chararray,
    league_id:chararray,
    division:chararray,
    location_name:chararray,
    nickname:chararray,
    alt_nickname:chararray,
    first_game_date:chararray,
    last_game_date:chararray,
    city:chararray,
    state:chararray);

-- We don't need to explicitly specify the PigStorage mechanism.
-- Pig will use tab separated files by default.
STORE names
INTO 'team_names';

-- It's our first transformation:
-- 1) Load the data from the text file.
-- 2) Remove double-quote characters from the data.
-- 3) Write the data to the filesystem.
ballparks = LOAD 'retrosheet-ballparkcodes.txt'
USING PigStorage(',')
AS (park_id:chararray,
    park_name:chararray,
    park_nickname:chararray,
    city:chararray,
    state:chararray,
    start_date:chararray,
    end_date:chararray,
    league_id:chararray,
    notes:chararray);

formatted_parks = foreach ballparks GENERATE
  park_id,
  park_name,
  park_nickname,
  city,
  state,
  start_date,
  end_date,
  league_id,
  replace(notes, '"', '');

STORE formatted_parks
INTO 'ballparks';

personnel = LOAD 'retrosheet-umpire.coach.ids.txt'
USING PigStorage(',')
AS (last_name:chararray,
    first_name:chararray,
    person_id:chararray,
    debut_date:chararray
    );

-- the debut date is the only weirdly formatted date in here, so we'll reformat
-- it to be sortable in YYYYMMDD format
--
-- I had originally intended to do something with this, but I never
-- got around to writing it. As it stands, it's a good way to show
-- how the string splitting function works (it returns an array)
-- and how that can be combined with flatten to turn it into a tuple
split_personnel = FOREACH personnel GENERATE
  last_name,
  first_name,
  person_id,
  flatten(s_split(debut_date, '/'))
AS (
    month:chararray,
    day:chararray,
    year:chararray);

-- These subsequent x_, y_, z_ stages load our personnel
-- relations into intermediate relations while we're doing
-- work on the data. This could have been done in a single
-- step, but it's much easier to read this way.
x_personnel = FOREACH split_personnel GENERATE
  last_name,
  first_name,
  person_id,
  reverse(CONCAT('000', month)) AS month,
  reverse(CONCAT('000', day)) AS day,
  year;

y_personnel = FOREACH x_personnel GENERATE
  last_name,
  first_name,
  person_id,
  reverse(substring($3, 0, 2)) AS month,
  reverse(substring($4, 0, 2)) AS day,
  year;

formatted_personnel = FOREACH y_personnel GENERATE
  last_name,
  first_name,
  person_id,
  CONCAT(CONCAT(year, month), day) AS debut;

STORE formatted_personnel
INTO 'personnel';

You might think that each of the STORE statements would produce a file named personnel or ballparks. Unfortunately, you’d be wrong. You see, Pig turns our Pig Latin script into a MapReduce job on the fly. The final output is written into the appropriately named folder in a part file. The part files have names like part-m-00000. This tells us that it’s part of a MapReduce job, specifically the map phase, and it’s the first part. If there were multiple map operations that were performed (if we had a full cluster), there would be multiple files with incrementing part numbers.

There is a second Pig Latin program that will load the standings, I’ve left that out of the article due to the sheer length of the code. However, it’s a relatively trivial script and should be easy enough to follow. The interesting thing to note is that the Pig Latin program to load the standings makes use of variable substitution. The driver script for this whole process, load_standings.sh, passes parameters to Pig on the command line and they are interpreted at run time.

For more information on Pig and Pig Latin, you should check out the Pig project page or the Pig Latin Cookbook.

Since we’ve formatted our data, how do we load it? It’s pretty simple to do with a Hive LOAD statement:

LOAD DATA LOCAL INPATH 'personnel/part-m-00000'
OVERWRITE INTO TABLE personnel ;

LOAD DATA LOCAL INPATH 'team_names/part-m-00000'
OVERWRITE INTO TABLE teams ;

LOAD DATA LOCAL INPATH 'ballparks/part-m-00000'
OVERWRITE INTO TABLE ballparks ;

LOAD DATA LOCAL INPATH 'league-codes/part-m-00000'
OVERWRITE INTO TABLE leagues ;

Because the data for the standings is separated out by year in the Retrosheet data set, we have to drive the insertion through a shell script:

echo "  loading Box Scores..."
for d in `ls -d stats/*`
  do
    year=${d:6:4}

    echo "    Loading box scores for $year"
    for f in `ls $d/part*`
    do
      `hive -e "LOAD DATA LOCAL INPATH '$f' INTO TABLE raw_standings PARTITION (year='$year'); "`;
    done;
  done;

This loops through the output folders for the standings one at a time and loads the data into Hive. The raw_standings uses partitioning to separate out the data by year. Similar to partitioning in a relational database, this lets us store data with the same partition key in the same physical file. This improves query performance when we’re querying on a limited range of data. In this case, we’re only partitioning by year, but it’s possible to partition on multiple columns for greater granularity.

To improve long term performance further, we’ve gone ahead and created a second table, rc_standings. The rc_standings table is partitioned by year, but it’s also using a ROW FORMAT setting to create an RCFile – a columnar files. The data in this table will be stored in a column-oriented layout. This makes it possible to skip reading from columns that aren’t used in our query. With a table as wide as the rc_standings table, that’s going to be beneficial for our query performance.

In order to get the data into the RCFile table, we have to copy it using a second driver in the shell script:

for d in `ls -d stats/*`
  do
    year=${d:6:4}
    # This only overwrites the $year partition in the standings table
    # We also make use of the REGEX Column Specification to avoid pulling back the
    #  partitioning column(s) in the query.
    #  See http://wiki.apache.org/hadoop/Hive/LanguageManual/Select#REGEX_Column_Specification
    #  for more about the REGEX Column Specification.
    hive -S -e "INSERT OVERWRITE TABLE rc_standings PARTITION (year=$year) SELECT \`(year)?+.+\` FROM raw_standings WHERE year = $year;"
  done;

Loading the standings took an incredible amount of time on my test system, despite using an SSD and having 8GB of memory available. I’m not sure why, and I would love to know what I did wrong, so feel free to sound off in the comments so we can make this process faster for people to get started.

Querying Hive

Once we have the data loaded, querying becomes possible. To connect to hive, open up a command prompt and type hive. You should see blank hive prompt waiting for your commands.

The Hive Prompt

Let’s take a look at the contents of a single table. Type select * from leagues; and hit enter. You should see something like the following:

A Simple Query

Let’s try something trickier and use a JOIN.

select l.league_name, t.franchise_id, t.first_game_date, t.last_game_date, t.nickname
from leagues l
join teams t on l.league_id = t.league_id;

Query Results - With a Join

So far, so good. These have both been very small queries. What if we want to look at some aggregated data. What then? Turns out that it’s just as easy to do as any other query.

select l.league_name, s.year, avg(s.home_score)
from rc_standings s
join leagues l on s.home_league_id = l.league_id
group by l.league_name, s.year;

Now We're Using GROUP BY

Once again, Hive is responding exactly as we would expect. That’s because HiveQL was designed to be as close to SQL as possible. The Hive engine is transforming our HiveQL into MapReduce jobs in the back end. Depending on the type of queries that we’re running, Our queries may execute on as few as one node in the cluster or as many as all nodes in the cluster.

While HiveSQL looks like SQL, some additions have been made that make sense for Hive – SORT BY guarantees that the order of results within a single reducer while ORDER BY guarantees the order of results in the total output. There is also a DISTRIBUTE BY command determines how our data will be distributed to the reducers that are handling the query.

Resources

The pig latin, shell script, and HiveSQL files used in this script can be found here, on Amazon S3.

The source data can be found here. Please note that you’ll need 7-zip to get to the source data. Which, by the way, was obtained free of charge and is copyrighted by Retrosheet. For more information, please contact www.retrosheet.org.

Cassandra Tutorial – Installing Cassandra

I was feeling a little bit crazy the other day and I installed Cassandra on my home computer. Twice. Just because. It wasn’t really a “just because” moment. I installed Cassandra twice so I could get the hang of it in case anyone asked me how they could go ahead and get started. The process was relatively painless, but there are some prerequisites that you need to have installed before you can get started. I thought that it would be good to create this short Cassandra installation tutorial to help you get started.

This video walks through everything you need to do get Cassandra installed and running on your computer. Apart from installing the OS. If you don’t have an OS, I don’t know how you’re reading this, probably witchcraft.

[media id=3 width=800 height=600]

PostgreSQL Tutorial – Inserting Data

In the last two articles, we talked about creating your first PostgreSQL database, the syntax of createdb and psql, and creating tables in the database. Next up, we’re going to look at filling those tables with data.

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila
psql -U jeremiah -f pagila-insert-data.sql pagila

Loading Data From A File

We’re going to be looking at the pagila-insert-data.sql file. There is a separate file, pagila-data.sql, but we will not be using it. Both files accomplish the same purpose, but pagila-insert-data.sql uses standard SQL syntax whereas pagila-data.sql uses the COPY command do the bulk of the work. Since our goal is to familiarize ourselves with basic SQL, we’re going to use pagila-insert-data.sql.

Preparation is Key

The first thing you will notice is that there are close to 100 lines of options and comments (comments are start with -- and end at the end of the current line) before we get to the actual meat of the file. This is because we need to set up common settings before we start working. Let’s take a look at these options line by line.

SET client_encoding = 'UTF8';

Strings will be treated as Unicode strings. This makes sure that characters from non-Latin alphabets can easily be inserted into the database.

SET standard_conforming_strings = off;

The standard_conforming_strings setting ensures some backwards compatibility with PostgreSQL behavior – we can use ‘' as an escape character in strings. When this is set to off, PostgreSQL will use the ‘' as an escape character. The goal is to, some day, default this setting to on.

SET check_function_bodies = false;

We are, in effect, recovering from a database dump. By setting check_function_bodies to false we tell PostgreSQL that we want to put off a lot of checks until later because things might not be completely set up at the moment and we’ll accept the risks of doing this, thanks. I would never recommend using this setting in a production environment unless you’re recovering from a data dump. If that’s the case, knock yourself out.

SET client_min_messages = warning;

This controls the amount of messages that we are going to receive from PostgreSQL. Leaving this alone is a good thing but it’s important to know that you can mute all messages from the server if you so desire.

SET escape_string_warning = off;

The escape_string_warning setting actually goes along with the standard_conforming_strings setting and has to do with how PostgreSQL will behave when it encounters a ‘' character rampaging about in our code.

SET search_path = public, pg_catalog;

There’s a history in the PostgreSQL world of omitting the schema qualification from table names in queries. So, what we’re doing here, is telling PostgreSQL that we would like to look for table first in the public database schema and then, second, in the pg_catalog schema. Don’t worry if you don’t know what schemas are all about, I’ll explain them later. Right now you can think of schemas as if they were folders for holding tables, functions, and the like. The search_path variable works just like the PATH variable on your operating system – it’s where PostgreSQL is going to look for functionality.

The Anatomy of an Insert

I wish that I could tell you that there is something complex going on with the pagila inserts. But, unfortunately, they’re about as boring as an insert could possibly be:

INSERT INTO  -- PostgreSQL needs to listen up, we have
             -- data coming in.
actor        -- We're adding data to the actor table,
             -- specifically into the following columns
(actor_id, first_name, last_name, last_update)
VALUES       -- and here is a list of the values that
             -- we are going to insert.
(1, 'PENELOPE', 'GUINESS', '2006-02-15 09:34:33');

That’s all there is to a basic insert statement – we tell PostgreSQL that we want to add a new row to the database in a specific table and then supply a list of columns and values that we would like to add. Similar code happens about several hundred more times in that file.

This is made slightly more interesting by ALTER TABLE actor DISABLE TRIGGER ALL;. This code ensures that any and all triggers on the table actor will not fire while we’re inserting our rows. Why would we want to bypass all of the triggers? Well, we don’t want any extra validation to run, nor do we want a trigger to run that will populate the primary key column of our tables with the value from a sequence – we already have those values and our insert scripts would be much more complicated if they were filled with lookup queries too.

Adding a Brand New Record

Let’s go ahead and add a new movie. Pagila is a movie rental store and we’re not going to be able to keep up with the competition without adding a new movie. Keeping with Pagila’s standard of using nothing but gibberish movie tiles, we’re going to be adding the movie “Poseidon Sundance.”

Of course, creating the movie is not as easy as simply adding the title and walking away. People need to be able to make sure that the movie is acceptable for their tastes, has their favorite actor, or maybe just find out if the film is kid friendly.

INSERT INTO film
-- Here we specify the table where we're going to add the data,
--  and then list the columns where we're going to actually
--  add the data. You don't need to list out all of the
--  columns.
-- P.S. Lines that start with two lines like this are comments.
--  The PostgreSQL command parser will ignore them.
(
  title,
  description,
  release_year,
  language_id,
  original_language_id,
  rental_duration,
  rental_rate,
  length,
  replacement_cost,
  rating
)
VALUES
(
  'POSEIDON SUNDANCE',
  'A thrilling story of a dog and a moose who have to journey across Finland to be reunited with their owners',
  '2006',
  1, -- language_id, this is English
  6, -- original_language_id, this is German
  3, -- rental_duration
  4.99, -- rental_rate
  114, -- length in minutes
  24.99,
  'G'::mpaa_rating
) ;

SELECT * FROM film WHERE title = 'POSEIDON SUNDANCE' ;

-- Ee need to put the film into categories so people can find it.
INSERT INTO film_category
(
  film_id,
  category_id
)
SELECT  film_id ,
        9 -- It's a  foreign film
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;
/* This is also a comment. It's a multi-line comment, just like in
   C++, C#, or Java.

   We're combining an INSERT with a SELECT because we need to find out
   the film_id of the movie we just added. There are occasions where
   we could take a different route, but we'll cover those later.
 */

INSERT INTO film_category
(
  film_id,
  category_id
)
SELECT  film_id ,
        8 -- films about animals are almost always family movies
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

-- Add the main actors
INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  71 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  123 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  97 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  49 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  3 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

Once we’ve created our movie, let’s try to find it in the database. We can do this by writing a simple query. Don’t worry if you don’t understand everything that’s going on, everything is going to be explained in good time.

SELECT  *
FROM    film f
        JOIN film_actor fa ON f.film_id = fa.film_id
WHERE   f.title = 'POSEIDON SUNDANCE' ;

If you’ve done everything right, this should return five rows – one for each actor in the movie.

SELECT  *
FROM    film f
        JOIN film_category fc ON f.film_id = fc.film_id
WHERE   f.title = 'POSEIDON SUNDANCE' ;

And this should only return two rows – one for every category the movie is in.

A Flaw in the Plan

When I was originally writing this tutorial, I didn’t pay any attention to some of the views in the database that use the film table (views are magical virtual tables that we’ll talk about later). I was getting ready to test things and I discovered a big flaw in and I thought I would highlight it here.

SELECT  *
FROM    film_list fl
WHERE   fl.title = 'POSEIDON SUNDANCE' ;

This will return two rows for you. Why?

Well, it turns out whoever wrote the film_list view assumed that each movie will only belong to one category even though it can feasibly belong to as many as we want. There are a few places that we could fix this error. Before reading on see if you can think of how we could fix this.

Fixing the Flaw

There are multiple places we could fix this. The first place would be in the definition of the view. We could change the view to display all of the categories in a comma separated list, just like we’re doing for all of the actors in the film.

The next place that we can fix this is in the data itself. Apart from deleting the extra row, we can do one of two things. We can create a uniqueness requirement (called a unique constraint) on the film_category table so that only one film can be in the table at a time. This is a limiting feature of the application. My preference would be to create a unique constraint on both the film_id and category_id columns. That ensures that we will never have a duplicated row for a film and category combination, but we can still add multiple categories. Combining this with changes to the view fixes this flaw in our database design.

We’ll cover this issue later.

Summary

So, there you have it – inserts are very simple statements. We can insert entirely new data that comes from somewhere outside of the database, we can insert data that’s a mash up of other information in the database, or we can do something in between.

What’s important to take away from this is that it’s very easy to insert information into PostgreSQL. We tell PostgreSQL that we’re going to insert data into a table, list the columns that we’re going to fill with data, and then define the data that we’re going to insert.

PostgreSQL Tutorial – Referring to Other Tables

Referring to Other Tables

In the previous article you learned how to create a single table. One table does not make for a very interesting database. Since we looked at how the actor table was created, let’s take a look at the most logical follow up: the film table.

The film table

Without going into all of the details of the film table, we do know that the table looks something like this:

CREATE TABLE public.film  (
    film_id                 int4 NOT NULL DEFAULT nextval('film_film_id_seq'::regclass),
    title                   varchar(255) NOT NULL,
    description             text NULL,
    release_year            year NULL,
    language_id             int2 NOT NULL,
    original_language_id    int2 NULL,
    rental_duration         int2 NOT NULL DEFAULT 3,
    rental_rate             numeric(4,2) NOT NULL DEFAULT 4.99,
    length                  int2 NULL,
    replacement_cost        numeric(5,2) NOT NULL DEFAULT 19.99,
    rating                  mpaa_rating NULL DEFAULT 'G'::mpaa_rating,
    last_update             timestamp NOT NULL DEFAULT now(),
    special_features        text NULL,
    fulltext                tsvector NOT NULL,
    PRIMARY KEY(film_id)
)

There are a few features that will be covered later – like enumerations and full text searching – but this is a pretty standard table. The one thing that’s missing, though, is a way to reference the actors in each film. We could do this by adding a number of columns (actor1_id, actor2_id, etc.) but that would become cumbersome and is not an effective way to model data.

The alternative is to create what is commonly called a cross reference table.

Cross Reference Tables

A cross reference table is a simple way to create a cross reference between two objects that have a many-to-many relationship. This is typically done using a two column table – one column for the primary key of each table:

CREATE TABLE public.film_actor  (
    actor_id    int4 NOT NULL,
    film_id     int4 NOT NULL,
    last_update timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY(actor_id,film_id)
)

Here’s the problem: as it stands, we could insert any values into this table. The whole point of the film_actor table is to map valid actors with valid films. Without any restrictions in place, we can’t make sure that any actor or film id in our table isn’t changed to an invalid value.

Enter Referential Integrity

Referential integrity is a way of saying that every value of a column in one table must exist as a value in a column in another table. In other words: every value in the referenced column or columns must exist in the referenced table. PostgreSQL makes it easy for us to enforce referential integrity through the use of foreign keys.

In order to make sure that we only have valid actors in the film_actor table, we’ll create a foreign key:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_actor_id_fkey
FOREIGN KEY (actor_id)
REFERENCES actor(actor_id)
ON UPDATE CASCADE
ON DELETE RESTRICT

That’s a very shorthand way of saying

The table "public.film_actor"
has a foreign key constraint named "film_actor_actor_id"
on the column "actor_id"
which references the column "actor_id" in the table "actor"
  so that every value of "actor_id" in "public.film_actor"
  must also be present in the "actor_id" column
  of the table "public.actor"
if the value of "actor_id" is changed in the "public.actor" table
  we should update all values in "public.fim_actor" that are the same
and if the value of "actor_id" is deleted from the "public.actor" table
  we should prevent that delete from happening if there are rows
  in the "public.film_actor" table with the same value

We’ll create a similar foreign key to enforce referential integrity to the film table:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_film_id_fkey
FOREIGN KEY (film_id)
REFERENCES film(film_id)
ON UPDATE CASCADE
ON DELETE RESTRICT

This foreign key is almost identical to the first one. The only difference between the two are the names of the table being referenced. If you’ve taken a look at the other tables in the database, or if you’ve looked back at the previous article about creating the schema, you may have noticed something else: all of our foreign keys reference the primary key of another table.

Foreign keys must refer to a unique value, this prevents us from creating a foreign key that points to any column. However, since a primary key requires every value in that column be unique, it uniquely identifies the row, we can safely create a foreign key that points to the primary key column.

What if we want to create a foreign key that points to a different column? We can do that too. PostgreSQL has a feature called a unique constraint. This is an internal database feature that makes it possible to ensure that all values in a column, or set of columns, are unique. A unique constraint is actually implemented as a unique index in PostgreSQL, just as it is in many databases.

CREATE UNIQUE INDEX idx_unq_manager_staff_id
ON public.store(manager_staff_id)

This index wouldn’t be a good candidate for a foreign key, but it does illustrate how you can create a unique index that you can reference with a foreign key.

References

Table Design Patterns: Cross-Reference Validation

MongoDB – Basic Querying

I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB.

[media id=2 width=558 height=410]

Download the sample code. A QuickTime/H.264/iPhone formatted video is also available – download it now.

PostgreSQL Tutorial – Creating the Pagila Schema

In the previous article, we talked about creating your first PostgreSQL database and I went over the syntax of createdb and psql in a bit of detail.

Next up, we’re going to look at creating the structure of the pagila database.

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila
psql -U jeremiah -f pagila-insert-data.sql pagila

Working With SQL

Even though SQL is the name of the language that we use to work with data, there are two subsets of SQL that you will hear people refer to: DML and DDL.

DDL

DDL is the common abbreviation for Data Definition Language. This is the name for the part of SQL that is used for creating and modifying database structures – tables, indexes, views, etc. The pagila-schema.sql file is chock full of DDL and is a great introduction to how to create objects in PostgreSQL.

DML

DML stands for Data Manipulation Language. This is the other subset of SQL. This is how we read, update, insert, and delete data. We’ll be talking about DML in the next article.

Tables

Tables are the heart and soul of a relational database – they’re where data is stored. Let’s take a look at creating the actors table:

CREATE TABLE public.actor  (
    actor_id    int4 NOT NULL DEFAULT nextval('actor_actor_id_seq'::regclass),
    first_name  varchar(45) NOT NULL,
    last_name   varchar(45) NOT NULL,
    last_update timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY(actor_id)
)
GO
CREATE INDEX idx_actor_last_name
    ON public.actor(last_name)
GO
CREATE TRIGGER last_updated
     BEFORE UPDATE ON actor FOR EACH ROW
     EXECUTE PROCEDURE last_updated()
GO

CREATE FUNCTION last_updated() RETURNS trigger
    AS $$
BEGIN
    NEW.last_update = CURRENT_TIMESTAMP;
    RETURN NEW;
END $$
    LANGUAGE plpgsql;

What’s going on here? Well, we’re creating a table named actor, for starters (I’ll come back to the sequence in a minute). The actor table contains four columns: actor_id, first_name, last_name, and last_update. The first_name and last_name columns are the simplest, so I’m going to start with them.

Both the first_name and last_name columns store character data – strings of text. These columns store variable length character data that can hold up to 45 characters. Both of these columns have been created as NOT NULL. This means that the columns require a value, even if that value is an empty string. Ideally you will have additional logic in place to prevent empty strings from being stored in the database as meaningful values.

The last_update column holds a date and time. The first time a row is inserted into the actor table, the last_update column will default to the current date and time. There is a trigger on this table, too, that will update last_update with the time of the update. While last_update does not contain any time zone specific information, it is possible for PostgreSQL to store time zone information by using the timestamp with time zone data type.

Finally we come to the actor_id column. This column is used as the primary key – it uniquely identifies any row in the actor table and requires the the values in the row not be null. I’ve omitted the code to create the actor_actor_id_seq sequence, but I’ll summarize. A sequence is similar an IDENTITY column in SQL Server or an AUTOINCREMENT column in MySQL, it just requires manual intervention. Whenever we want to increment the value of a sequence, we need to use the nextval() function to get the next value from the sequence. In the actor table, our primary key is a 4-byte number that will automatically be incremented by 1 every time a new row is added.

Once we have the four main columns created, we create an index idx_actor_last_name. Indexes make it possible for PostgreSQL to rapidly find data in a large table. It’s possible to create an index on more than one column. In this case, though, we’ll be creating our index on just the one column. This makes it easier to find actors by their last name.

The last thing we do is create a trigger that runs after an update. Triggers let us run pieces of code after an event (such as an insert, update, or delete) happens on a table. The last_updated trigger runs a function named last_updated before any row is updated in the database. The last_updated function modifies the NEW database row (a virtual row used in inserts and updates on row-level triggers). This new row will be inserted into the table with a new value in the last_update column.

You’re probably wondering what’s with the ::regclass and the $$ when we created the function. Be patient, everything will be explained in time. Right now it’s more important to focus on the higher level concepts and slowly move into the details as time goes on.

Summary

Creating a basic table is a simple operation, once you know what you want to store in it. You name the table and then supply a list of columns with a data type for each one. This helps PostgreSQL figure out the optimum way to store data on disk.

PostgreSQL Tutorial – Creating Your First Database

Once you’ve installed PostgreSQL (OS X Instructions) (other platforms), you’ll probably want to get started with a new database. Unfortunately PostgreSQL does not include a sample database by default.

You can download the pagila sample database from the pgFoundry (direct download link). Pagila is a simple movie rental database, sort of like what you would get if you ran a video rental store. It is also a port of a sample MySQL database with changes made to take advantage of PostgreSQL’s features – you can read about the differences in the README file. Installing pagila is easy:

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila
# There will be a great deal of text here about creating tables and other database
#  objects. You can read through the create file if you'd like, but we're going to be
#  taking a look around so you can hold off for a minute
psql -U jeremiah -f pagila-insert-data.sql pagila
# You should see a lot of output from psql again talking about inserting rows
#  and altering seqvals. Once again, we'll be looking at this data over

Before you blindly copy and paste the above and run it, let’s go over everything that we’re looking at.

createdb

createdb is a command line tool that is installed with your PostgreSQL installation. This is one way to create a new database using PostgreSQL. If you supply the -e command line option (createdb -U jeremiah -e pagila), you can see the commands that are actually being run against PostgreSQL. In this case it’s as simple as CREATE DATABASE pagila;. We need to tell PostgreSQL which user we’re using to connect because only certain users have the appropriate rights to create new databases. This isn’t something that just anyone can do, after all.

psql

psql is a command line tool that is used to connect to your PostgreSQL databases. psql has a number of options, but I’m only going to cover the options that we’re directly using to set things up for the pagila database.

-U

The -U option is used to specify the user name to use to connect to the database instead of the default user. Unless you’ve changed something during installation, the default user is the postgres admin account.

-f [filename]

psql can operate in two modes: an interactive mode and a file input mode. Passing the -f option to psql will use the commands in the file filename. You could use the input redirection operator (<), but using -f has the advantage of processing the file as a single batch and subsequently providing helpful line numbers in the event of an error.

The database

The last parameter we’re passing is the name of the database. If you want to connect to a database on a different host, you would use the -h or --hostname command line flag to change the database host.

createdb syntax
CREATE DATABASE
psql

Some Minutiae About PostgreSQL Databases

PostgreSQL, like many other databases, has a vast set of features. Two of these features, collations and tablespaces, aren’t immediately important for understanding what we’re doing here, but it’s a good thing to know about them for when you get into more advanced scenarios.

Collations and Locales

When you create a new database you can also specify a collation for the database. Collations are used to determine how text is stored and sorted within PostgreSQL. You can set the collation when PostgreSQL is first configured, during database creation, or when connecting to the database. There are a few gotchas that aren’t important here, but you can read about them in the PostgreSQL documentation on Local Support.

Tablespaces

You can also specify the default tablespace when you create a new database. A tablespace is the place where database objects are stored on disk. If you’re familiar with SQL Server, you will know this as a filegroup. This is a more advanced feature and can be used to achieve incredible performance benefits by spreading physical database objects across multiple physical disks. Like collation, we’re not going to worry about tablespaces for now, but it is important to know that the feature is available.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.