Category Database

Three Things to Watch with NoSQL

Last week I painted a rosy picture of NoSQL databases. Before you deploy any kind of NoSQL database, you need to be aware of the potential pitfalls of NoSQL databases.

1) Architecture

NoSQL databases have different architectural concerns than traditional RDMBSes.

If you’re wokring with a BigTable or Dynamo database, your network backbone will need to be able to handle the increased traffic from replication and materializing MapReduce data. Dynamo databases are very chatty – reads and writes are only guaranteed when a specific number of servers respond to a query.

MapReduce queries run across many nodes in a cluster. One stage processing a MapReduce query is temporarily materializing intermediate results from each node. The intermediate results may then be shipped to another server for additional processing, aggregation, or just be combined into the final result set.

2) Querying Strategy

RDBMSes are able to perform a mix of batch and ad hoc querying. They produce reasonably fast results for doing online aggregation and analysis of data. It’s just as easy to write a query that returns a single row from a relational database as it is to write a query that aggregates data from several thousand rows.

Not all NoSQL databases are created equally. Hadoop and Cassandra are built to perform large scale analytical queries. Their implementation of MapReduce is designed to process batches of analytical results. Frequently these results are going to be pre-aggregated rather than delivered in real time. Riak and MongoDB, on the other hand, are designed to handle interactive querying.

3) Data Modeling

Data modeling is a tricky field of study in the world of relational databases. A huge number of books have been written on the subject and they deal with both the theoretical and practical aspects of modeling data. In short, while data modeling is tricky it is also a well understood field of study.

Data modeling in the non-relational world is an entirely new field. Many problems that have been solved in relational data modeling are new and require a different way of thinking. BigTable databases, while their structure may initially appear familiar, require different modeling techniques to get the most out of the data store. MongoDB’s document oriented approach makes a great deal of sense to developers, but it still poses a number of data modeling questions about how to link related objects and how deeply nested object graphs should be in the database. Key/Value stores, like Riak, have very little structure and give the application developer a huge amount of freedom to model data.

Summation & Further Reading

NoSQL Databases hold a lot of promise to help businesses rapidly respond to data growth. Whenever you’re planning to make a major infrastructure decision, you need to weigh the pros and cons before determining which solution is right for you. While NoSQL databases provide answers to a lot of questions, they also pose a different set of questions that require a different set of skills to answer.

NoSQL Glossary
MongoDB – Sharding with the Fishes
Hadoop for Batch Processing (from the Yahoo! Developer Network)
Schema Design in Riak
Schema Design for Riak (presentation slides)

Five Reasons to Use NoSQL

You’re not using a NoSQL database right now? I’m not all that surprised. Many IT shops are still evaluating moving from SQL Server 2000 to SQL Server 2005, much less these non-relational databases. A lot of people don’t even know what they would use a NoSQL database for. Does it replace the RDBMS? Work alongside it? Do something else?

1) Analytics

One reason to consider adding a NoSQL database to your corporate infrastructure is that many NoSQL databases are well suited to performing analytical queries. Developers can use the same querying languages to perform analytical queries that they’re using to perform atomic queries. Typically this will be some variation of a MapReduce query, but it’s also possible to query data using Pig or Hive. Don’t worry too much about these weird language terms, MapReduce is a fancy way of saying “SELECT and then GROUP BY” and doing it in a way that is entirely confusing to people who are used to SQL.

Many NoSQL systems boast phenomenal write performance. When you combine high write performance with batch processing it is easy to pre-aggregate data, summarize results, and still guarantee ad hoc query performance.

2) Scale

NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale by purchasing larger and faster servers and storage or by employing specialists to provide additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate queries are distributed by default. Scaling is as easy as racking a new server and executing a few commands to add the new server to the cluster (yeah, it really is that easy). Data will start flowing and you’ll back in business in no time.

3) Redundancy

In addition to rapid scaleability, NoSQL databases are also designed with redundancy in mind. These databases were designed and built at massive scales where the rarest hardware problems go from being freak events to eventualities. Hardware will fail. Rather than treat hardware failure as an exceptional event, NoSQL databases are designed to handle it. While hardware failure is still a serious concern, this concern is addressed at the architectural level of the database, rather than requiring developers, DBAs, and operations staff to build their own redundant solutions. Cassandra uses a number of heuristics to determine the likelihood of node failure. Riak takes a different approach and can survive network partitioning (when one or more nodes in a cluster become isolated) and repair itself.

4) Flexibility

What’s the use of a database if it’s not flexible? While the data modeling issues are completely different in NoSQL, there is a large amount of flexibility in how data is stored for performance.

Databases modeled like Bigtable and Cassandra provide flexibility around how data is stored on disk. It’s possible to create derived column families. In plain English: you can design your database to duplicate frequently accessed data for rapid query response. This is, of course, based on the assumption that writes and storage space are cheap.

Databases based on the Bigtable model also have another benefit – outside of key structure it’s possible to store a variety of disparate data in the same table. Structure is largely irrelevant. Relational databases have adopted features to solve similar problems (such as sparse columns in SQL Server), but they carry overhead. Storing wildly different columns in multiple rows of the same column family is so cheap as to be invisible in a NoSQL database.

Lastly, key-value stores provide an incredible level of flexibility. Data is arbitrarily stored as a value. Key-value databases make it possible to store images, word documents, strings, integers, and serialized objects within the same database. This requires more responsibility and creative thinking on the part of application developers and architects but it also lets the people designing the system build custom a completely custom solution that fills their needs.

5) Rapid Development

Let’s face facts: everyone wants their application to be faster, have more features, and they want it yesterday. NoSQL databases make it easy to change how data is stored or change the queries you’re running. Massive changes to data can be accomplished with simple refactoring and batch processing rather than complex migration scripts and outages and it’s even easier to take nodes in a cluster offline for changes and add them back into a cluster as the new master server – replication features will take care of syncing up data and propagating the new data design out to the other servers in a cluster.

Moving Away From Relational Storage – Introduction

Don’t. Fooled you, didn’t I?

If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it.

There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like re-writing your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to make use of large query result caching in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start new you should ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up, it’s how we learn.

Here are the questions I’ve started asking myself when I start a new project/feature that needs additional data storage:

  • Why are we storing this data?
  • How much data will I collect in a week? A month? A year?
  • How long will this data need to live?
  • How will this data be used?
  • How structured does this data need to be?
  • How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will I write this data? How often will I read my data?

Different types of databases have different use case profiles. The way that you’re using the data will make a big difference for how you’re going to store data. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you’re storing the data. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database, why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed you’re collecting data. The faster you need to collect data, the more you need to look at how you’re storing that data.

Every database engine employs different strategies to maximize I/O throughput. The problem here is that they all use different strategies based on their use case.

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in memory storage solution (like memcached or AppFabric or even riak’s riakkvcache_backend) to solve this problem. Disks are slow, memory is fast.

If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up the ability to write data but it does bring up some data life concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures.

Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats but there may be a variety of complications as a result of the general nature of SQL and the relational model. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs.

Considering how the structure of the data can make it very easy to make a decision on the data store that you’ll be using. If a lot of flexibility is required, it may even be possible that a Big Table derived database, such as Cassandra or HBase might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself about how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought back online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately.

The immediately availability of data after a write as well as the long term (fault tolerance) availability of data is an important aspect of choosing a data storage solution.

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprised you.

In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.

Further Reading

NoSQL Summer
NoSQL Heroku and You
What is Big Data?
Big Data Is Less About Size, And More About Freedom
NoSQL: The Dawn of Polyglot Persistence

Finding SQL Agent Jobs Owned By The Wrong Person

Ever have someone leave the company only to find out that they own critical database processes… because that process failed? It hasn’t happened to me, yet, but a recent Active Directory outage this weekend got me thinking about it.

PostgreSQL Tutorial – Inserting Data

In the last two articles, we talked about creating your first PostgreSQL database, the syntax of createdb and psql, and creating tables in the database. Next up, we’re going to look at filling those tables with data.

createdb -U jeremiah pagila
psql -U jeremiah -f pagila-schema.sql pagila
psql -U jeremiah -f pagila-insert-data.sql pagila

Loading Data From A File

We’re going to be looking at the pagila-insert-data.sql file. There is a separate file, pagila-data.sql, but we will not be using it. Both files accomplish the same purpose, but pagila-insert-data.sql uses standard SQL syntax whereas pagila-data.sql uses the COPY command do the bulk of the work. Since our goal is to familiarize ourselves with basic SQL, we’re going to use pagila-insert-data.sql.

Preparation is Key

The first thing you will notice is that there are close to 100 lines of options and comments (comments are start with -- and end at the end of the current line) before we get to the actual meat of the file. This is because we need to set up common settings before we start working. Let’s take a look at these options line by line.

SET client_encoding = 'UTF8';

Strings will be treated as Unicode strings. This makes sure that characters from non-Latin alphabets can easily be inserted into the database.

SET standard_conforming_strings = off;

The standard_conforming_strings setting ensures some backwards compatibility with PostgreSQL behavior – we can use ‘' as an escape character in strings. When this is set to off, PostgreSQL will use the ‘' as an escape character. The goal is to, some day, default this setting to on.

SET check_function_bodies = false;

We are, in effect, recovering from a database dump. By setting check_function_bodies to false we tell PostgreSQL that we want to put off a lot of checks until later because things might not be completely set up at the moment and we’ll accept the risks of doing this, thanks. I would never recommend using this setting in a production environment unless you’re recovering from a data dump. If that’s the case, knock yourself out.

SET client_min_messages = warning;

This controls the amount of messages that we are going to receive from PostgreSQL. Leaving this alone is a good thing but it’s important to know that you can mute all messages from the server if you so desire.

SET escape_string_warning = off;

The escape_string_warning setting actually goes along with the standard_conforming_strings setting and has to do with how PostgreSQL will behave when it encounters a ‘' character rampaging about in our code.

SET search_path = public, pg_catalog;

There’s a history in the PostgreSQL world of omitting the schema qualification from table names in queries. So, what we’re doing here, is telling PostgreSQL that we would like to look for table first in the public database schema and then, second, in the pg_catalog schema. Don’t worry if you don’t know what schemas are all about, I’ll explain them later. Right now you can think of schemas as if they were folders for holding tables, functions, and the like. The search_path variable works just like the PATH variable on your operating system – it’s where PostgreSQL is going to look for functionality.

The Anatomy of an Insert

I wish that I could tell you that there is something complex going on with the pagila inserts. But, unfortunately, they’re about as boring as an insert could possibly be:

INSERT INTO  -- PostgreSQL needs to listen up, we have
             -- data coming in.
actor        -- We're adding data to the actor table,
             -- specifically into the following columns
(actor_id, first_name, last_name, last_update)
VALUES       -- and here is a list of the values that
             -- we are going to insert.
(1, 'PENELOPE', 'GUINESS', '2006-02-15 09:34:33');

That’s all there is to a basic insert statement – we tell PostgreSQL that we want to add a new row to the database in a specific table and then supply a list of columns and values that we would like to add. Similar code happens about several hundred more times in that file.

This is made slightly more interesting by ALTER TABLE actor DISABLE TRIGGER ALL;. This code ensures that any and all triggers on the table actor will not fire while we’re inserting our rows. Why would we want to bypass all of the triggers? Well, we don’t want any extra validation to run, nor do we want a trigger to run that will populate the primary key column of our tables with the value from a sequence – we already have those values and our insert scripts would be much more complicated if they were filled with lookup queries too.

Adding a Brand New Record

Let’s go ahead and add a new movie. Pagila is a movie rental store and we’re not going to be able to keep up with the competition without adding a new movie. Keeping with Pagila’s standard of using nothing but gibberish movie tiles, we’re going to be adding the movie “Poseidon Sundance.”

Of course, creating the movie is not as easy as simply adding the title and walking away. People need to be able to make sure that the movie is acceptable for their tastes, has their favorite actor, or maybe just find out if the film is kid friendly.

INSERT INTO film
-- Here we specify the table where we're going to add the data,
--  and then list the columns where we're going to actually
--  add the data. You don't need to list out all of the
--  columns.
-- P.S. Lines that start with two lines like this are comments.
--  The PostgreSQL command parser will ignore them.
(
  title,
  description,
  release_year,
  language_id,
  original_language_id,
  rental_duration,
  rental_rate,
  length,
  replacement_cost,
  rating
)
VALUES
(
  'POSEIDON SUNDANCE',
  'A thrilling story of a dog and a moose who have to journey across Finland to be reunited with their owners',
  '2006',
  1, -- language_id, this is English
  6, -- original_language_id, this is German
  3, -- rental_duration
  4.99, -- rental_rate
  114, -- length in minutes
  24.99,
  'G'::mpaa_rating
) ;

SELECT * FROM film WHERE title = 'POSEIDON SUNDANCE' ;

-- Ee need to put the film into categories so people can find it.
INSERT INTO film_category
(
  film_id,
  category_id
)
SELECT  film_id ,
        9 -- It's a  foreign film
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;
/* This is also a comment. It's a multi-line comment, just like in
   C++, C#, or Java.

   We're combining an INSERT with a SELECT because we need to find out
   the film_id of the movie we just added. There are occasions where
   we could take a different route, but we'll cover those later.
 */

INSERT INTO film_category
(
  film_id,
  category_id
)
SELECT  film_id ,
        8 -- films about animals are almost always family movies
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

-- Add the main actors
INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  71 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  123 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  97 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  49 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

INSERT INTO film_actor
(
  actor_id,
  film_id
)
SELECT  3 ,
        film_id
FROM    film
WHERE   title = 'POSEIDON SUNDANCE' ;

Once we’ve created our movie, let’s try to find it in the database. We can do this by writing a simple query. Don’t worry if you don’t understand everything that’s going on, everything is going to be explained in good time.

SELECT  *
FROM    film f
        JOIN film_actor fa ON f.film_id = fa.film_id
WHERE   f.title = 'POSEIDON SUNDANCE' ;

If you’ve done everything right, this should return five rows – one for each actor in the movie.

SELECT  *
FROM    film f
        JOIN film_category fc ON f.film_id = fc.film_id
WHERE   f.title = 'POSEIDON SUNDANCE' ;

And this should only return two rows – one for every category the movie is in.

A Flaw in the Plan

When I was originally writing this tutorial, I didn’t pay any attention to some of the views in the database that use the film table (views are magical virtual tables that we’ll talk about later). I was getting ready to test things and I discovered a big flaw in and I thought I would highlight it here.

SELECT  *
FROM    film_list fl
WHERE   fl.title = 'POSEIDON SUNDANCE' ;

This will return two rows for you. Why?

Well, it turns out whoever wrote the film_list view assumed that each movie will only belong to one category even though it can feasibly belong to as many as we want. There are a few places that we could fix this error. Before reading on see if you can think of how we could fix this.

Fixing the Flaw

There are multiple places we could fix this. The first place would be in the definition of the view. We could change the view to display all of the categories in a comma separated list, just like we’re doing for all of the actors in the film.

The next place that we can fix this is in the data itself. Apart from deleting the extra row, we can do one of two things. We can create a uniqueness requirement (called a unique constraint) on the film_category table so that only one film can be in the table at a time. This is a limiting feature of the application. My preference would be to create a unique constraint on both the film_id and category_id columns. That ensures that we will never have a duplicated row for a film and category combination, but we can still add multiple categories. Combining this with changes to the view fixes this flaw in our database design.

We’ll cover this issue later.

Summary

So, there you have it – inserts are very simple statements. We can insert entirely new data that comes from somewhere outside of the database, we can insert data that’s a mash up of other information in the database, or we can do something in between.

What’s important to take away from this is that it’s very easy to insert information into PostgreSQL. We tell PostgreSQL that we’re going to insert data into a table, list the columns that we’re going to fill with data, and then define the data that we’re going to insert.

MongoDB – Now With Sharding

Version 1.6 of MongoDB was released last week while I was at devLink. What makes this an important point release? Two things: sharding and replica sets.

Sharding is a way of partitioning data. Data will be sharded by a key – this could be based on zip code, customer number, SKU, or any other aspect of the data. Sharding data in MongoDB occurs on a collection by collection basis. If you want related data to be on the same server you’ll need to pick a sharding key that is present in every collection. However, this design decision make sense because some collections may grow faster than others. The point of sharding is, in part, to spread load across multiple servers.

What makes this important? Previously, sharding has been difficult to set up and administer and required custom application code to be written and maintained. More important, though, sharding gives MongoDB the ability to scale across multiple servers with minimal effort.

What would happen if a single node in a set of sharded servers got very busy? Well, MongoDB would detect that one of the nodes is growing faster than the others and it would start balancing the load across the other servers. This might seem like it would violate my earlier statement about how we set MongoDB up to using a sharding key that we define. Here’s the catch: MongoDB only uses that sharding key when we set things up and when there are no problems. If things start getting busy, it will make changes to the sharding key. Those changes get reported throughout the entire cluster of servers and everyone knows where their data is, although nobody outside of the cluster really needs to care.

Replica Sets

Replica sets are a new and improved way to perform data replication in MongoDB. Basically, we set up replication in a cluster of servers. If any single server fails, another server in the replica set will pick up the load. Once we’re able to get the dead server back up and running, the replica set will automatically start up a recovery process and our users will never know that there was an outage.

There can be only one master server at any given time, so this protects us from master server failures. Through the magic of network heartbeats, we can be aware of all of the servers in the replica set. Interestingly, the master server is determined by a priority setting that is assigned to each server. This way, we could use older hardware to serve as a backup (or read-only server) to the master and use faster machines in the replica set to take over from the master in the event of any kind of hardware failure.

How It Works

MongoDB Sharding Diagram

MongoDB Sharding Diagram

Basically, here’s what happens (if you want more details, please see the Sharding Introduction):

  1. The mongos server is a router that makes our complicated MongoDB set up look like a single server to the application.
  2. The mongod config servers maintain the shards. They know where data is stored and will attempt to balance the shards if any single node gets out of whack.
  3. Replica sets provide localized redundancy for each shard key.

Gotchas

There are a few things to be aware when you’re considering sharding with MongoDB:

  1. If a configuration server goes down, you can no longer reallocate data if any shards become write hot spots. This meta-data must be writeable for data to be repartitioned. You can still read and write data, but load will not be distributed.
  2. Choose sharding keys wisely. An overly broad sharding key will do you no good: all data can end up on one node and you will be unable to split the data onto multiple nodes.
  3. Some queries will use multiple shards – make sure you understand data distribution, querying patterns, and potential sharding keys.

Photo Credits

glass litter by psyberartist – Creative Commons Licensed
I thought I saw a puddy cat… by Keven Law – Creative Commons Licensed

Kendra Little at CBusPASS

Hey, good news! Kendra Little (blog | twitter) will be presenting for us, remotely, on August 12th at 6:30 PM. The meeting is at the usual place, Battelle for Kids – 1160 Dublin Rd Suite 500, Columbus, OH 43215.

What is Kendra going to be talking about?

Stay Agile, Stay Sane

Agile software development emphasizes continuous depolyment and its methods do not directly include long term planning. DBAs must ensure data integrity and have a long term view for application scale, so Agile methods present challenges. Come learn about successful real-world practices iteratively developed in a high transaction internet service environment over the last five years. We have create a flourishing Agile shop while meeting high requirements for uptime, customer response, and data consistency. In this session we’ll cover key habits for success, practices to avoid, how and when to get started, and why Agile development can be a great thing for DBAs. Topics will also include how “Testing in Production” can be a huge benefit.

About Kendra

Kendra Little is a Senior DBA in the online advertising industry who has spent ten years nerding out on SQL Server. Kendra works closely with an Agile development team to deploy frequent incremental changes to scale and improve a busy production environment. She likes tuning production servers, developing tools to automate tasks, building SQL Reports for trending, and secretly enjoys writing troubleshooting guides and documentation. Kendra has a Masters degree in Philosophy and a salt shaker full of certifications, but all the best stuff she’s ever learned has come from her smarty-pants colleagues and the SQL Server community. Read her blog at http://littlekendra.com.

Object-Oriented Programming Concepts For DBAs

This all came about when Aaron Nelson (blog | twitter) asked me a few questions about object-oriented concepts. I did the best that I could to answer his questions, but I decided that this needed a real answer that will, I’m hoping, help to bridge the language gap between DBAs and developers.

Note to the reader: Please keep in mind that this is based on the .NET framework and specifically my experience with C#. If you want to highlight differences for your own language of choice in the comments, feel free.

Class

A class is a definition that we’ll be using later when we create objects. It’s a generic blueprint for building something else. It describes all of the properties that an object will have once we create one. In addition, classes also describe all of the behaviors (called methods) that an object will have.

DBAs, you can think of a class as similar to DDL. Just as DDL defines a the rows of a table, a class defines the objects that we’ll create later.

Object

“An object can be said to be an instance of a class.” WTF does that mean?

Well, an object is an instance of a class. That means that any time we create an object, we base if off of some kind of template. The object, whatever it may be, is just an implementation based off of some kind of blueprint. In the .NET world, an object can only be defined by one class.

Once we create an object (using something called a constructor), it’s different from the class. Like a zombie, the object takes on a life of its own. Let’s take a look at an example.

Cars have common properties – they all have four wheels, an engine, and a number of doors that is greater than 0. Individual cars have a lot of differences: different paint and upholstery colors, different trim levels, stereos, engines, transmissions, etc. When you go down to the car dealership (the constructor), you don’t just ask for a generic car. You have to be more specific and tell the dealer that you want a bright red 1978 Lincoln Continental with white wall tires. The dealer will get you the car (the constructor instantiates the object) and you can pimp in style.

Depending on how things are set up, we can change the properties of our object after we’ve created it. Properties can be modified with new values, or we can call methods on the object to make complicated changes to the object’s state.

In a round about way, an object is similar to a row. It’s a single instance of data. This is a bit of a crappy metaphor because a single object can actually contain deep structure that would be difficult to represent in a single row in the database.

Fields

A field is a basic variable – a string or number. It’s the simplest way to store data in an object. You can think of it as roughly akin to a column. You may also hear people refer to fields as member variables, instance variables, or any of a number of terms.

Properties

In the .NET world properties are wrappers around fields. They aren’t quite methods, but they aren’t quite fields. One of the interesting things about properties is that they let us enforce rules or build complex ways to describe an object from a set of simple data points.

public class Person {
  // other nonsense goes here...

  // some fields
  private string _firstName;
  private string _lastName;

  public string FirstName {
    get { return _firstName; }
    set { _firstName = value; }
  }

  public string LastName {
    get { return _lastName; }
    set { _lastName = value; }
  }

  public readonly string FullName {
    get { return _firstName + " " + _lastName; }
  }
}

In the real world, we’d do things far more complex than concatenating first name and last name to make a full name. We’d probably use a comma and put things in the order of last name first. Or maybe we’d calculate order price based on county sales tax and order line items and shipping fees.

An astute reader might notice that properties bear an uncanny similarity to constraints in a database. Let’s look at a different example. A class for a clock might look like this:


public class Clock {
  private int _hours;
  public int Hours {
    get { return _hours; }
    set {
      if (value >= 0 && value <= 23) {
        _hours = value;
      } else {
        throw NotOnPlanetEarthException();
      }
    }
  }
}

And a similar table would look like this:


CREATE TABLE clock (
  [hours] INT CHECK (h BETWEEN 0 AND 23)
);

It’s possible to create deeply nested structures of fields and properties. You can, for example, model a store thusly:


public class Store {
  private List _employees;
  private List _products;

  // I know this isn't how I should really model it. Shut up.
}

And this is all well and good in code. And there is a way to model this in the database that should seem relatively obvious using join tables (employees can work in more than one store and products can be sold in more than one store).


CREATE TABLE stores (
  StoreId INT IDENTITY(1,1)
  -- real implementation left out because I don't want to write it
);

CREATE TABLE Employees (
  EmployeeId INT IDENTITY(1,1)
);

CREATE TABLE Products (
  ProductId INT IDENTITY(1,1)
);

CREATE TABLE StoreEmployees (
  StoreId INT,
  EmployeeId INT
);

CREATE TABLE StoreProducts (
  StoreId INT,
  ProductId INT
);

The problem is that these two models don’t work well together. We call this object-relational impedance mismatch. I wrote about a lot of the problems in an older blog post, O/R-Ms: Panacea or Polio Braces?, so I won’t get into it here. Needless to say, there are some major problems that can arise.

Methods

Methods are the workhorses of object-oriented programming. These are the verbs of programming. A method is a bunch of statements that are executed in order and achieve some result. Methods are roughly analogous to stored procedures or functions (depending on your database).

Inheritance

The programming world would be incredibly painful if we had to keep repeating code every time we want to do the same thing. Inheritance helps us solve this problem, in some cases.

All cars have similar properties – they have engines, doors, wheels, seats, and various other car things. Some cars have different traits – a 1967 Plymouth Fury Station Wagon drives a lot differently than an 2010 Aston Martin DB9.

public class Car {
  // details go here
}

public class StationWagon : Car {
  // more details
}

public class SportsCar : Car {
  // very fast details
}

Inheritance is a way to work with the same root data and behaviors but provide additional specialization. Our SportsCar class might provide a SpinTires or BurnOut method whereas the StationWagon is not capable of performing a burn out.

Another way to think about it is to say that inheritance (and also interfaces) is a way of describing “is a” relationships. A SportsCar is a Car. Anywhere that we can use a Car, we can also use anything else that is a Car. So, we can use a StationWagon in any method that expects a car. But, since a StationWagon isn’t a SportsCar, it won’t work there. See how that works?

If this seems like a huge set of abstractions, you’re absolutely right. A lot of the concepts of object-oriented programming can be implemented using purely procedural code. It’s considerably easier to express some concepts using object-oriented code. There is some performance overhead, but many developers consider that the performance penalty is made up by the ease of development.

Interfaces

The duckroll fully implements IWheeledVehicle

Interfaces are where things get a little bit tricky for some people. An interface is a contract. When we say that a class implements an interface, we’re really saying that the class has all of the methods defined in the interface. An interface is an abstract type that can’t be directly created. Interestingly enough, though, we can create classes that implement an interface and then use the class anywhere that the interface is accepted.

If we create an IWheeledVehicle interface (interfaces in the .NET world are typically named with an I), we can define methods that accept an IWheeledVehicle parameter. We can use any object that provides an implementation of IWheeledVehicle with that method. Even if that object is a duckroll.

In Summation

Object-oriented programming is a strange concept when you first encounter it, but it provides useful abstractions that make it easier for developers to work with data. Understanding how developers might be consuming data is going to make it easier for you to work with them and create amazing applications.

If you’re going to be doing a lot of work with developers for any length of time, it would potentially be a good idea to take a look at a book or two on the language they’re using. You don’t need to become proficient, but you should understand the concepts they’re working with.

Special thanks to Matt Nowack (blog | twitter) and Mike Peschka (blog | twitter) for reviewing this and providing feedback. Also, thanks to Aaron Nelson (blog | twitter) for the original questions that sparked this blog post.

PostgreSQL Tutorial – Referring to Other Tables

Referring to Other Tables

In the previous article you learned how to create a single table. One table does not make for a very interesting database. Since we looked at how the actor table was created, let’s take a look at the most logical follow up: the film table.

The film table

Without going into all of the details of the film table, we do know that the table looks something like this:

CREATE TABLE public.film  (
    film_id                 int4 NOT NULL DEFAULT nextval('film_film_id_seq'::regclass),
    title                   varchar(255) NOT NULL,
    description             text NULL,
    release_year            year NULL,
    language_id             int2 NOT NULL,
    original_language_id    int2 NULL,
    rental_duration         int2 NOT NULL DEFAULT 3,
    rental_rate             numeric(4,2) NOT NULL DEFAULT 4.99,
    length                  int2 NULL,
    replacement_cost        numeric(5,2) NOT NULL DEFAULT 19.99,
    rating                  mpaa_rating NULL DEFAULT 'G'::mpaa_rating,
    last_update             timestamp NOT NULL DEFAULT now(),
    special_features        text NULL,
    fulltext                tsvector NOT NULL,
    PRIMARY KEY(film_id)
)

There are a few features that will be covered later – like enumerations and full text searching – but this is a pretty standard table. The one thing that’s missing, though, is a way to reference the actors in each film. We could do this by adding a number of columns (actor1_id, actor2_id, etc.) but that would become cumbersome and is not an effective way to model data.

The alternative is to create what is commonly called a cross reference table.

Cross Reference Tables

A cross reference table is a simple way to create a cross reference between two objects that have a many-to-many relationship. This is typically done using a two column table – one column for the primary key of each table:

CREATE TABLE public.film_actor  (
    actor_id    int4 NOT NULL,
    film_id     int4 NOT NULL,
    last_update timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY(actor_id,film_id)
)

Here’s the problem: as it stands, we could insert any values into this table. The whole point of the film_actor table is to map valid actors with valid films. Without any restrictions in place, we can’t make sure that any actor or film id in our table isn’t changed to an invalid value.

Enter Referential Integrity

Referential integrity is a way of saying that every value of a column in one table must exist as a value in a column in another table. In other words: every value in the referenced column or columns must exist in the referenced table. PostgreSQL makes it easy for us to enforce referential integrity through the use of foreign keys.

In order to make sure that we only have valid actors in the film_actor table, we’ll create a foreign key:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_actor_id_fkey
FOREIGN KEY (actor_id)
REFERENCES actor(actor_id)
ON UPDATE CASCADE
ON DELETE RESTRICT

That’s a very shorthand way of saying

The table "public.film_actor"
has a foreign key constraint named "film_actor_actor_id"
on the column "actor_id"
which references the column "actor_id" in the table "actor"
  so that every value of "actor_id" in "public.film_actor"
  must also be present in the "actor_id" column
  of the table "public.actor"
if the value of "actor_id" is changed in the "public.actor" table
  we should update all values in "public.fim_actor" that are the same
and if the value of "actor_id" is deleted from the "public.actor" table
  we should prevent that delete from happening if there are rows
  in the "public.film_actor" table with the same value

We’ll create a similar foreign key to enforce referential integrity to the film table:

ALTER TABLE public.film_actor
ADD CONSTRAINT film_actor_film_id_fkey
FOREIGN KEY (film_id)
REFERENCES film(film_id)
ON UPDATE CASCADE
ON DELETE RESTRICT

This foreign key is almost identical to the first one. The only difference between the two are the names of the table being referenced. If you’ve taken a look at the other tables in the database, or if you’ve looked back at the previous article about creating the schema, you may have noticed something else: all of our foreign keys reference the primary key of another table.

Foreign keys must refer to a unique value, this prevents us from creating a foreign key that points to any column. However, since a primary key requires every value in that column be unique, it uniquely identifies the row, we can safely create a foreign key that points to the primary key column.

What if we want to create a foreign key that points to a different column? We can do that too. PostgreSQL has a feature called a unique constraint. This is an internal database feature that makes it possible to ensure that all values in a column, or set of columns, are unique. A unique constraint is actually implemented as a unique index in PostgreSQL, just as it is in many databases.

CREATE UNIQUE INDEX idx_unq_manager_staff_id
ON public.store(manager_staff_id)

This index wouldn’t be a good candidate for a foreign key, but it does illustrate how you can create a unique index that you can reference with a foreign key.

References

Table Design Patterns: Cross-Reference Validation

NoSQL Summer Reading List

For those of you who aren’t as much into reading up on different types of database, there’s an interesting summer reading list going on right now over at A NoSQL Summer. Unfortunately, I’m not lucky enough to live in a town with a NoSQL Summer group (not that I know of, at least) and I’ve had too much on my plate to start one up. But I still wanted to read all of the papers. What’s a poor guy to do?

Instead of navigating a bunch of web pages and downloading some PDFs, I decided to automate the process and write a tiny program to do it for me. I turned to my favorite rapid fire language, Ruby, and fired off a quick script to parse the web pages and get me the content that I was looking for.


#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'net/http'

# path to the target directory, you'll probably want to change this...
# unless your account is named 'jeremiah'
base_folder = "/Users/jeremiah/Desktop/NoSQL"

# open up the list of papers
doc = open('http://nosqlsummer.org/papers') { |f| Hpricot(f) }

# find all of the links to each paper and loop through them
doc.search("//div[@class='o-papers on']/a").each do |link|
  # ignore the closing tags.
  # there's probably a better way to do this,
  # but I wrote this in 15 minutes at 11:30 at night
  next unless link.is_a? Hpricot::Elem

  paper_doc = open("http://nosqlsummer.org/#{link.attributes['href']}") { |f| Hpricot(f) }

  # get the necessary elements to build our document name for saving
  difficulty = paper_doc.at("h4[@class*='difficulty']")['class'][-1,1]
  title = (paper_doc/"div[@class='o-paper on']/h1").inner_text
  download_link = paper_doc.at("a[@class='download']")['href']

  begin
    # try to save
    puts "Attempting to download #{title} from #{download_link}..."
    write_out = open("#{base_folder}/#{difficulty}_#{title}.pdf", "wb")
    write_out.write(open(download_link).read)
    write_out.close
  rescue Exception
    puts "  *** v^v^v^ error ^v^v^v ***"
  end
end

This script very neatly downloads everything to the directory of your choosing (change the directory name). It also thoughtfully names the files with their difficulty rating as the first character so you can sort them ASCII-betically and make a halfway decent list to help your learn your way into NoSQL nerdery.

There’s only one problem. One of the papers, the graph traversal paper, won’t download for some reason. The ACM server returns an HTTP access denied error code. To get around this you can either download it with your browser, or you can go ahead and use the copy that I’ve provided – The Graph Traversal Pattern.

Enjoy!

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.