Category NoSQL

Cassandra Tutorial – Installing Cassandra

I was feeling a little bit crazy the other day and I installed Cassandra on my home computer. Twice. Just because. It wasn’t really a “just because” moment. I installed Cassandra twice so I could get the hang of it in case anyone asked me how they could go ahead and get started. The process was relatively painless, but there are some prerequisites that you need to have installed before you can get started. I thought that it would be good to create this short Cassandra installation tutorial to help you get started.

This video walks through everything you need to do get Cassandra installed and running on your computer. Apart from installing the OS. If you don’t have an OS, I don’t know how you’re reading this, probably witchcraft.

[media id=3 width=800 height=600]

Three Things to Watch with NoSQL

Last week I painted a rosy picture of NoSQL databases. Before you deploy any kind of NoSQL database, you need to be aware of the potential pitfalls of NoSQL databases.

1) Architecture

NoSQL databases have different architectural concerns than traditional RDMBSes.

If you’re wokring with a BigTable or Dynamo database, your network backbone will need to be able to handle the increased traffic from replication and materializing MapReduce data. Dynamo databases are very chatty – reads and writes are only guaranteed when a specific number of servers respond to a query.

MapReduce queries run across many nodes in a cluster. One stage processing a MapReduce query is temporarily materializing intermediate results from each node. The intermediate results may then be shipped to another server for additional processing, aggregation, or just be combined into the final result set.

2) Querying Strategy

RDBMSes are able to perform a mix of batch and ad hoc querying. They produce reasonably fast results for doing online aggregation and analysis of data. It’s just as easy to write a query that returns a single row from a relational database as it is to write a query that aggregates data from several thousand rows.

Not all NoSQL databases are created equally. Hadoop and Cassandra are built to perform large scale analytical queries. Their implementation of MapReduce is designed to process batches of analytical results. Frequently these results are going to be pre-aggregated rather than delivered in real time. Riak and MongoDB, on the other hand, are designed to handle interactive querying.

3) Data Modeling

Data modeling is a tricky field of study in the world of relational databases. A huge number of books have been written on the subject and they deal with both the theoretical and practical aspects of modeling data. In short, while data modeling is tricky it is also a well understood field of study.

Data modeling in the non-relational world is an entirely new field. Many problems that have been solved in relational data modeling are new and require a different way of thinking. BigTable databases, while their structure may initially appear familiar, require different modeling techniques to get the most out of the data store. MongoDB’s document oriented approach makes a great deal of sense to developers, but it still poses a number of data modeling questions about how to link related objects and how deeply nested object graphs should be in the database. Key/Value stores, like Riak, have very little structure and give the application developer a huge amount of freedom to model data.

Summation & Further Reading

NoSQL Databases hold a lot of promise to help businesses rapidly respond to data growth. Whenever you’re planning to make a major infrastructure decision, you need to weigh the pros and cons before determining which solution is right for you. While NoSQL databases provide answers to a lot of questions, they also pose a different set of questions that require a different set of skills to answer.

NoSQL Glossary
MongoDB – Sharding with the Fishes
Hadoop for Batch Processing (from the Yahoo! Developer Network)
Schema Design in Riak
Schema Design for Riak (presentation slides)

Five Reasons to Use NoSQL

You’re not using a NoSQL database right now? I’m not all that surprised. Many IT shops are still evaluating moving from SQL Server 2000 to SQL Server 2005, much less these non-relational databases. A lot of people don’t even know what they would use a NoSQL database for. Does it replace the RDBMS? Work alongside it? Do something else?

1) Analytics

One reason to consider adding a NoSQL database to your corporate infrastructure is that many NoSQL databases are well suited to performing analytical queries. Developers can use the same querying languages to perform analytical queries that they’re using to perform atomic queries. Typically this will be some variation of a MapReduce query, but it’s also possible to query data using Pig or Hive. Don’t worry too much about these weird language terms, MapReduce is a fancy way of saying “SELECT and then GROUP BY” and doing it in a way that is entirely confusing to people who are used to SQL.

Many NoSQL systems boast phenomenal write performance. When you combine high write performance with batch processing it is easy to pre-aggregate data, summarize results, and still guarantee ad hoc query performance.

2) Scale

NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale by purchasing larger and faster servers and storage or by employing specialists to provide additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate queries are distributed by default. Scaling is as easy as racking a new server and executing a few commands to add the new server to the cluster (yeah, it really is that easy). Data will start flowing and you’ll back in business in no time.

3) Redundancy

In addition to rapid scaleability, NoSQL databases are also designed with redundancy in mind. These databases were designed and built at massive scales where the rarest hardware problems go from being freak events to eventualities. Hardware will fail. Rather than treat hardware failure as an exceptional event, NoSQL databases are designed to handle it. While hardware failure is still a serious concern, this concern is addressed at the architectural level of the database, rather than requiring developers, DBAs, and operations staff to build their own redundant solutions. Cassandra uses a number of heuristics to determine the likelihood of node failure. Riak takes a different approach and can survive network partitioning (when one or more nodes in a cluster become isolated) and repair itself.

4) Flexibility

What’s the use of a database if it’s not flexible? While the data modeling issues are completely different in NoSQL, there is a large amount of flexibility in how data is stored for performance.

Databases modeled like Bigtable and Cassandra provide flexibility around how data is stored on disk. It’s possible to create derived column families. In plain English: you can design your database to duplicate frequently accessed data for rapid query response. This is, of course, based on the assumption that writes and storage space are cheap.

Databases based on the Bigtable model also have another benefit – outside of key structure it’s possible to store a variety of disparate data in the same table. Structure is largely irrelevant. Relational databases have adopted features to solve similar problems (such as sparse columns in SQL Server), but they carry overhead. Storing wildly different columns in multiple rows of the same column family is so cheap as to be invisible in a NoSQL database.

Lastly, key-value stores provide an incredible level of flexibility. Data is arbitrarily stored as a value. Key-value databases make it possible to store images, word documents, strings, integers, and serialized objects within the same database. This requires more responsibility and creative thinking on the part of application developers and architects but it also lets the people designing the system build custom a completely custom solution that fills their needs.

5) Rapid Development

Let’s face facts: everyone wants their application to be faster, have more features, and they want it yesterday. NoSQL databases make it easy to change how data is stored or change the queries you’re running. Massive changes to data can be accomplished with simple refactoring and batch processing rather than complex migration scripts and outages and it’s even easier to take nodes in a cluster offline for changes and add them back into a cluster as the new master server – replication features will take care of syncing up data and propagating the new data design out to the other servers in a cluster.

Moving Away From Relational Storage – Introduction

Don’t. Fooled you, didn’t I?

If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it.

There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like re-writing your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to make use of large query result caching in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start new you should ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up, it’s how we learn.

Here are the questions I’ve started asking myself when I start a new project/feature that needs additional data storage:

  • Why are we storing this data?
  • How much data will I collect in a week? A month? A year?
  • How long will this data need to live?
  • How will this data be used?
  • How structured does this data need to be?
  • How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will I write this data? How often will I read my data?

Different types of databases have different use case profiles. The way that you’re using the data will make a big difference for how you’re going to store data. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you’re storing the data. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database, why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed you’re collecting data. The faster you need to collect data, the more you need to look at how you’re storing that data.

Every database engine employs different strategies to maximize I/O throughput. The problem here is that they all use different strategies based on their use case.

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in memory storage solution (like memcached or AppFabric or even riak’s riakkvcache_backend) to solve this problem. Disks are slow, memory is fast.

If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up the ability to write data but it does bring up some data life concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures.

Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats but there may be a variety of complications as a result of the general nature of SQL and the relational model. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs.

Considering how the structure of the data can make it very easy to make a decision on the data store that you’ll be using. If a lot of flexibility is required, it may even be possible that a Big Table derived database, such as Cassandra or HBase might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself about how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought back online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately.

The immediately availability of data after a write as well as the long term (fault tolerance) availability of data is an important aspect of choosing a data storage solution.

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprised you.

In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.

Further Reading

NoSQL Summer
NoSQL Heroku and You
What is Big Data?
Big Data Is Less About Size, And More About Freedom
NoSQL: The Dawn of Polyglot Persistence

…And This Is Why I Love Open Source Software

From the daily Riak Recap:

10) So check this out:

At 17:40 yesterday, solidsnack asked, “Does Riak support HTTPS?”

At 22:28, benblack made it work —>
https://gist.github.com/4413fafc14fc18696643

The repo for his branch is here —> http://github.com/b/riak/tree/https

A user requested SSL support. 4 hours and 40 minutes later, it was implemented and checked in to source control. It will probably be available to the public in a short period of time.

Happy Friday!

Things I Read This Week – 2010.08.03

Just in case you want to live vicariously through my reading choices, here are some of the things that I found interesting this week when I should have been doing my job.

Data

An Illustrated Guide to the PostgreSQL Buffer Cache – Just in case you were wondering how a buffer cache works in a database. Because, you know… it’s cool. For you SQL Server people out there, this is conceptually identical to how buffer pool scans work in SQL Server. I suspect this is the same everywhere and probably based on an obscure academic paper.

Inside the Optimizer: Plan Costing – The SQL Server optimizer is an interesting creature. You can’t see what it’s doing directly because it’s full of patents and secrets (and probably dragons). Paul White (blog | twitter) has been doing a phenomenal job recently of taking apart the optimizer step by step and showing how it works.

The problems with ACID and how to fix them without going NoSQL – You know all of those reasons your developers are giving you to abandon SQL Server for a NoSQL database? This guy claims to have the answer. It boils down to using stronger ACID compliance and sounds suspiciously like Postgre-XC. If you don’t want to read several lengthy academic papers (really? you don’t want to?), the basic premise is that we should use stronger ACID controls by strengthening isolation levels and making the database deterministic such that the outcome of three transactions (A, B, and C) on all nodes will be the same regardless of execution order.

Using Riak’s map/reduce for sorting – Many NoSQL databases use something similar to MapReduce instead of SQL. The magic of map/reduce isn’t always apparent (it isn’t always apparent to me, that’s for sure). This article gives a well-commented example of the code you would use to write an ORDER BY date DESC query.

Configuring Mongo Replica Sets – Kristina Chodorow covers how you would set up Replica Sets to eliminate any single point of failure in MongoDB. Good to know ;)

10 things you should know about about NoSQL databases

Code

RubyDoc.info – I haven’t really been reading this one so much as I am amazed by it. RubyDoc.info combs the Ruby source code in a few locations and dynamically generates up to the second documentation based on the current stable versions of libraries. Go go magic clouds!

TextMate’s Missing Drawer – Mac fiends, take note! I’ve always thought TextMate’s project drawer was a bit lacking. This really improves on the built in drawer and integrates it a lot better into the application. (Found via The Hidden Magic of TextMate.)

Other Stuff

The Importance of a Mentor – My good friend, and long time unknowing mentor, Jonathan Kehayias talks about the importance of having a mentor. Take some time and thank the people who have mentored you throughout your career.

MongoDB – Now With Sharding

Version 1.6 of MongoDB was released last week while I was at devLink. What makes this an important point release? Two things: sharding and replica sets.

Sharding is a way of partitioning data. Data will be sharded by a key – this could be based on zip code, customer number, SKU, or any other aspect of the data. Sharding data in MongoDB occurs on a collection by collection basis. If you want related data to be on the same server you’ll need to pick a sharding key that is present in every collection. However, this design decision make sense because some collections may grow faster than others. The point of sharding is, in part, to spread load across multiple servers.

What makes this important? Previously, sharding has been difficult to set up and administer and required custom application code to be written and maintained. More important, though, sharding gives MongoDB the ability to scale across multiple servers with minimal effort.

What would happen if a single node in a set of sharded servers got very busy? Well, MongoDB would detect that one of the nodes is growing faster than the others and it would start balancing the load across the other servers. This might seem like it would violate my earlier statement about how we set MongoDB up to using a sharding key that we define. Here’s the catch: MongoDB only uses that sharding key when we set things up and when there are no problems. If things start getting busy, it will make changes to the sharding key. Those changes get reported throughout the entire cluster of servers and everyone knows where their data is, although nobody outside of the cluster really needs to care.

Replica Sets

Replica sets are a new and improved way to perform data replication in MongoDB. Basically, we set up replication in a cluster of servers. If any single server fails, another server in the replica set will pick up the load. Once we’re able to get the dead server back up and running, the replica set will automatically start up a recovery process and our users will never know that there was an outage.

There can be only one master server at any given time, so this protects us from master server failures. Through the magic of network heartbeats, we can be aware of all of the servers in the replica set. Interestingly, the master server is determined by a priority setting that is assigned to each server. This way, we could use older hardware to serve as a backup (or read-only server) to the master and use faster machines in the replica set to take over from the master in the event of any kind of hardware failure.

How It Works

MongoDB Sharding Diagram

MongoDB Sharding Diagram

Basically, here’s what happens (if you want more details, please see the Sharding Introduction):

  1. The mongos server is a router that makes our complicated MongoDB set up look like a single server to the application.
  2. The mongod config servers maintain the shards. They know where data is stored and will attempt to balance the shards if any single node gets out of whack.
  3. Replica sets provide localized redundancy for each shard key.

Gotchas

There are a few things to be aware when you’re considering sharding with MongoDB:

  1. If a configuration server goes down, you can no longer reallocate data if any shards become write hot spots. This meta-data must be writeable for data to be repartitioned. You can still read and write data, but load will not be distributed.
  2. Choose sharding keys wisely. An overly broad sharding key will do you no good: all data can end up on one node and you will be unable to split the data onto multiple nodes.
  3. Some queries will use multiple shards – make sure you understand data distribution, querying patterns, and potential sharding keys.

Photo Credits

glass litter by psyberartist – Creative Commons Licensed
I thought I saw a puddy cat… by Keven Law – Creative Commons Licensed

NoSQL Summer Reading List

For those of you who aren’t as much into reading up on different types of database, there’s an interesting summer reading list going on right now over at A NoSQL Summer. Unfortunately, I’m not lucky enough to live in a town with a NoSQL Summer group (not that I know of, at least) and I’ve had too much on my plate to start one up. But I still wanted to read all of the papers. What’s a poor guy to do?

Instead of navigating a bunch of web pages and downloading some PDFs, I decided to automate the process and write a tiny program to do it for me. I turned to my favorite rapid fire language, Ruby, and fired off a quick script to parse the web pages and get me the content that I was looking for.


#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'net/http'

# path to the target directory, you'll probably want to change this...
# unless your account is named 'jeremiah'
base_folder = "/Users/jeremiah/Desktop/NoSQL"

# open up the list of papers
doc = open('http://nosqlsummer.org/papers') { |f| Hpricot(f) }

# find all of the links to each paper and loop through them
doc.search("//div[@class='o-papers on']/a").each do |link|
  # ignore the closing tags.
  # there's probably a better way to do this,
  # but I wrote this in 15 minutes at 11:30 at night
  next unless link.is_a? Hpricot::Elem

  paper_doc = open("http://nosqlsummer.org/#{link.attributes['href']}") { |f| Hpricot(f) }

  # get the necessary elements to build our document name for saving
  difficulty = paper_doc.at("h4[@class*='difficulty']")['class'][-1,1]
  title = (paper_doc/"div[@class='o-paper on']/h1").inner_text
  download_link = paper_doc.at("a[@class='download']")['href']

  begin
    # try to save
    puts "Attempting to download #{title} from #{download_link}..."
    write_out = open("#{base_folder}/#{difficulty}_#{title}.pdf", "wb")
    write_out.write(open(download_link).read)
    write_out.close
  rescue Exception
    puts "  *** v^v^v^ error ^v^v^v ***"
  end
end

This script very neatly downloads everything to the directory of your choosing (change the directory name). It also thoughtfully names the files with their difficulty rating as the first character so you can sort them ASCII-betically and make a halfway decent list to help your learn your way into NoSQL nerdery.

There’s only one problem. One of the papers, the graph traversal paper, won’t download for some reason. The ACM server returns an HTTP access denied error code. To get around this you can either download it with your browser, or you can go ahead and use the copy that I’ve provided – The Graph Traversal Pattern.

Enjoy!

MongoDB – Basic Querying

I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB.

[media id=2 width=558 height=410]

Download the sample code. A QuickTime/H.264/iPhone formatted video is also available – download it now.

The Future of Databases

The Story So Far

I’ve been in love with data storage since I first opened up SQL*Plus and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up with different ways to solve the problems what we’re facing in business. The problem is that as the field of computer science advances, and businesses increase in complexity, the ways that we store data must become more complex as well. Exponentially scaling storage complexity isn’t something that I like to think about, but it’s a distinct possibility.

General purpose OLTP databases are not the future of data storage and retrieval. They are a single piece in the puzzle. We’ve been working with OLTP systems for well over 20 years. OLAP is a newer entry, bringing specialized analytical tricks (which are counter intuitive to the way relational data is stored) to the masses. Hell, there are a number of general purpose analytical storage engines on the market. These general purpose analytical databases integrate well with existing databases and provide a complement to the transactional specialization of OLTP systems.

That’s the key, OLTP databases are purpose built transactional databases. They’re optimized for write operations because way back in the dark ages it was far more expensive to write data to disk than it was to read from disk. Data couldn’t be cached in memory because memory was scarce. Architectural decisions were made. The way that we design our databases is specifically designed to work within this structure. A well designed, normalized, database has minimal duplication of data. In OLTP systems this also serves to minimize the number of writes to disk when a common piece of data needs to be changed. I can remember when I was a kid and the United States Postal Service changed from using three letter state abbreviations to two letter abbreviations. I have to wonder what kind of difficulties this caused for many databases…

In the 40 years since E.F. Codd’s paper was published, the programming languages that we use have changed considerably. In 1970, COBOL was still relatively new. 1971 saw the the introduction of C, 1975 brought us MS Basic. 1979, 1980, and 1983 saw Ada, Smalltalk-80, Objective-C, and C++ ushering in a wave of object oriented languages. Suddenly programmers weren’t working on singular data points, they were working with a object that contained a collection of properties. The first ANSI SQL standard was codified in 1986. 1990 gave us Windows 3 and the desktop PC became more than a blinking cursor. The web exploded in 1996, 2001, and continues to explode again in a frenzy of drop shadows, bevels, mirror effects, and Flash.

Throughout the history of computing, we’ve been primarily working with tuples of data – attributes mapped to values; rows to you and I. This model holds up well when we’re working with a entity composed of a single tuple. What happens, though, when the entity becomes more complex? The model to retrieve and modify the entity becomes more complex as well. We can’t issue a simple update statement anymore, we have to go through more complex operations to make sure that the data is kept up to date.

Examples Should Make Things Clearer

Let’s take a look at something simple: my phone bill.

In the beginning…

Long ago, a phone bill was probably stored in a relatively simple format:

  • Account Number
  • Name
  • Address
  • Past Due Amount
  • Current Amount Due
  • Due Date

This was simple and it worked. Detailed records would be kept on printed pieces of paper in a big, smelly, damp basement where they could successfully grow mold and other assorted fungi. Whenever a customer had a dispute, a clerk would have to visit the records room and pull up the customer’s information. This was a manual process that probably involved a lot of letter writing, cursing, and typewriter ribbon.

Eventually, this simple bill format would prove to be unreliable (P.S. I’m totally making this up just to illustrate a point, but I’m guessing it went something like this). In our example, there’s no way to tell when a customer paid or which customer was billed.

After some tinkering…

After a few more iterations, you probably end up with a way of storing a customer’s information and bills that looks something like this:

This is a lot more complicated from both a design perspective and an implementation perspective. One of the things that makes this implementation more difficult is that there are a number of intermediate tables to work with and these tables can become hotspots for reads as well as writes.

When you look at that design, be honest with yourself and answer this question:

How often will you view a single service history or general charge row?

Think about your answer. The fact is, you probably won’t read any of those rows on its own. You might update one if a change comes in from an external source, but otherwise all of the charges, history, etc on any given phone bill will always be read as a unit. In this particular instance, we’re always consuming a bill’s entire graph) at once. Reading a bill into memory is an onerous prospect, not to mention that summarizing phone bills in this system is a read intensive operation.

Fixing the glitch

There are a lot of ways these problems could be worked around in a traditional OLTP database. However, that’s not the point. The point is that there are problems that require actual workarounds. OLTP databases work well for many use cases, but in this case an OLTP database becomes a problem because of the high cost of reading vs writing. (Why should we read-optimize a system that was designed to be write-optimized when writes will probably account for only 10% of our activity, maybe less?)

I’ve hinted at how we fix the glitch at the beginning of this article – we look for a specialized database. In our case, we can use something called a document database. The advantage of a document database is that we’re storing an organized collection of values in the database. This collection of values is similar to a traditional tabular database – we have groups of similar data stored in named collections. The distinction comes in how the data is accessed.

When we’re saving a phone bill, we don’t have to worry about calling multiple stored procedures or a single complex procedure. There’s no need to create complex mappings between a database and our code. We create an object or object graph in the application code and save it. The software that we use to connect to our document database knows how to properly translate our fancy objects into data stored on a disk somewhere.

This solution has several upsides:

  • Related data is stored in close proximity on disk
  • Documents do not require strict structure
  • Documents may change properties without requiring complex changes to physical schema

Physical Proximity

My data is close together, so what?

In a traditional OLTP database, your data may be scattered across one or multiple disk drives. Physical drive platters will have to spin to locate the data on different parts of your storage medium. Drive read/write arms will have to move around in coordination with the spinning platters. The more complex your query, the more complex the dance your physical hardware will have to do; a simple high school slow dance turns into a tango.

In a document database, all of our data is stored together in a single record. When we want to read our bill, we just have to start reading at the beginning of the bill record and stop when we come to the end. There’s no need to seek around on the disk.

You might be worried that all of your data won’t be close together on disk. And you’d be right. However, many databases (including MongoDB) allow for the creation of secondary indexes to speed up data retrieval. The biggest question you need to ask yourself is “How will the applications be accessing the data?” In many applications we’re only acting on a single object. Even when our application isn’t acting on a single object, we can pre-aggregate the data for faster reporting and retrieval. When our application only works on a single object at a time, a document database provides distinct advantages – every time we need an object, we’ll be pulling back all of the data we need in a single read operation.

Strict Structure

Databases typically require data to be rigidly structured. A table has a fixed set of columns. The datatypes, precision, and nullability can vary from column to column, but every row will have the same layout. Trying to store wildly diverse and variable data in a fixed storage medium is difficult.

Thankfully, document databases are well-suited to storing semi-structured data – since our data is a collection of attributes, it’s very easy to add or remove new attributes and change querying strategies rapidly and in response to different data structure. Better yet, document databases let us be ignorant of how the data is stored. If we want to find all bills where the account holder’s last name is ‘Smith’ and they live in Virginia but the bill doesn’t have any county tax charges, it is very easy compared to constructing the query in a typical SQL database.

Using MongoDB our query might look like:

db.bills.find( { last_name : 'Smith' },
               { state : 'Virginia' },
               { charges : { type : 'county tax',
                             $exists : false } } )

Compared to similar SQL:

SELECT  b.*
FROM    bills b
        JOIN accounts a ON b.account_id = a.id
        LEFT JOIN charges c ON b.id = c.bill_id
                               AND c.type = 'county tax'
WHERE   a.last_name = 'Smith'
        AND a.state = 'Virginia'
HAVING  COUNT(c.id) = 0

And right about now, every DBA that reads this blog is going to be shaking with rage and yelling “But that SQL is perfectly clear, I don’t know how you can expect me to understand all of those curly brackets!” I don’t expect you to understand those curly brackets. Nor do I expect developers to understand SQL. The easiest way for us to develop is to use our natural paradigm. That’s why developers write code in C#, PHP, or Ruby and DBAs do their work in some dialect of SQL. MongoDB alleviates this because all the developers are doing is constructing a list of keys and values that must be matched before a document can be returned.

Changing the Schema

Changing the schema of an OLTP database can be an onerous task. You have to wait for, or schedule, down time. Modifications have to take place. Of course, the schema modifications need to take into account any actions (like triggers or replication) that may occur in the background. This alone can require significant skill and internal database engine knowledge to write. It’s not something that application developers should be expected to know. Why do I mention application developers? 99 times out of 100, they’re the ones who are working on the database, not a dedicated DBA.

Many newer, non-traditional, databases make it incredibly easy to change the schema – just start writing the new attribute. The database itself takes care of the changes and will take that into account during querying. When a query is issued for a new attribute, records without that attribute will be ignored (just like a column with a NULL value in a traditional database).

What about Analytics?

I don’t know a lot about analytical databases, in part because they require a different skill set than the one I’ve developed. I do know a few things about them, though.

Analytical databases are currently encumbered by some of the same problems as OLTP databases – data is stored in tables made up of rows and columns. Sure, these are called dimensions/facts and attributes, but the premise is the same – it’s a row-based data store.

Row-based data stores pose particular problems for analytic databases. Analytic databases throw most of the rules about normalization in the garbage and instead duplicate data willy nilly. Without joins, it’s very quick to query and aggregate data. But the problem still remains that there is a large quantity of repeated data being stored on disk.

Columnar databases attempt to solve this problem by compressing columns with similar values and using some kind of magical method to link up columnar values with their respective rows. Sounds complicated, right? Well, it probably is. Let’s say you have a table with 10,000,000,000 rows and the CalendarYear column is a CHAR(4). If there are only 25 different values for CalendarYear in the database, would you rather store 40,000,000,000 bytes of data or 100 bytes of data? I know which makes more sense to me.

Interestingly enough, there are two approaches being taken to solving this problem. The first is by creating single-purpose columnar databases. There are several vendors providing dedicated columnar databases. Other database developers are looking for ways to leverage their existing database engines and create hybrid row and columnar databases.

Looking Into the Future

There are a lot of interesting developments going on in the database world. Many of them seem to be happening outside of the big vendor, traditional database space. Most of this work is being done to solve a particular business need. These aren’t the traditional row-based OLTP systems that we’re all familiar with from the last 30 years of database development. These are new, special purpose, databases. It’s best to think of them like a sports car or even a race car – they get around the track very quickly, but they would be a poor choice for getting your groceries.

The next time you start a new project or plan a new server deployment, think about what functionality you need. Is it necessary to have full transactional support? Do you need a row-based store? How will you use the data?

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.