The Future of Databases

July 15, 2010

The Story So Far

I’ve been in love with data storage since I first opened up SQL*Plus and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up with different ways to solve the problems what we’re facing in business. The problem is that as the field of computer science advances, and businesses increase in complexity, the ways that we store data must become more complex as well. Exponentially scaling storage complexity isn’t something that I like to think about, but it’s a distinct possibility. General purpose OLTP databases are not the future of data storage and retrieval. They are a single piece in the puzzle. We’ve been working with OLTP systems for well over 20 years. OLAP is a newer entry, bringing specialized analytical tricks (which are counter intuitive to the way relational data is stored) to the masses. Hell, there are a number of general purpose analytical storage engines on the market. These general purpose analytical databases integrate well with existing databases and provide a complement to the transactional specialization of OLTP systems. That’s the key, OLTP databases are purpose built transactional databases. They’re optimized for write operations because way back in the dark ages it was far more expensive to write data to disk than it was to read from disk. Data couldn’t be cached in memory because memory was scarce. Architectural decisions were made. The way that we design our databases is specifically designed to work within this structure. A well designed, normalized, database has minimal duplication of data. In OLTP systems this also serves to minimize the number of writes to disk when a common piece of data needs to be changed. I can remember when I was a kid and the United States Postal Service changed from using three letter state abbreviations to two letter abbreviations. I have to wonder what kind of difficulties this caused for many databases… In the 40 years since E.F. Codd’s paper was published, the programming languages that we use have changed considerably. In 1970, COBOL was still relatively new. 1971 saw the the introduction of C, 1975 brought us MS Basic. 1979, 1980, and 1983 saw Ada, Smalltalk-80, Objective-C, and C++ ushering in a wave of object oriented languages. Suddenly programmers weren’t working on singular data points, they were working with a object that contained a collection of properties. The first ANSI SQL standard was codified in 1986. 1990 gave us Windows 3 and the desktop PC became more than a blinking cursor. The web exploded in 1996, 2001, and continues to explode again in a frenzy of drop shadows, bevels, mirror effects, and Flash. Throughout the history of computing, we’ve been primarily working with tuples of data – attributes mapped to values; rows to you and I. This model holds up well when we’re working with a entity composed of a single tuple. What happens, though, when the entity becomes more complex? The model to retrieve and modify the entity becomes more complex as well. We can’t issue a simple update statement anymore, we have to go through more complex operations to make sure that the data is kept up to date.

Examples Should Make Things Clearer

Let’s take a look at something simple: my phone bill.

In the beginning…

Long ago, a phone bill was probably stored in a relatively simple format:

Account Number
Name
Address
Past Due Amount
Current Amount Due
Due Date

This was simple and it worked. Detailed records would be kept on printed pieces of paper in a big, smelly, damp basement where they could successfully grow mold and other assorted fungi. Whenever a customer had a dispute, a clerk would have to visit the records room and pull up the customer’s information. This was a manual process that probably involved a lot of letter writing, cursing, and typewriter ribbon. Eventually, this simple bill format would prove to be unreliable (P.S. I’m totally making this up just to illustrate a point, but I’m guessing it went something like this). In our example, there’s no way to tell when a customer paid or which customer was billed.

After some tinkering…

After a few more iterations, you probably end up with a way of storing a customer’s information and bills that looks something like this: This is a lot more complicated from both a design perspective and an implementation perspective. One of the things that makes this implementation more difficult is that there are a number of intermediate tables to work with and these tables can become hotspots for reads as well as writes. When you look at that design, be honest with yourself and answer this question:

How often will you view a single service history or general charge row?

Think about your answer. The fact is, you probably won’t read any of those rows on its own. You might update one if a change comes in from an external source, but otherwise all of the charges, history, etc on any given phone bill will always be read as a unit. In this particular instance, we’re always consuming a bill’s entire graph at once. Reading a bill into memory is an onerous prospect, not to mention that summarizing phone bills in this system is a read intensive operation.

Fixing the glitch

There are a lot of ways these problems could be worked around in a traditional OLTP database. However, that’s not the point. The point is that there are problems that require actual workarounds. OLTP databases work well for many use cases, but in this case an OLTP database becomes a problem because of the high cost of reading vs writing. (Why should we read-optimize a system that was designed to be write-optimized when writes will probably account for only 10% of our activity, maybe less?) I’ve hinted at how we fix the glitch at the beginning of this article – we look for a specialized database. In our case, we can use something called a document database. The advantage of a document database is that we’re storing an organized collection of values in the database. This collection of values is similar to a traditional tabular database – we have groups of similar data stored in named collections. The distinction comes in how the data is accessed. When we’re saving a phone bill, we don’t have to worry about calling multiple stored procedures or a single complex procedure. There’s no need to create complex mappings between a database and our code. We create an object or object graph in the application code and save it. The software that we use to connect to our document database knows how to properly translate our fancy objects into data stored on a disk somewhere. This solution has several upsides:

Related data is stored in close proximity on disk
Documents do not require strict structure
Documents may change properties without requiring complex changes to physical schema

Physical Proximity

My data is close together, so what? In a traditional OLTP database, your data may be scattered across one or multiple disk drives. Physical drive platters will have to spin to locate the data on different parts of your storage medium. Drive read/write arms will have to move around in coordination with the spinning platters. The more complex your query, the more complex the dance your physical hardware will have to do; a simple high school slow dance turns into a tango. In a document database, all of our data is stored together in a single record. When we want to read our bill, we just have to start reading at the beginning of the bill record and stop when we come to the end. There’s no need to seek around on the disk. You might be worried that all of your data won’t be close together on disk. And you’d be right. However, many databases (including MongoDB) allow for the creation of secondary indexes to speed up data retrieval. The biggest question you need to ask yourself is “How will the applications be accessing the data?” In many applications we’re only acting on a single object. Even when our application isn’t acting on a single object, we can pre-aggregate the data for faster reporting and retrieval. When our application only works on a single object at a time, a document database provides distinct advantages – every time we need an object, we’ll be pulling back all of the data we need in a single read operation.

Strict Structure

Databases typically require data to be rigidly structured. A table has a fixed set of columns. The datatypes, precision, and nullability can vary from column to column, but every row will have the same layout. Trying to store wildly diverse and variable data in a fixed storage medium is difficult. Thankfully, document databases are well-suited to storing semi-structured data – since our data is a collection of attributes, it’s very easy to add or remove new attributes and change querying strategies rapidly and in response to different data structure. Better yet, document databases let us be ignorant of how the data is stored. If we want to find all bills where the account holder’s last name is ‘Smith’ and they live in Virginia but the bill doesn’t have any county tax charges, it is very easy compared to constructing the query in a typical SQL database. Using MongoDB our query might look like:``` db.bills.find( { last_name : ‘Smith’ }, { state : ‘Virginia’ }, { charges : { type : ‘county tax’, $exists : false } } )

Compared to similar SQL: SELECT b.* FROM bills b JOIN accounts a ON b.account_id = a.id LEFT JOIN charges c ON b.id = c.bill_id AND c.type = ‘county tax’ WHERE a.last_name = ‘Smith’ AND a.state = ‘Virginia’ HAVING COUNT(c.id) = 0


#### Changing the Schema

Changing the schema of an OLTP database can be an onerous task. You have to wait for, or schedule, down time. Modifications have to take place. Of course, the schema modifications need to take into account any actions (like triggers or replication) that may occur in the background. This alone can require significant skill and internal database engine knowledge to write. It’s not something that application developers should be expected to know. Why do I mention application developers? 99 times out of 100, they’re the ones who are working on the database, not a dedicated DBA. Many newer, non-traditional, databases make it incredibly easy to change the schema – just start writing the new attribute. The database itself takes care of the changes and will take that into account during querying. When a query is issued for a new attribute, records without that attribute will be ignored (just like a column with a NULL value in a traditional database).

### What about Analytics?

I don’t know a lot about analytical databases, in part because they require a different skill set than the one I’ve developed. I do know a few things about them, though. Analytical databases are currently encumbered by some of the same problems as OLTP databases – data is stored in tables made up of rows and columns. Sure, these are called dimensions/facts and attributes, but the premise is the same – it’s a row-based data store. Row-based data stores pose particular problems for analytic databases. Analytic databases throw most of the rules about normalization in the garbage and instead duplicate data willy nilly. Without joins, it’s very quick to query and aggregate data. But the problem still remains that there is a large quantity of repeated data being stored on disk. [Columnar databases](http://wiki.toadforcloud.com/index.php/Survey_distributed_databases#Columnar_Databases) attempt to solve this problem by compressing columns with similar values and using some kind of magical method to link up columnar values with their respective rows. Sounds complicated, right? Well, it probably is. Let’s say you have a table with 10,000,000,000 rows and the CalendarYear column is a CHAR(4). If there are only 25 different values for CalendarYear in the database, would you rather store 40,000,000,000 bytes of data or 100 bytes of data? I know which makes more sense to me. Interestingly enough, there are two approaches being taken to solving this problem. The first is by creating single-purpose columnar databases. There are[several](http://www.sybase.com/products/datawarehousing/sybaseiq) [vendors](http://infinidb.org/) providing dedicated columnar databases. Other database developers are looking for ways to leverage their [existing database engines](http://blog.tapoueh.org/char10.html#sec10)and create hybrid row and columnar databases.

### Looking Into the Future

There are a lot of interesting developments going on in the database world. Many of them seem to be happening outside of the big vendor, traditional database space. Most of this work is being done to solve a particular business need. These aren’t the traditional row-based OLTP systems that we’re all familiar with from the last 30 years of database development. These are new, special purpose, databases. It’s best to think of them like a sports car or even a race car – they get around the track very quickly, but they would be a poor choice for getting your groceries. The next time you start a new project or plan a new server deployment, think about what functionality you need. Is it necessary to have full transactional support? Do you need a row-based store? How will you use the data?