
Data Durability

A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods: writes are recorded in a replay log on disk while also being cached in memory, and whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write-ahead log is that we can always recover from a crash (so long as we have the log files, of course).

How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code has to call it explicitly to get that level of acknowledgement.
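Here’s a minimal sketch of what that looks like from the mongo shell – the collection name and the w/wtimeout values are made up for illustration:

// Insert on the master, then ask for acknowledgement from two members of the
// replica set (waiting at most five seconds) before considering the write done.
db.orders.insert({ customer : 'jane', total : 29.95 });
db.runCommand({ getlasterror : 1, w : 2, wtimeout : 5000 });

The drivers expose the same knob, so the acknowledgement level can be chosen per write rather than cluster-wide.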

Setting a w-value for writes is something that was mentioned in Getting Faster Writes with Riak, although in that article we were decreasing durability to increase write performance. In Amazon Dynamo-inspired systems, writes are not considered complete until multiple nodes have responded. The advantage is that durable replication is enforced by the database, and clients have to explicitly opt in to weaker guarantees for the data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo-inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time, random read/write access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – instead there’s a commit log (RDBMS people will know this as a write-ahead log). Essentially, when a write comes in from a client, it is first written to the commit log (which is stored using HDFS), then to an in-memory structure; when that structure fills up, it is flushed to the filesystem. Here’s something cunning: since the commit log is written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down, it’s easy enough to recover – that region server’s commit log is split apart and distributed to other region servers, which then take up the load of the failed region server.

There are plenty of HBase details that have been grossly oversimplified or blatantly ignored here for the sake of brevity. Additional details can be found in HBase Architecture 101 – Storage as well as this Advanced HBase presentation. As HBase is inspired by Google’s Bigtable, additional information can be found in Chang et al.’s Bigtable: A Distributed Storage System for Structured Data and The Google File System.

Interestingly enough, there is a proposed feature for PostgreSQL 9.1 to add synchronous replication. Current replication in PostgreSQL is more like asynchronous database mirroring in SQL Server, or the default replica set write scenario with MongoDB. Synchronous replication makes it possible to ensure that data has been written to a standby before a commit is acknowledged. Robert Haas discusses some of the pros and cons of replication in PostgreSQL in his post What Kind of Replication Do You Need?.

Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability is baked into Azure at the filesystem level. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team has put together an excellent overview. Needless to say, Microsoft has implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow for data to be seamlessly geographically replicated.

Even SQL Azure, Microsoft’s cloud-based relational database, takes advantage of this replication. In SQL Azure, when a row is written to the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team are addressing single server durability concerns.

Why is single server durability important for any database? Aside from guaranteeing that data is correct in the event of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.


Three Mistakes I Made With MongoDB

When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I progressed with MongoDB, though, I started running into more and more problems. They weren’t problems with MongoDB; they were problems with the way I was thinking about MongoDB.

I Know How Stuff Works

The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it’s a good thing. It gets tricky when you start to switch around your frame of reference.

As a former DBA and alleged database expert, I’m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it’s a heck of a lot like an RDBMS (sorry guys, that’s just the truth of it). It has collections that look and act a lot like tables. There are indexes, there’s a query engine, there’s even replication. There are even more features and functionality that map really well between MongoDB and relational databases. It’s close enough that it’s painless to make a switch. The paradigms don’t match up exactly, but there are similarities.

Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy was I ever wrong. I’ve had to stop thinking that I know anything about MongoDB and look up the answers to questions, rather than assume I know anything. It’s been frustrating, but it’s also been educational.

I Know How Data is Modeled

It’s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers. Once you get used to normalization, it’s easy to fall into that pattern.

Document databases, like MongoDB, don’t work to their full potential when you normalize your data. Document databases, in my experience, do the opposite; they work best when the data is stored as a document. Document data is similar to the way data is used in the application: child data is stored with its parent. An order’s line items are just a child collection of the order – there is no OrderHeader and OrderDetails table. Years of working within the same set of rules made it easy to slip into old habits.

Let’s say I have a number of user created documents in a Documents collection. Documents have an author as well as a set of tags created by the author. In a relational database, we’d have something like:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(30) NOT NULL,
    password VARCHAR(50) NOT NULL
);

CREATE TABLE documents (
    document_id INT PRIMARY KEY,
    title VARCHAR(30) NOT NULL,
    body VARCHAR(MAX) NOT NULL,
    user_id INT NOT NULL REFERENCES users(user_id)
);

CREATE TABLE tags (
   tag_id INT PRIMARY KEY,
   name VARCHAR(30) NOT NULL
);

CREATE TABLE document_tags (
   document_id INT NOT NULL REFERENCES documents(document_id),
   tag_id INT NOT NULL REFERENCES tags(tag_id),
   PRIMARY KEY (document_id, tag_id)
);

For a lot of things, this makes perfect sense. With MongoDB, we’d have a documents collection and a users collection. We might have a separate tags collection as well, but that would be used as an inverted index for searching. The users collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn’t just store a pointer to the appropriate record in the users collection. The appropriate thing to do would be to cache the relevant user data in the document as well as store a pointer to the users collection. The same holds true for the document’s tags – why create a join construct between the two collections when we can store all of the tag data we need in the appropriate document? If we decide that we need to find documents by their tags we have two choices:

  1. Create an index on the document.tags property
  2. Create a tags inverted index.

If you’re interested in indexes, the MongoDB documentation on the subject is a great place to start. There is a specific kind of index called a multikey index that allows you to index arrays of values. Inverted indexes are interesting, and I covered them while talking about building secondary indexes in Riak.
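Here’s a rough sketch of what the embedded approach looks like from the mongo shell – the field names and values are made up for illustration:

// One document with the author's info cached inline and a plain array of tags.
db.documents.save({
    title  : 'Intro to Document Databases',
    body   : '...',
    author : { user_id : 12, username : 'jeremiah' },  // cached from the users collection
    tags   : [ 'mongodb', 'modeling', 'nosql' ]
});

// A multikey index on the tags array lets us find documents by tag directly.
db.documents.ensureIndex({ tags : 1 });
db.documents.find({ tags : 'mongodb' });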

With a document database we’re trying to minimize the number of reads that we’re performing at any given time. A document should be a logical construct of whatever application entity we’re saving – a record of that entity at the time it was saved. It will contain cached information about the user, the document body and its associated metadata, and a list of tags.

Of course, since we’re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don’t like it, feel free to comment about it.

I Can Just Drop This In

Despite appearances and claims to the contrary, MongoDB is not a drop-in replacement for an RDBMS. A relational database provides a phenomenal number of features for free – indexing, declarative referential integrity, transaction support, multi-version concurrency control, and multi-statement transactions, to name a few. These features come with a price. Likewise, MongoDB provides a different set of features, and they also have a price.

Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used MongoMapper to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easy. Using an O/R-M made it feel like I was using a relational database. It made things trickier, though, especially when I wandered into some of the blurry places I’ve already talked about. In hindsight, I should have used the stock drivers, built my own abstractions, and then replaced them when it became necessary. I don’t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself the first time and then replace it when you’re building more plumbing than functionality.

Would I Use MongoDB Again?

Sure, if the project needed it. I don’t believe that MongoDB is a drop-in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that way does a disservice to both MongoDB and the RDBMS: they both have their own strengths and weaknesses.

Comparing MongoDB and SQL Server Replication

MongoDB has replication built in. So do SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What’s the difference? What makes each of them a unique and special snowflake?

I recently read a three part series on MongoDB replication (Replication Internals, Getting to Know Your Oplog, Bending the Oplog to Your Will) in an effort to better understand MongoDB’s replication compared to SQL Server’s replication.

Logging Sidebar

Before we get started, it’s important to distinguish between the oplog and MongoDB’s regular log. By default, MongoDB pipes its log to STDOUT… unless you supply the --logpath command line flag. Logging to STDOUT is fine for development, but you’ll want to make sure you log to a file for production use. The MongoDB log file is not like SQL Server’s log. It isn’t used for recovery playback. It’s an activity log. Sort of like the logs for your web server.

What’s The Same?

Both MongoDB and SQL Server store replicated data in a central repository. SQL Server stores transactions to be replicated in the distribution database. MongoDB stores replicated writes in the oplog collection. The most immediate difference between the two mechanisms is that SQL Server uses the transaction as the demarcation point while MongoDB uses the individual command as the demarcation point.

All of our transactions (MongoDB has transactions… they just apply to a single command) are logged. That log is used to ship commands over to a subscriber. Both SQL Server and MongoDB support having multiple subscribers to a single database. In MongoDB, this is referred to as a replica set – every member of the set will receive all of the commands from the master. MongoDB adds some additional features: any member of a replica set may be promoted to the master server if the original master server dies. This can be configured to happen automatically.
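If you want to look at the oplog yourself, it lives in the local database – for replica sets the collection is oplog.rs (older master/slave setups use oplog.$main). A quick peek from the shell:

use local
// The most recent operation that will be shipped out to the replicas.
db.oplog.rs.find().sort({ $natural : -1 }).limit(1);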

The Ouroboros

The Ouroboros is a mythical creature that devours its own tail. Like the Ouroboros, the MongoDB oplog devours its own tail. In ideal circumstances, this isn’t a problem. The oplog will happily write away. The replica servers will happily read away and, in general, keep up with the writing to the oplog.

The oplog is a fixed-size (capped) collection so, like the circular write-ahead log in most RDBMSes, it will eventually wrap around and begin to eat itself. This is fine… most of the time.

Unfortunately, if the replicas fall far enough behind, the oplog will overwrite the transactions that the replicas are reading. Yes, you read that correctly – your database will overwrite undistributed transactions. DBAs will most likely recoil in horror. Why is this bad? Well, under extreme circumstances you may have no integrity.

Let’s repeat that, just in case you missed it the first time:

There is no guarantee of replica integrity.

Now, before you put on your angry pants and look at SQL Server Books Online to prove me wrong, this is also entirely possible with transactional replication in SQL Server. It’s a little bit different, but the principle still applies. When you set up transactional replication in SQL Server, you also need to set up a retention period. If your replication is down for longer than X hours, SQL Server is going to tell you to cram it up your backside and rebuild your replication from scratch.

Falling Behind

Falling behind is easy to do when a server is under heavy load. But, since MongoDB avoids writing to disk to increase performance, that’s not a problem, right?

Theoretically yes. In reality that’s not always the case.

When servers are under a heavy load, a lot of weird things can happen. Heavy network traffic can overwhelm TCP/IP offloading and push network processing back onto the CPU. When you’re using commodity hardware with commodity storage, you might be using software RAID instead of hardware RAID to simulate one giant drive for data. Software RAID can be computationally expensive, especially if you encounter a situation where you start swapping to disk. Before you know it, you have a perfect storm of one-off factors that have brought your shiny new server to its knees.

In the process, your oplog is happily writing away. The replica is falling further behind because you’re reading from your replica and writing to the master (that’s what we’re supposed to do, after all). Soon enough, your replicas are out of sync and you’ve lost data.
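The shell has a couple of helpers for keeping an eye on this before it turns into lost data (run them on the master):

// How big the oplog is and how much time it currently spans.
db.printReplicationInfo();

// How many seconds behind the master each replica is.
db.printSlaveReplicationInfo();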

Falling Off a Cliff

Unfortunately, in this scenario, you might have problems recovering because the full resync also uses a circular oplog to determine where to start up replication again. The only way you could resolve this nightmare storm would be to shut down your forward facing application, kill incoming requests, and bring the database back online slowly and carefully.

Stopping I/O from incoming writes will make it easy for the replicas to catch up to the master and perform any shard reallocation that you need to split the load up more effectively.

Climbing Gear, Please

I’ve bitched a lot in this article about MongoDB’s replication. As a former DBA, it’s a scary model. I’ve also bitched a lot in the past about SQL Server’s transactional replication – logs can grow out of control if a subscriber falls behind or dies – but that behavior happens with good reason. The SQL Server dev team made the assumption that a replica should be consistent with the master. In order to keep a replica consistent, all of the undistributed commands need to be kept somewhere (in a log file) until all of the subscribers/replicas can be brought up to speed. This does result in a massive hit to your disk usage, but it also keeps your replicated databases in sync with the master.

Just like with MongoDB, there are times when a SQL Server subscriber may fall so far behind that you need to rebuild the replication. This is never an easy choice, no matter which platform you’re using, and it’s a decision that should not be taken lightly. MongoDB makes this choice a bit easier because MongoDB might very well eat its own oplog. Once that happens, you have no choice but to rebuild replication.

Replication is hard to administer and hard to get right. Be careful and proceed with caution, no matter what your platform.

At Least There is a Ladder

You can climb out of this hole and, realistically, it’s not that bad of a hole. In specific circumstances you may end up in a situation where you will have to take the front end application offline in order to resync your replicas. It’s not the best option, but at least there is a solution.

Every feature has a trade off. In this case, relational databases give up some performance to guarantee integrity, whereas MongoDB gives up those guarantees in exchange for immediate performance – and for potential maintenance and recovery problems down the road.


Foursquare and MongoDB: What If

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they’re focusing on the technical problems with the current setup at Foursquare. They’re picking it apart and looking at what is right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak

Riak is a massively scalable key/value data store. It’s based on Amazon’s Dynamo. If you don’t want to read the paper, that just means that it uses some magic to make sure that data is evenly spread throughout the database cluster and that adding a new node makes it very easy to rebalance the cluster.

What would have happened at Foursquare if they had been using Riak?

Riak still suffers from the same performance characteristics around disk access as MongoDB – once you have to page to disk, operations become slow and throughput dries up. This is a common problem in any software that needs to access disks – disks are slow, RAM is fast.

Riak, however, has an interesting distinction. It allocates keys inside the cluster using something called a consistent hash ring – this is just a convenient way to rapidly allocate ranges of keys to different nodes within a cluster. The ring itself isn’t interesting. What’s exciting is that the ring is divided into partitions (64 by default). Every node in the system claims an equal share of those 64 partitions. In addition, each partition is replicated so there are always three copies of a given key/value pair at any time. Because keys – and their copies – are spread evenly around the ring, it’s unlikely that any single node will become unbalanced. In theory, if Foursquare had used Riak, it is very unlikely that they would have run into a problem where a single node becomes full.

How would this consistent hash ring magical design choice have helped? Adding a new node causes the cluster to redistribute data in the background. The new node will claim an equal amount of space in the cluster and the data will be redistributed from the other nodes in the background. The same thing happens when a node fails, by the way. Riak also only stores keys in memory, not all of the data. So it’s possible to reference an astronomical amount of data before running out of RAM.

There’s no need to worry about replica sets, re-sharding, or promoting a new node to master. Once a node joins a Riak cluster, it takes over its share of the load. As long as you have the network throughput on your local network (which you probably do), then this operation can be fairly quick and painless.

Cassandra

Cassandra, like Riak, is based on Amazon’s Dynamo. It’s a massively distributed data store. I suspect that if Foursquare had used Cassandra, they would have run into similar problems.

Cassandra makes use of range partitioning to distribute data within the cluster. The Foursquare database was keyed off of the user name which, in turn, saw abnormal growth because some groups of users were more active than others. Names also tend to clump around certain letters, especially when you’re limited to a Latin character set. I know a lot more Daves than I know Zachariahs, and I have a lot of friends whose names start with the letter L. This distribution causes data within the cluster to be overly allocated to one node. This would, ultimately, lead to the same problems that happened at Foursquare with their MongoDB installation.

That being said, it’s possible to use a random partitioner for the data in Cassandra. The random partitioner makes it very easy to add additional nodes and distribute data across them. The random partitioner comes with a price. It makes it impossible to do quick range slice queries in Cassandra – you can no longer say “I want to see all of the data for January 3rd, 2010 through January 8th, 2010”. Instead, you would need to build up custom indexes to support your querying and build batch processes to load the indexes. The tradeoffs between the random partitioner and the order preserving partitioner are covered very well in Dominic Williams’s article Cassandra: RandomPartitioner vs OrderPreservingPartitioner.

Careful use of the random partitioner and supporting batch operations could have prevented the outage that Foursquare saw, but this would have led to different design challenges, some of which may have been difficult to overcome without resorting to a great deal of custom code.

HBase/HDFS/Hadoop

HBase is a distributed column-oriented database built on top of HDFS. It is based on Google’s Bigtable database as described in “Bigtable: A Distributed Storage System for Structured Data”. As an implementation of Bigtable, HBase has a number of advantages over MongoDB – write-ahead logging, rich ad hoc querying, and redundancy.

HBase is not going to suffer from the same node redistribution problems that caused the Foursquare outage. When it comes time to add a new node to the cluster, data will be migrated in much larger chunks, one data file at a time. This makes it much easier to add a new data node and redistribute data across the network.

Just like a relational database, HBase is designed so that all data doesn’t need to reside in memory for good performance. The internal data structures are built in a way that makes it very easy to find data and return it to the client. Keys are held in memory and two levels of caching make sure that frequently used data will be in memory when a client requests it.

HBase also has the advantage of using a write ahead log. In the event of a drive failure, it is possible to recover from a backup of your HBase database and play back the log to make sure data is correct and consistent.

If all of this sounds like HBase is designed to be a replacement for an RDBMS, you would be close. HBase is a massively distributed database, just like Bigtable. As a result, data needs to be logged because there is a chance that a hardware node will fail and will need to be recovered. Because HBase is a column-oriented database, we need to be careful not to treat it exactly like a relational database.

A Relational Database

To be honest, Foursquare’s data load would be trivial in any relational database. SQL Server, Oracle, MySQL, and PostgreSQL can all handle orders of magnitude more data than the 132GB of data that Foursquare was storing at the time of the outage. This raises the question “How would we handle the constant write load?” Foursquare is a write-intensive application.

Typically, in the relational database world, when we need to scale read and write loads we add more disks. There is a finite amount of space in a server chassis, and these local disks don’t provide the redundancy necessary for data security and performance; software RAID is also CPU intensive and slow. A better solution is to purchase a dedicated storage device – a SAN, NAS, or DAS. All of these devices offer read/write caching and can be configured with a variety of RAID levels for performance and redundancy.

RDBMSes are known quantities – they are easy to scale to certain points. Judging by the amount of data that Foursquare reports to have, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time. The downside to this approach is that an RDBMS is much costlier per TB of storage (up to ten times more expensive) than using MongoDB, but if your business is your data, then it’s important to keep the data safe.

Conclusion

It’s difficult to say if a different database solution would have prevented the Foursquare outage. But it is a good opportunity to highlight how different data storage systems would respond in the same situation.

MongoDB – Now With Sharding

Version 1.6 of MongoDB was released last week while I was at devLink. What makes this an important point release? Two things: sharding and replica sets.

Sharding is a way of partitioning data. Data will be sharded by a key – this could be based on zip code, customer number, SKU, or any other aspect of the data. Sharding data in MongoDB occurs on a collection by collection basis, so if you want related data to be on the same server you’ll need to pick a sharding key that is present in every collection. This design decision makes sense, though, because some collections may grow faster than others. The point of sharding is, in part, to spread load across multiple servers.
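As a rough sketch of what turning this on looks like from the shell, run through mongos (the database, collection, and key names here are made up):

// Enable sharding for a database, then shard one of its collections on a key.
use admin
db.runCommand({ enablesharding : 'checkins' });
db.runCommand({ shardcollection : 'checkins.visits', key : { user_id : 1 } });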

What makes this important? Previously, sharding was difficult to set up and administer, and it required custom application code to be written and maintained. More important, though, sharding gives MongoDB the ability to scale across multiple servers with minimal effort.

What would happen if a single node in a set of sharded servers got very busy? Well, MongoDB would detect that one of the nodes is growing faster than the others and it would start balancing the load across the other servers. This might seem like it would violate my earlier statement about how we set MongoDB up using a sharding key that we define. Here’s the catch: MongoDB uses the sharding key we define to decide where data lives, but it doesn’t promise to keep those key ranges where they started. If things start getting busy, it will split the key range into chunks and migrate those chunks to less busy servers. Those changes get reported throughout the entire cluster of servers and everyone knows where their data is, although nobody outside of the cluster really needs to care.

Replica Sets

Replica sets are a new and improved way to perform data replication in MongoDB. Basically, we set up replication in a cluster of servers. If any single server fails, another server in the replica set will pick up the load. Once we’re able to get the dead server back up and running, the replica set will automatically start up a recovery process and our users will never know that there was an outage.

There can be only one master server at any given time, and if it fails the replica set protects us by promoting another member. Through the magic of network heartbeats, we can be aware of all of the servers in the replica set. Interestingly, the master server is determined by a priority setting that is assigned to each server. This way, we could use older hardware to serve as a backup (or read-only server) to the master and use faster machines in the replica set to take over from the master in the event of any kind of hardware failure.
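A minimal sketch of setting that up from the shell – the host names are made up, and in these early versions priority is effectively just 0 (never becomes master) or 1 (eligible to become master):

rs.initiate({
    _id : 'rs0',
    members : [
        { _id : 0, host : 'db1.example.com:27017', priority : 1 },
        { _id : 1, host : 'db2.example.com:27017', priority : 1 },
        { _id : 2, host : 'db3.example.com:27017', priority : 0 }  // older box, backup/read-only
    ]
});

rs.status();  // shows which member is master and how the others are syncing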

How It Works

MongoDB Sharding Diagram

Basically, here’s what happens (if you want more details, please see the Sharding Introduction):

  1. The mongos server is a router that makes our complicated MongoDB set up look like a single server to the application.
  2. The mongod config servers maintain the shards. They know where data is stored and will attempt to balance the shards if any single node gets out of whack.
  3. Replica sets provide localized redundancy for each shard.

Gotchas

There are a few things to be aware of when you’re considering sharding with MongoDB:

  1. If a configuration server goes down, you can no longer reallocate data if any shards become write hot spots. This meta-data must be writeable for data to be repartitioned. You can still read and write data, but load will not be distributed.
  2. Choose sharding keys wisely. A sharding key with too few distinct values will do you no good: all of the data can end up in one giant chunk on one node and you will be unable to split it across multiple nodes.
  3. Some queries will use multiple shards – make sure you understand data distribution, querying patterns, and potential sharding keys.
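A quick way to see how chunks ended up distributed – and whether one shard is turning into a hot spot – is the shell helper below, run through mongos:

// Lists each shard, each sharded collection, and which chunks live where.
db.printShardingStatus();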


MongoDB – Basic Querying

I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB.


Download the sample code. A QuickTime/H.264/iPhone formatted video is also available – download it now.

The Future of Databases

The Story So Far

I’ve been in love with data storage since I first opened up SQL*Plus and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up with different ways to solve the problems that we’re facing in business. The problem is that as the field of computer science advances, and businesses increase in complexity, the ways that we store data must become more complex as well. Exponentially scaling storage complexity isn’t something that I like to think about, but it’s a distinct possibility.

General purpose OLTP databases are not the future of data storage and retrieval. They are a single piece in the puzzle. We’ve been working with OLTP systems for well over 20 years. OLAP is a newer entry, bringing specialized analytical tricks (which run counter to the way relational data is stored) to the masses. Hell, there are a number of general purpose analytical storage engines on the market. These general purpose analytical databases integrate well with existing databases and provide a complement to the transactional specialization of OLTP systems.

That’s the key: OLTP databases are purpose-built transactional databases. They’re optimized for write operations because, way back in the dark ages, it was far more expensive to write data to disk than it was to read from disk. Data couldn’t be cached in memory because memory was scarce. Architectural decisions were made, and the way we design our databases grew up around those decisions. A well designed, normalized database has minimal duplication of data. In OLTP systems this also serves to minimize the number of writes to disk when a common piece of data needs to be changed. I can remember when I was a kid and the United States Postal Service changed from using three letter state abbreviations to two letter abbreviations. I have to wonder what kind of difficulties this caused for many databases…

In the 40 years since E.F. Codd’s paper was published, the programming languages that we use have changed considerably. In 1970, COBOL was still relatively new. 1971 saw the introduction of C, and 1975 brought us MS Basic. 1979, 1980, and 1983 saw Ada, Smalltalk-80, Objective-C, and C++ ushering in a wave of object oriented languages. Suddenly programmers weren’t working on singular data points; they were working with an object that contained a collection of properties. The first ANSI SQL standard was codified in 1986. 1990 gave us Windows 3 and the desktop PC became more than a blinking cursor. The web exploded in 1996, again in 2001, and continues to explode in a frenzy of drop shadows, bevels, mirror effects, and Flash.

Throughout the history of computing, we’ve been primarily working with tuples of data – attributes mapped to values; rows to you and I. This model holds up well when we’re working with an entity composed of a single tuple. What happens, though, when the entity becomes more complex? The model to retrieve and modify the entity becomes more complex as well. We can’t issue a simple update statement anymore; we have to go through more complex operations to make sure that the data is kept up to date.

Examples Should Make Things Clearer

Let’s take a look at something simple: my phone bill.

In the beginning…

Long ago, a phone bill was probably stored in a relatively simple format:

  • Account Number
  • Name
  • Address
  • Past Due Amount
  • Current Amount Due
  • Due Date

This was simple and it worked. Detailed records would be kept on printed pieces of paper in a big, smelly, damp basement where they could successfully grow mold and other assorted fungi. Whenever a customer had a dispute, a clerk would have to visit the records room and pull up the customer’s information. This was a manual process that probably involved a lot of letter writing, cursing, and typewriter ribbon.

Eventually, this simple bill format would prove to be unreliable (P.S. I’m totally making this up just to illustrate a point, but I’m guessing it went something like this). In our example, there’s no way to tell when a customer paid or which customer was billed.

After some tinkering…

After a few more iterations, you probably end up with a way of storing a customer’s information and bills that looks something like this:

This is a lot more complicated from both a design perspective and an implementation perspective. One of the things that makes this implementation more difficult is that there are a number of intermediate tables to work with and these tables can become hotspots for reads as well as writes.

When you look at that design, be honest with yourself and answer this question:

How often will you view a single service history or general charge row?

Think about your answer. The fact is, you probably won’t read any one of those rows on its own. You might update one if a change comes in from an external source, but otherwise all of the charges, history, and other detail on any given phone bill will always be read as a unit. In this particular instance, we’re always consuming a bill’s entire object graph at once. Reading a bill into memory is an onerous prospect, not to mention that summarizing phone bills in this system is a read-intensive operation.

Fixing the glitch

There are a lot of ways these problems could be worked around in a traditional OLTP database. However, that’s not the point. The point is that there are problems that require actual workarounds. OLTP databases work well for many use cases, but in this case an OLTP database becomes a problem because of the high cost of reading vs writing. (Why should we read-optimize a system that was designed to be write-optimized when writes will probably account for only 10% of our activity, maybe less?)

I’ve hinted at how we fix the glitch at the beginning of this article – we look for a specialized database. In our case, we can use something called a document database. The advantage of a document database is that we’re storing an organized collection of values in the database. This collection of values is similar to a traditional tabular database – we have groups of similar data stored in named collections. The distinction comes in how the data is accessed.

When we’re saving a phone bill, we don’t have to worry about calling multiple stored procedures or a single complex procedure. There’s no need to create complex mappings between a database and our code. We create an object or object graph in the application code and save it. The software that we use to connect to our document database knows how to properly translate our fancy objects into data stored on a disk somewhere.
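As a sketch of what that saved object graph might look like in the mongo shell – all of the field names and values here are made up for illustration:

// One bill, saved as a single document: account info, charges, and all.
db.bills.save({
    account_number : 1234567,
    last_name      : 'Smith',
    state          : 'Virginia',
    due_date       : new Date(2010, 10, 1),
    past_due       : 0,
    current_due    : 54.17,
    charges : [
        { type : 'service',     amount : 39.99 },
        { type : 'federal tax', amount : 3.12 },
        { type : '911 fee',     amount : 0.75 }
    ]
});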

This solution has several upsides:

  • Related data is stored in close proximity on disk
  • Documents do not require strict structure
  • Documents may change properties without requiring complex changes to physical schema

Physical Proximity

My data is close together, so what?

In a traditional OLTP database, your data may be scattered across one or multiple disk drives. Physical drive platters will have to spin to locate the data on different parts of your storage medium. Drive read/write arms will have to move around in coordination with the spinning platters. The more complex your query, the more complex the dance your physical hardware will have to do; a simple high school slow dance turns into a tango.

In a document database, all of our data is stored together in a single record. When we want to read our bill, we just have to start reading at the beginning of the bill record and stop when we come to the end. There’s no need to seek around on the disk.

You might be worried that all of your data won’t be close together on disk. And you’d be right. However, many databases (including MongoDB) allow for the creation of secondary indexes to speed up data retrieval. The biggest question you need to ask yourself is “How will the applications be accessing the data?” In many applications we’re only acting on a single object. Even when our application isn’t acting on a single object, we can pre-aggregate the data for faster reporting and retrieval. When our application only works on a single object at a time, a document database provides distinct advantages – every time we need an object, we’ll be pulling back all of the data we need in a single read operation.
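The pre-aggregation idea is just another write as far as MongoDB is concerned. As a sketch (the collection and field names are made up): every time a bill is saved, bump a running total in a summary document with an upsert.

// The third argument 'true' means upsert: create the summary document if it
// doesn't exist yet, otherwise increment the existing one in place.
db.monthly_totals.update(
    { _id : { account : 1234567, month : '2010-11' } },
    { $inc : { amount_billed : 54.17, bill_count : 1 } },
    true
);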

Strict Structure

Databases typically require data to be rigidly structured. A table has a fixed set of columns. The datatypes, precision, and nullability can vary from column to column, but every row will have the same layout. Trying to store wildly diverse and variable data in a fixed storage medium is difficult.

Thankfully, document databases are well-suited to storing semi-structured data – since our data is a collection of attributes, it’s very easy to add or remove attributes and change querying strategies rapidly in response to changing data structures. Better yet, document databases let us be ignorant of how the data is stored. If we want to find all bills where the account holder’s last name is ‘Smith’ and they live in Virginia but the bill doesn’t have any county tax charges, it is very easy compared to constructing the query in a typical SQL database.

Using MongoDB our query might look like:

db.bills.find({ last_name : 'Smith',
                state : 'Virginia',
                'charges.type' : { $ne : 'county tax' } })

Compared to similar SQL:

SELECT  b.*
FROM    bills b
        JOIN accounts a ON b.account_id = a.id
        LEFT JOIN charges c ON b.id = c.bill_id
                               AND c.type = 'county tax'
WHERE   a.last_name = 'Smith'
        AND a.state = 'Virginia'
        AND c.id IS NULL

And right about now, every DBA that reads this blog is going to be shaking with rage and yelling “But that SQL is perfectly clear, I don’t know how you can expect me to understand all of those curly brackets!” I don’t expect you to understand those curly brackets. Nor do I expect developers to understand SQL. The easiest way for us to develop is to use our natural paradigm. That’s why developers write code in C#, PHP, or Ruby and DBAs do their work in some dialect of SQL. MongoDB alleviates this because all the developers are doing is constructing a list of keys and values that must be matched before a document can be returned.

Changing the Schema

Changing the schema of an OLTP database can be an onerous task. You have to wait for, or schedule, down time. Modifications have to take place. Of course, the schema modifications need to take into account any actions (like triggers or replication) that may occur in the background. This alone can require significant skill and internal database engine knowledge to write. It’s not something that application developers should be expected to know. Why do I mention application developers? 99 times out of 100, they’re the ones who are working on the database, not a dedicated DBA.

Many newer, non-traditional, databases make it incredibly easy to change the schema – just start writing the new attribute. The database itself takes care of the changes and will take that into account during querying. When a query is issued for a new attribute, records without that attribute will be ignored (just like a column with a NULL value in a traditional database).
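For example (the field name here is made up), adding an attribute is just another write, and documents that never got the attribute simply don’t match queries on it:

// Start writing the new attribute to one document...
db.bills.update({ account_number : 1234567 }, { $set : { autopay : true } });

// ...and query on it. Older documents without the field are ignored.
db.bills.find({ autopay : { $exists : true } });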

What about Analytics?

I don’t know a lot about analytical databases, in part because they require a different skill set than the one I’ve developed. I do know a few things about them, though.

Analytical databases are currently encumbered by some of the same problems as OLTP databases – data is stored in tables made up of rows and columns. Sure, these are called dimensions/facts and attributes, but the premise is the same – it’s a row-based data store.

Row-based data stores pose particular problems for analytic databases. Analytic databases throw most of the rules about normalization in the garbage and instead duplicate data willy nilly. Without joins, it’s very quick to query and aggregate data. But the problem still remains that there is a large quantity of repeated data being stored on disk.

Columnar databases attempt to solve this problem by compressing columns with similar values and using some kind of magical method to link up columnar values with their respective rows. Sounds complicated, right? Well, it probably is. Let’s say you have a table with 10,000,000,000 rows and the CalendarYear column is a CHAR(4). If there are only 25 different values for CalendarYear in the database, would you rather store 40,000,000,000 bytes of data or 100 bytes of distinct values (plus a compact mapping back to the rows)? I know which makes more sense to me.

Interestingly enough, there are two approaches being taken to solving this problem. The first is by creating single-purpose columnar databases. There are several vendors providing dedicated columnar databases. Other database developers are looking for ways to leverage their existing database engines and create hybrid row and columnar databases.

Looking Into the Future

There are a lot of interesting developments going on in the database world. Many of them seem to be happening outside of the big vendor, traditional database space. Most of this work is being done to solve a particular business need. These aren’t the traditional row-based OLTP systems that we’re all familiar with from the last 30 years of database development. These are new, special purpose, databases. It’s best to think of them like a sports car or even a race car – they get around the track very quickly, but they would be a poor choice for getting your groceries.

The next time you start a new project or plan a new server deployment, think about what functionality you need. Is it necessary to have full transactional support? Do you need a row-based store? How will you use the data?

T-SQL Tuesday – “Did he just say that?” edition

As in, I didn’t participate in the most recent T-SQL Tuesday about my favorite feature in SQL 2008 R2.

Want to know my favorite 2008R2 features? PostgreSQL 9.0 and MongoDB.

PostgreSQL and MongoDB are rapidly advancing and adding features that solve my daily problems. I’m an OLTP guy. Honestly, I don’t care about the latest reporting ding dongs and doo dads. I already know PowerShell. I manage 6 production servers and we’re unlikely to grow, so these MDM and Utility Control Points don’t make me giddy with excitement.

I solve problems using SQL.

You know what makes me happy? Support for window functions or better yet, improved support for window functions. What about exclusion constraints? Or column level triggers so I don’t have to use branching logic in my triggers? Yes, I use triggers. Or any other of these features.

What about MongoDB? I’ve just started playing with it, but it solves a lot of the problems I face at work. Not just in the day job, but in side projects as well. I’ve bitched about O/R-Ms before, but one of the biggest problems that users of O/R-Ms (developers) face is that the ideal way to model data for object-oriented programming bears no resemblance to the ideal way to store relational data. A recent article about scaling Farmville hints at this – the developers of Farmville managed scale by storing everything in a key-value store (memcached) before persisting to a relational data store later. Digg does something similar with Cassandra. It’s not like these guys are idiots, the blog posts from Digg show that they know their stuff.

MongoDB lets me solve these problems. I can control the frequency of syncs to disk (just as I can in PostgreSQL) to improve raw write performance to memory. I only have to worry about storing data the way my application expects to see the data – arrays and hashes – without worrying about building many-to-many join tables.

What about DMVs and data integrity and a write-ahead log and indexes? MongoDB has instrumentation and indexes. Yeah, you sacrifice some durability but many applications don’t need that. Hell, Amazon has even designed their systems to account for the potential of failure.

When I start my next application, I’m going to look long and hard at the platform I’m building on. There’s a good chance it’s not going to be a relational database and a really good chance it’s not going to be using SQL Server. It’s not because I have a problem with SQL Server or even RDBMSes, but because there are other choices that give me the flexibility I need to solve the problems I’m facing.

This is nothing against SQL Server 2008 R2, it’s a great step forward in the direction that SQL Server seems to be going. Sometimes I wonder if SQL Server and I are on the same road.
