Tag NoSQL

Building Our Own Federated Database

We’ve already talked about The Promise and Failure of Federated Databases and Why Don’t We Have Federated Databases. At the end of the second post I concluded that the only real way to solve this problem is to build the federated database ourselves. Before you ask, “Does he really want us to roll our own database” take a deep breath and relax; nobody is going to be writing a database.

What is a Federated Database?

When I was looking at the definition for a federated database, it dawned on me that a federated database doesn’t have to be an actual RDBMS, or any other type of existing DBMS. A federated database is a meta-database management system (or so Wikipedia claims). Looking at the other major explanation of a federated database we end up with “A federated database architecture is described in which a collection of independent database systems are united into a loosely coupled federation in order to share and exchange information.”

I’ve already talked about the implications of the first definition – it’s leading us down the path of a monolithic master server that must be aware of the other servers in the federation. New servers won’t be a part of the federation until we make the federating server aware of them. The other road, a loose collection of independent servers, is beginning to gain ground as companies bring more databases online in their data centers. When I say “more databases” I’m not just talking about a larger number of databases from a vendor, I’m also talking about databases from different vendors. Many people are exploring this route right now, some of them have attached the name of polyglot persistence to this approach.

Why Do I Want a Federated Database?

There are a couple of reasons that you’d want to roll your own federated database. I touched on them in the first part of this series: you may want to query across databases, you might have legacy systems, you may have merged with another company, or you might be using the most appropriate database for the job. Whatever you’re doing, you probably have a number of databases and you need to stitch them together.

Where Do I Get Started?

There are a number of ways we could go about creating a federated database. A lot of the ways to build a federated database solution are incredibly complex and involve creating meta-data databases as well as devising ways to link the databases together in an easily query-able way. I’m going to propose something different. Instead of designing something on your own, use the technology you already have and that your programming platform already comes with and understands: TCP/IP.

Nearly every programming language is capable of talking to other programs over TCP/IP. Instead of creating custom databases and worrying about meta-data management and cross server querying, create common services that answer common questions. Break your monolithic application down into manageable services and write those services using the most appropriate technology. Over my career, I’ve found that very few users need the ability to run ad hoc reports over the entire corporate data set. If users don’t need to be able to interactively query the entirety of their corporate data set, what do they need?

Almost all users need a small set of reports and data. Even when we expand the definition of “users” to include applications, services, APIs, and protocols, most activities are incredibly limited. Our users are asking the same sets of questions: How many accounts receivable have aged more than 30 days? What do the sales figures for the New England region look like for the last three years? Even when users are adding data to our databases they’re still performing a limited set of actions like saving an entire order, signing up for a new account, or adding a new accordion to their shopping cart. The activities that users perform data are very limited.

Knowing that users only perform a few activities with our live data, we can safely make some assumptions about the type of data access people will need. Keeping that in mind, it’s a lot easier to see how we can build our own federated database: we’re not going to. We’re going to build our own system using what many people call polyglot persistence.

Designing for Polyglot Persistence

The idea behind polyglot persistence is that we keep our data in the best database for storing that particular kind of data. Achieving this goal is achievable, but if that were the end game, it wouldn’t be the most useful goal for the business – business users want to see reports and combine data across applications and business units.

Going one step beyond the basics of polyglot persistence, we want to add another layer – a caching/service layer. It’s in this layer that we can start to really add rich functionality to the data that the business needs. Instead of having to replicate data across multiple data sources, we can query two separate servers and combine the data together before we return it to the client.

We’ve been doing this for years – it’s nothing new. The only thing that is remotely new is storing our data in the most suitable database. Well, that and telling our caching/service layer to cache as much data as possible while writing in the background. If we keep most of our data in cache, we don’t have to worry as much about write performance in the back end. We can queue writes to make sure they commit during idle times, we can spread them across many servers, and we can write to many reporting databases at once to make sure that reports are up to date. By moving application and reporting logic into an application and reporting tier, we free the database to focus on the tasks that databases excel at: storing and retrieving data. Complex logic and strange data mucking can be handled in the application layer by simple (or highly specialized) algorithms.

Polyglot persistence becomes incredibly valuable when we build mechanisms to load data from all of our disparate line of business systems into a single enterprise data warehouse. Once we have all of our data in a single warehouse, we’re able to write queries across business boundaries. The enterprise warehouse doesn’t need to be in a single monolithic RDBMS server; it could use Microsoft SQL Server Parallel Data Warehouse, Oracle ExaData, Postges-XC, HBase, Cassandra, or any other database that is up to the task.

Wrapping it up

Polyglot persistence seems to be the best answer to building a federated database. It doesn’t provide any kind of automated meta-data management or support for distributing queries automatically across many servers. Instead, polyglot persistence makes it easier to build a robust system that answers the questions business users both want and need while remaining fast and flexible. Is it the be all end all solution? No. Is it a step in the right direction? Yes.

Introduction to Riak … TONIGHT!

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak tonight at 6:30PM!

There will be pizza and soda and Ruby and me. You can even stick around afterwards while we all go next door for drinks (you can buy my Diet Cokes all night if you really like the presentation).

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

Introduction to Riak – Next Monday

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak next Monday, February 21, at 6:30PM.

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

Introduction to Riak at Columbus Ruby Brigade

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak on February 21 at 6:30PM.

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

Querying Riak – Key Filters and MapReduce

A while back we talked about getting faster writes with Riak. Since then, I’ve been quiet on the Riak front. Let’s take a look at how we can get data out of Riak, especially since I went to great pains to throw all of that data into Riak as fast as my little laptop could manage.

Key filtering is a new feature in Riak that makes it much easier to restrict queries to a subset of the data. Prior to Riak 0.13, it was necessary to write MapReduce jobs that would scan through all of the keys in a bucket. The problem is that the MapReduce jobs end up loading both the key and the value into memory. If we have a lot of data, this can cause a huge performance hit. Instead of loading all of the data key filtering lets us look at the keys themselves. We’re pre-processing the data before we get to our actually query. This is good because 1) software should do as little as possible and 2) Riak doesn’t have secondary indexing to make querying faster.

Here’s how it works: Riak holds all keys in memory, but the data remains on disk. The key filtering code scans the keys in memory on the nodes in our cluster. If any keys match our criteria, Riak will pass them along to any map phases that are waiting down the pipe. I’ve written the sample code in Ruby but this functionality is available through any client.

The Code

We’re using data loaded with load_animal_data.rb. The test script itself can be found in mr_filter.rb. Once again, we’re using the taxoboxes data set.

The Results

             user     system      total        real
mr       0.060000   0.030000   0.090000 ( 20.580278)
filter   0.000000   0.000000   0.000000 (  0.797387)

MapReduce

First, the MapReduce query:

{"inputs":"animals",
 "query":[{"map":{"language":"javascript",
                  "keep":false,
                  "source":"function(o) { if (o.key.indexOf('spider') != -1) return [1]; else return []; }"}},
          {"reduce":{"language":"javascript",
                     "keep":true,
                     "name":"Riak.reduceSum"}}]}

We’re going to iterate over every key value pair in the animals bucket and look for a key that contains the word ‘spider’. Once we find that key, we’re going to return a single element array containing the number 1. Once the map phase is done, we use the built-in function Riak.reduceSum to give us a sum of the values from the previous map phase. We’re generating a count of the records that match our data – how many spiders do we really have?

Key Filtering

The key filtering query doesn’t look that much different:

{"inputs":{"bucket":"animals",
           "key_filters":[["matches","spider"]]},
 "query":[{"map":{"language":"javascript",
                  "keep":false,
                  "source":"function(o) { return [1]; }"}},
          {"reduce":{"language":"javascript",
                     "keep":true,
                     "name":"Riak.reduceSum"}}]}

It’s not that much different – the map query has been greatly simplified to just return [1] on success and the search criteria has been moved into the inputs portion of the query. The big difference is in the performance: the key filter query is 26 times faster.

This is a simple example, but a 26x improvement is nothing to scoff at. What it really means is that the rest of our MapReduce needs to work on a smaller subset of the data which, ultimately, makes things faster for us.

A Different Way to Model Data

Now that we have our querying basics out of the way, let’s look at this problem from a different perspective; let’s say we’re tracking stock performance over time. In a relational database we might have a number of tables, notably a table to track stocks and a table to track daily_trade_volume. Theoretically, we could do the same thing in Riak with some success, but it would incur a lot of overhead. Instead we can use a natural key to locate our data. Depending on how we want to store the data, this could look something like YYYY-MM-DD-ticker_symbol. I’ve created a script to load data from stock exchange data. For my tests, I only loaded the data for stocks that began with Q. There’s an a lot of data in this data set, so I kept things to a minimum in order to make this quick.

Since our data also contains the stock exchange identifier, we could even go one step further and include the exchange in our key. That would be helpful if we were querying based on the exchange.

If you take a look at [mr_stocks.rb][8] you’ll see that we’re setting up a query to filter stocks by the symbol QTM and then aggregate the total trade volume by month. The map phase creates a single cell array with the stock volume traded in the month and returns it. We use the Riak.mapValuesJson function to map the raw data coming in from Riak to a proper JavaScript object. We then get the month that we’re looking at by parsing the key. This is easy enough to do because we have a well-defined key format.

function(o, keyData, arg) {
  var data = Riak.mapValuesJson(o)[0];
  var month = o.key.split('-').slice(0,2).join('-');
  var obj = {};
  obj[month] = data.stock_volume;
  return [ obj ];
}

If we were to look at this output we would see a lot of rows of unaggregated data. While that is interesting, we want to look at trending for stock trades for QTM over all time. To do this we create a reduce function that will sum up the output of the map function. This is some pretty self explanatory JavaScript:

function(values, arg) {
  return [ values.reduce(function(acc, item) {
               for (var month in item) {
                 if (acc[month]) { acc[month] += parseInt(item[month]); }
                 else { acc[month] = parseInt(item[month]); }
               }

               return acc;
             })
  ]
}

Okay, so that might not actually be as self-explanatory as anyone would like. The JavaScript reduce method is a newer one. It will accumulate a single result (the acc variable) for all elements in the array. You could use this to get a sum, an average, or whatever you want.

One other thing to note is that we use parseInt. We probably don’t have to use it, but it’s a good idea. Why? Riak is not aware of our data structures. We just store arrays of bytes in Riak – it could be a picture, it could be text, it could be a gzipped file – Riak doesn’t care. JavaScript only knows that it’s a string. So, when we want to do mathematical operations on our data, it’s probably wise to use parseInt and parseFloat.

Where to Now?

Right now you probably have a lot of data loaded. You have a couple of options. There are two scripts on github to remove the stock data and the animal data from your Riak cluster. That’s a pretty boring option. What can you learn from deleting your data and shutting down your Riak cluster? Not a whole lot.

You should open up mr_stocks.rb and take a look at how it works. It should be pretty easy to modify the map and reduce functions to output total trade volume for the month, average volume per day, and average price per day. Give it a shot and see what you come up with.

If you have questions or run into problems, you can hit up the comments, the Riak Developer Mailing List, or hit up the #riak IRC room on irc.freenode.net if you need immediate, real time help with your problem.

Data Durability

A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods of writing data to a replay log on disk as well as caching writes in memory. Whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write ahead log is that we can always recover from a crash (so long as we have the log files, of course).

How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code will have to call the method to force the sync.

Setting a w-value for writes is something that was mentioned in Getting Faster Writes with Riak. Although, in that article we were decreasing durability to increase write performance. In Amazon Dynamo inspired systems writes are not considered complete until multiple clients have responded. The advantage is that durable replication is enforced at the database and clients have to elect to use less security for the data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time read/write random-access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – there’s a commit log (RDBMS people will know this as a write-ahead log). Essentially, when a write comes in from a client, the write is first written to the commit log (which is stored using HDFS), then it’s written in memory and when the in-memory structure fills up, that structure is flushed to the filesystem. Here’s something cunning: since the commit log is being written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down it’s easy enough to recover from – that region server’s commit log is split apart and distributed to other region servers which then take up the load of the failed region server.

There are plenty of HBase details that have been grossly oversimplified or blatantly ignored here for the sake of brevity. Additional details can be found in HBase Architecture 101 – Storage as well as this Advanced HBase presentation. As HBase is inspired by Google’s big table, additional information can be found in Chang et al. Bigtable: A distributed storage system for structured data and The Google File System.

Interestingly enough, there is a proposed feature for PostgreSQL 9.1 to add synchronous replication to PostgreSQL. Current replication in PostgreSQL is more like asynchronous database mirroring in SQL Server, or the default replica set write scenario with MongoDB. Synchronous replication makes it possible to ensure that data is being written to every node in the RDBMS cluster. Robert Haas discusses some of the pros and cons of replication in PostgreSQL in his post What Kind of Replication Do You Need?.

Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability is baked into Azure at the filesystem. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team have put together an excellent overview. Needless to say, Microsoft have implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow for data to be seamlessly geographically replicated.

Even SQL Azure, Microsoft’s cloud based relational database, takes advantage of this replication. In SQL Azure when a row is written in the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team are addressing single server durability concerns.

Why is single server durability important for any database? Aside from guaranteeing that data is correct in the instance of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.

References

Goals for 2011 – Early Update

It’s a bit early to be updating my goals for 2011, but I’m really excited about this one. Over the course of last week, I wrote an article about loading data into Riak. I had a brief conversation with Mark Phillips (blog | twitter) about adding some of the code to the Riak function contrib.

This is where a sane person would say “Yeah, sure Mark, do whatever you want with my code.” Instead I said something like “I’d be happy to share. How about I make a generic tool?” About 40 minutes later I had a working chunk of code. 30 minutes after that I had refactored the code into a driver and a library. I wrote up some documentation and sent everything off to be included in the main Riak function contrib repository. A couple of days and a documentation correction later and you can now see my first code contribution to the open source world on the internet: Importing YAML.

While I’m really excited about this, and it’s very important to me, there’s more to take away from this than just “Yay, I did something!” We’re all able to give something back to our community. In this case I took code I had written to perform benchmarks and extracted a useful piece of demonstration code from it. Share your knowledge with the world around you – it’s how we get smarter.

Getting Faster Writes with Riak

While preparing for an upcoming presentation about Riak for the Columbus Ruby Brigade, I wrote a simple data loader. When I initially ran the load, it took about 4 minutes to load the data on the worst run. When you’re waiting to test your data load and write a presentation, 4 minutes is an eternity. Needless to say, I got frustrated pretty quickly with the speed of my data loader, so I hit up the Riak channel on IRC and started digging into the Ruby driver’s source code.

The Results

              user     system      total        real
defaults 63.660000   3.270000  66.930000 (166.475535)
dw => 1  50.940000   2.720000  53.660000 (128.470094)
dw => 0  52.350000   2.740000  55.090000 (120.151827)
n => 2   52.850000   2.790000  55.640000 (132.023310)

The Defaults

Our default load uses no customizations. Riak is going to write data to three nodes in the cluster (n = 3). Since we’re using the default configuration, we can safely assume that Riak will use quorum for write confirmation (w = n/2 + 1). Finally, we can also assume that the durable write value is to use a quorum, since that’s the default for riak-client.

Because we’re writing to n (3) nodes and we’re waiting for w (2) nodes to respond, writes were slower than I’d like. Thankfully, Riak makes it easy to tune how it will respond to writes.

Changing the N Value

The first change that we can do is change the N value (replication factor). The N value should have a huge improvement for my test machine – Riak is only on one of my hard drives. Even solid state drives can only write to one place at a time. When we create the bucket we can change the bucket’s properties and set the N value. note It’s important that you set bucket properties when you ‘create’ the bucket. Buckets are created when keys are added to them and they are deleted when the last key is deleted.

b1 = client.bucket('animals_dw1',
                   :keys => false)
b1.props = { :n_val => 1, :dw => 1 }

In this chunk of code we set the N value to 1 and set the durable writes to 1. This means that only 1 replica will have to commit the record to durable storage in order for the write to be considered a success.

On the bright side, this approach is considerably faster. Here’s the bummer: by setting the N value to 1, we’ve removed any hope of durability from our cluster – the data will never be replicated. Any server failure will result in data loss. For our testing purposes, it’s okay because we’re trying to see how fast we can make things, not how safe we can make them.

How much faster? Our run with all defaults enabled took 166 seconds. Only writing to 1 replica shaved 38 seconds off of our write time. The other thing that I changed was setting returnbody to false. By default, the Ruby Riak client will return the object that was saved. Turning this setting off should make things faster – less bytes are flying around the network.

Forget About Durability

What happens when we turn down durability? That’s the dw => 0 result in the table at the beginning of the article. We get an 8 second performance boost over our last load.

What did we change? We set both the dw and w parameters to 0. This means that our client has told Riak that we’re not going to wait for a response from any replicas before decided that a write has succeeded. This is a fire and forget write – we’re passing data as quickly as possible to the client and to hell with the consequences.

So, by eliminating any redundancy, ignoring the current record from the database, and refusing to acknowledge any reads from the server, we’re able to get a 46.3 second performance improvement over our default values. This is impressive, but it’s roughly akin to throwing our data at a bucket, not into the bucket.

What if I Care About My Data?

What if you care about your data? After all, we got our performance improvement from setting the number of replicas to 1 and turning off write acknowledgement. The fourth, and final run, that I performed took a look at what would happen if we kept the number of replicas at a quorum (an N value of 2) and ignored write responses. If we’re just streaming data into a database, we may not care if a record gets missed here and there. It turns out that this is only slightly slower than running with scissors. It takes 132 seconds to write the data; only 4 seconds slower than with durable writes set to 1 and still nearly 34.5 seconds faster than using the defaults.

The most recent version of this sample code can be found on github at https://github.com/peschkaj/riak_intro. Sample data was located through Infochimps.com. The data load uses the Taxobox data set.

A Technical Plan for 2011

The Last Ten Years

My career has been particularly interesting. I’ve been very fortunate to work with a variety of different languages, platforms, databases, frameworks, and people. I started off working with Perl on HP-UX. As I started automating more of my job, I added ASP.NET to the mix. Eventually I learned about databases, first with Oracle, then SQL Server, then with PostgreSQL, and finally back to SQL Server. Along the way I’ve held job a variety of different job titles – system administrator, system engineer, developer, consultant, architect, and database administrator.

I’ve worked with a lot of different systems, architectures, and design philosophies. The one thing that’s stuck with me is that there is no one size fits all answer. That extends beyond languages and design patterns – it goes right down to the way we’re storing data. One of the most interesting things going on right now is that it’s easier than ever to pick the right tools for the job.

The Next Twelve Months

Over the next twelves months, I’m going to be digging into hybrid database solutions. Some people call it polyglot persistence. You can call it what you like, but the fact remains that it is no longer necessary to store all of our data in a relational database. Frankly, I’m encouraging people to look into ways to store their data outside of a relational database. Not because RDBMSes are bad or wrong, but because there is a lot of data that doesn’t need to be in a relational database – session data, cached data, and flexible data.

Why Focus on Hybrid Data?

The idea behind hybrid data is that we use multiple databases instead of one database. Let’s say that we have an online store where we sell musical equipment. We want to store customer data in a relational database, but where should we store the rest of our information? Conventional thinking says that we should keep storing our data in a relational database. Sessions might be stored in memory somewhere and shopping carts might get stored in the database, but they’ll end up on faster, more expensive, solid state drives.

There are other ways to store data.

Let’s think about all of this for a minute. Why do we force our databases into existing paradigms? Why aren’t we thinking about new and interesting ways to store our data?

Sessions are a great place to start. Sure, we could use something like memcached, but why not examine Redis or App Fabric Cache? Both of these databases have support for strongly typed data. They both allow the data to be persisted to disk, if needed, and they allow for data to be expired over time. This is perfect for working with any kind of cached data – it stays in a format our applications need but we can expire it or save it as needed.

The flexibility to store our data the way that applications use it is important. Session data should be rapidly accessible. Other applications don’t need to read it. It doesn’t need to be reportable. It merely needs to be fast.

Shopping carts are different. Amazon’s own use cases and needs drove the development of Dynamo to be a durable, eventually consistent, distributed key-value store. Shopping carts are write heavy environments. It’s rare that users need to view everything that’s in a shopping car, but they need to be able to review it quickly when the time comes. Likewise, when the time comes to review a shopping cart, any delay or slowdown means there’s a chance the user will simply abandon the cart. Dynamo fills these requirements quite well.

Since Dynamo is only available inside of Amazon, how are we supposed to work with it ourselves? Riak is a clone of Dynamo that meets our need for a shopping cart. It’s a key/value database; it’s fault tolerant, and it’s fast.

Why not store a shopping cart in a relational database? It is, after all, a pretty simple collection of a user identifier, an item number, an item description, price, and quantity. Shopping carts are highly transient. Once an order has been placed, the shopping cart is cleared out and the data in the cart moves into the ordering system. Most shopping carts will be active for a very short period of time – a matter of minutes at most. Over their short lives shopping carts will almost entirely be written to and only read a few times. Instead of building complex sharding mechanisms to spread load out across a number of database servers, why not use a database designed to handle large load spread across a number of servers?

Where Does This Fit Into the Enterprise?

Enterprises should be adopting these technologies as fast as they can. Not because they are replacing the relational database, but because they free the relation database from things it’s bad at and leave it to perform tasks that it excels at. Relational databases are great for core business logic – they have a lot of baked in functionality like data integrity and validation. As we’ve already discussed, relational databases are not well suited to storing highly volatile data.

By moving volatile data into better suited types of database, enterprises can increase the capacity of their database systems, provide redundancy, and increase scalability by using off the shelf solutions. The trick, of course, lies in integrating them. And that is what I’m going to be playing around with this year.

Three Mistakes I Made With MongoDB

When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I started progressing through working with MongoDB, I started running into more and more problems. They weren’t problems with MongoDB, they were problems with the way I was thinking about MongoDB.

I Know How Stuff Works

The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it’s a good thing. It gets tricky when you start to switch around your frame of reference.

As a former DBA and alleged database expert, I’m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it’s a heck of a lot like an RDBMS (sorry guys, that’s just the truth of it). It has collections, that look and act a lot like tables. There are indexes, there’s a query engine, there’s even replication. There are even more features and functionality that map really well between MongoDB and relational databases. It’s close enough that it’s painless to make a switch. The paradigms don’t match up exactly, but there are similarities.

Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy was I ever wrong. I’ve had to stop thinking that I know anything about MongoDB and look up the answers to questions, rather than assume I know anything. It’s been frustrating, but it’s also been educational.

I Know How Data is Modeled

It’s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers. Once you get used to normalization, it’s easy to fall into that pattern.

Document database, like MongoDB, don’t work to their full potential when you normalize your data. Document databases, in my experience, do the opposite; they work best when the data is stored as a document. Document data is similar to the way data is used in the application child data is stored with a parent. An order’s line items are just a child collection of the order, there is no OrderHeader and OrderDetails table. Years of working within the same set of rules made it easy to slip into old habits.

Let’s say I have a number of user created documents in a Documents collection. Documents have an author as well as a set of tags created by the author. In a relational database, we’d have something like:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(30) NOT NULL,
    password VARCHAR(50) NOT NULL
);

CREATE TABLE documents (
    document_id INT PRIMARY KEY,
    title VARCHAR(30) NOT NULL,
    body VARCHAR(MAX) NOT NULL,
    user_id INT NOT NULL REFERENCES users(id)
);

CREATE TABLE tags (
   tag_id INT PRIMARY KEY,
   name VARCHAR(30) NOT NULL
);

CREATE TABLE document_tags (
   document_id INT NOT NULL,
   tag_id INT NOT NULL
);

For a lot of things, this makes perfect sense. With MongoDB, we’d have a documents collection and a users collection. We might have a separate tags collection as well, but that would be used as an inverted index for searching. The users collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn’t store a pointer to the appropriate record in the users collection. The appropriate thing to do would be to cache the data locally as well as store a pointer to the users collection. The same hold true for the document’s tags – why create a join construct between the two collections when we can store all of the tag data we need in the appropriate document? If we decide that we need to find documents by their tags we have two choices:

  1. Create an index on the document.tags property
  2. Create a tags inverted index.

If you’re interested in indexes, the MongoDB documentation on the subject is a great place to start. There is a specific kind of index called a multikey that allows you to index arrays of values. Inverted indexes are interested and I covered them while talking about building secondary indexes in Riak.

With a document database we’re trying to minimize the number of reads that we’re performing at any given time. A document should be a logical construct of whatever application entity we’re saving. A document would be a record of the document at the time it was saved – there will be cached information from the user, the document and its associated metadata, and a list of tags.

Of course, since we’re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don’t like it, feel free to comment about it.

I Can Just Drop This In ###

Despite appearances and claims to the contrary, MongoDB is not a drop in replacement for an RDBMS. A relational database provides a phenomenal number of features for free – indexing, declarative referential integrity, transaction support, multi-version concurrency control, and multi-statement transactions to name a few. These features come with a price. Likewise, MongoDB provides a different set of features and they also have a price.

Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used MongoMapper to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easily. Using an O/R-M made it feel like I was using a relational database. It made things trickier, especially when I ran into situations where I was running into some of the blurry places I’ve already talked about. In hindsight, I should have used the stock drivers, built my own abstractions, and the replaced that when it became necessary. I don’t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself for the first time and then replace it when you’re building more plumbing than functionality.

Would I Use MongoDB Again?

Sure, if the project needed it. I don’t believe that MongoDB is a drop in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that was does a disservice to both MongoDB and the RDBMS: they both have their own strengths and weaknesses.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.