Three Mistakes I Made With MongoDB

January 06, 2011

When I first started working with MongoDB, it was very easy to get going. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I kept working with MongoDB, I ran into more and more problems. They weren’t problems with MongoDB; they were problems with the way I was thinking about MongoDB.

I Know How Stuff Works

The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it’s a good thing. It gets tricky when you start to switch around your frame of reference.

As a former DBA and alleged database expert, I’m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it’s a heck of a lot like an RDBMS (sorry guys, that’s just the truth of it). It has collections that look and act a lot like tables. There are indexes, there’s a query engine, there’s even replication. Plenty of other features map well between MongoDB and relational databases. It’s close enough that it’s painless to make a switch.

The paradigms don’t match up exactly, but there are similarities. Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy, was I ever wrong. I’ve had to stop thinking that I know anything about MongoDB and look up the answers to questions rather than assume. It’s been frustrating, but it’s also been educational.

I Know How Data is Modeled

It’s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers.

Once you get used to normalization, it’s easy to fall into that pattern. Document databases, like MongoDB, don’t work to their full potential when you normalize your data. In my experience they do the opposite: they work best when the data is stored as a document. Document data is similar to the way data is used in the application: child data is stored with its parent. An order’s line items are just a child collection of the order; there is no OrderHeader and OrderDetails table. Years of working within the same set of rules made it easy to slip into old habits.

Let’s say I have a number of user-created documents in a Documents collection. Documents have an author as well as a set of tags created by the author. In a relational database, we’d have something like:

```
CREATE TABLE users (
    user_id  INT PRIMARY KEY,
    username VARCHAR(30) NOT NULL,
    password VARCHAR(50) NOT NULL
);

CREATE TABLE documents (
    document_id INT PRIMARY KEY,
    title       VARCHAR(30) NOT NULL,
    body        VARCHAR(MAX) NOT NULL,
    -- each document points at its author
    user_id     INT NOT NULL REFERENCES users(user_id)
);

CREATE TABLE tags (
    tag_id INT PRIMARY KEY,
    name   VARCHAR(30) NOT NULL
);

-- join table: one row per (document, tag) pair
CREATE TABLE document_tags (
    document_id INT NOT NULL,
    tag_id      INT NOT NULL
);
```

For a lot of things, this makes perfect sense. With MongoDB, we’d have a documents collection and a users collection. We might have a separate tags collection as well, but that would be used as an inverted index for searching. The users collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn’t store only a pointer to the appropriate record in the users collection. The appropriate thing to do would be to cache the author data locally in the document as well as store a pointer to the users collection. The same holds true for the document’s tags: why create a join construct between the two collections when we can store all of the tag data we need in the appropriate document? If we decide that we need to find documents by their tags, we have two choices:

Create an index on the document.tags property
Create a tags inverted index.

If you’re interested in indexes, the MongoDB documentation on the subject is a great place to start. There is a specific kind of index called a multikey index that allows you to index arrays of values. Inverted indexes are interesting, and I covered them while talking about building secondary indexes in Riak.
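To make the two choices a little more concrete, here is a minimal sketch in plain Python, with dicts standing in for MongoDB documents so it runs without a server. The field names (`author`, `tags`) and both helper functions are assumptions made for illustration, not anything MongoDB prescribes. The first function looks inside each document’s embedded tags array, which is roughly the lookup a multikey index on that field would speed up. The second builds a separate inverted index from tag to document ids.

```python
from collections import defaultdict

# Plain dicts standing in for documents in a `documents` collection.
# All field names here are illustrative assumptions.
documents = [
    {"_id": 1, "title": "Secondary indexes in Riak",
     "author": {"user_id": 10, "username": "alice"},
     "tags": ["riak", "nosql"]},
    {"_id": 2, "title": "Three mistakes with MongoDB",
     "author": {"user_id": 11, "username": "bob"},
     "tags": ["mongodb", "nosql"]},
    {"_id": 3, "title": "Tuning SQL",
     "author": {"user_id": 10, "username": "alice"},
     "tags": ["sql"]},
]


def find_by_tag_scan(docs, tag):
    """Choice 1: match against the tags array embedded in each document."""
    return [d["_id"] for d in docs if tag in d.get("tags", [])]


def build_tag_index(docs):
    """Choice 2: build a separate inverted index of tag -> document ids."""
    index = defaultdict(list)
    for d in docs:
        for tag in d.get("tags", []):
            index[tag].append(d["_id"])
    return dict(index)


tag_index = build_tag_index(documents)

print(find_by_tag_scan(documents, "nosql"))  # [1, 2]
print(tag_index.get("nosql", []))            # [1, 2]
```

Both approaches return the same ids. The difference is where the work happens: the first keeps everything in the document and leans on the database to index the array, while the second is extra data that the application has to keep in sync on every save.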

With a document database we’re trying to minimize the number of reads that we’re performing at any given time. A document should be a logical construct of whatever application entity we’re saving. A document would be a record of the document at the time it was saved – there will be cached information from the user, the document and its associated metadata, and a list of tags. Of course, since we’re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don’t like it, feel free to comment about it.
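As a rough illustration of a document that records its state at save time, here is a small Python sketch. It uses plain dicts rather than a real driver, and the `build_document` helper, the `saved_at` field, and the other field names are all hypothetical. The point is only that the author data is copied into the document alongside the pointer back to the users collection.

```python
from datetime import datetime, timezone


def build_document(user, title, body, tags):
    """Return a self-contained document that caches author data at save time."""
    return {
        "title": title,
        "body": body,
        "tags": list(tags),  # tags live in the document, no join table
        "author": {
            "user_id": user["_id"],        # pointer back to the users collection
            "username": user["username"],  # cached copy, so reads need no second lookup
        },
        "saved_at": datetime.now(timezone.utc),
    }


user = {"_id": 10, "username": "alice", "password": "kept-only-in-users"}
doc = build_document(user, "Three mistakes", "Body text", ["mongodb", "nosql"])

# A later change to the user record does not alter the saved snapshot.
user["username"] = "alice_renamed"
print(doc["author"]["username"])  # alice
```

The tradeoff is the usual one for denormalized data: displaying a document takes a single read, but if the cached username has to follow renames, something in the application has to go back and update the copies.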

I Can Just Drop This In

Despite appearances and claims to the contrary, MongoDB is not a drop-in replacement for an RDBMS. A relational database provides a phenomenal number of features for free: indexing, declarative referential integrity, multi-version concurrency control, and multi-statement transactions, to name a few. These features come with a price. Likewise, MongoDB provides a different set of features and they also have a price.

Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used MongoMapper to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easy. Using an O/R-M made it feel like I was using a relational database. That made things trickier, especially when I ran into some of the blurry places I’ve already talked about.

In hindsight, I should have used the stock drivers, built my own abstractions, and then replaced them when it became necessary. I don’t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself the first time and then replace it when you’re building more plumbing than functionality.
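For a sense of what "my own abstractions" over a stock driver might look like, here is a small sketch. It is in Python rather than the Ruby I was actually using, and `DocumentStore` and `InMemoryCollection` are hypothetical names invented for the example. The store only assumes a collection object with `insert_one(document)` and `find(filter)` methods, which is the shape PyMongo’s collections expose; the in-memory stand-in exists purely so the example runs without a MongoDB server, and it imitates only the simplest equality filters.

```python
from typing import Any, Iterable, Protocol


class Collection(Protocol):
    """The small slice of a driver collection that this sketch relies on."""

    def insert_one(self, document: dict) -> Any: ...

    def find(self, filter: dict) -> Iterable[dict]: ...


class DocumentStore:
    """Hypothetical hand-rolled layer: application verbs over a raw collection."""

    def __init__(self, collection: Collection) -> None:
        self._collection = collection

    def save(self, title: str, body: str, tags: Iterable[str]) -> dict:
        doc = {"title": title, "body": body, "tags": list(tags)}
        self._collection.insert_one(doc)
        return doc

    def titles_tagged(self, tag: str) -> list:
        # Filtering an array field by a single value matches documents
        # whose array contains that value.
        return [d["title"] for d in self._collection.find({"tags": tag})]


class InMemoryCollection:
    """Stand-in for a real collection; supports equality filters only."""

    def __init__(self) -> None:
        self._docs = []

    def insert_one(self, document: dict) -> None:
        self._docs.append(document)

    def find(self, filter: dict) -> list:
        def matches(doc: dict) -> bool:
            for key, wanted in filter.items():
                value = doc.get(key)
                if isinstance(value, list):
                    if wanted not in value:
                        return False
                elif value != wanted:
                    return False
            return True

        return [d for d in self._docs if matches(d)]


store = DocumentStore(InMemoryCollection())
store.save("Three mistakes with MongoDB", "Body text", ["mongodb", "nosql"])
store.save("Tuning SQL", "Body text", ["sql"])

print(store.titles_tagged("nosql"))  # ['Three mistakes with MongoDB']
```

A layer this thin keeps the document shape visible in the application code, which is exactly what the O/R-M was hiding from me. When it starts growing into plumbing, that is the signal to swap it for a library.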

Would I Use MongoDB Again?

Sure, if the project needed it. I don’t believe that MongoDB is a drop-in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that way does a disservice to both: they each have their own strengths and weaknesses.