Foursquare and MongoDB: What If

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they’re focusing on the technical problems with the current setup at Foursquare. They’re picking it apart and looking at what went right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak

Riak is a massively scalable key/value data store. It’s based on Amazon’s Dynamo. If you don’t want to read the paper, that just means it uses some magic to make sure that data is evenly spread throughout the database cluster and that adding a new node makes it very easy to rebalance the cluster.

What would have happened at Foursquare if they had been using Riak?

Riak still suffers from the same performance characteristics around disk access as MongoDB – once you have to page to disk, operations become slow and throughput dries up. This is a common problem in any software that needs to access disks – disks are slow, RAM is fast.

Riak, however, has an interesting distinction. It allocates keys inside the cluster using something called a consistent hash ring – this is just a convenient way to rapidly allocate ranges of keys to different nodes within a cluster. The ring itself isn’t interesting. What’s exciting is that the ring is divided into partitions (64 by default). Every node in the system claims an equal share of those 64 partitions. In addition, each partition is replicated so there are always three copies of a given key/value pair at any time. Because there are multiple copies of the data, the loss of any single node is unlikely to cause data loss or leave the cluster badly unbalanced. In theory, if Foursquare had used Riak, it is very unlikely that they would have run into a problem where a single node becomes full.

How would this consistent hash ring design choice have helped? Adding a new node causes the cluster to redistribute data in the background. The new node claims an equal amount of space in the cluster and data is redistributed from the other nodes in the background. The same thing happens when a node fails, by the way. Riak also only stores keys in memory, not all of the data, so it’s possible to reference an astronomical amount of data before running out of RAM.
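
To make the ring a little more concrete, here’s a rough sketch in Python. It is not Riak’s actual claim algorithm – the node names, the round-robin claim, and the rebalancing logic are all invented for illustration – but it shows how 64 partitions, three replicas, and an equal claim per node fit together, and how little data has to move when a fourth node joins.

```python
# Illustrative sketch only -- not Riak's implementation.
import hashlib

NUM_PARTITIONS = 64   # the ring is carved into 64 partitions by default
N_REPLICAS = 3        # each key/value pair lives on three distinct nodes
RING_TOP = 2 ** 160   # keys hash into a 160-bit space

def partition_for(bucket, key):
    """Hash the bucket/key pair onto the ring and map it to a partition."""
    digest = int(hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest(), 16)
    return digest * NUM_PARTITIONS // RING_TOP

def initial_claim(nodes):
    """Hand partitions out round-robin so every node owns an equal share."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def replicas(claims, partition):
    """Walk clockwise around the ring collecting three distinct owners."""
    owners, step = [], 0
    while len(owners) < N_REPLICAS:
        owner = claims[(partition + step) % NUM_PARTITIONS]
        if owner not in owners:
            owners.append(owner)
        step += 1
    return owners

def add_node(claims, new_node):
    """A joining node claims an equal share of the ring; only the partitions
    it takes over have to be copied across the network."""
    node_count = len(set(claims.values())) + 1
    share = NUM_PARTITIONS // node_count
    rebalanced = dict(claims)
    for p in list(range(0, NUM_PARTITIONS, node_count))[:share]:
        rebalanced[p] = new_node
    return rebalanced

claims = initial_claim(["riak1", "riak2", "riak3"])
p = partition_for("checkins", "user:1138")
print("partition", p, "is replicated on", replicas(claims, p))

after = add_node(claims, "riak4")
moved = sum(1 for part in claims if claims[part] != after[part])
print(moved, "of", NUM_PARTITIONS, "partitions move when riak4 joins")
```

Under these assumptions, only a quarter of the partitions change hands when the fourth node joins – the rest of the data never moves.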

There’s no need to worry about replica sets, re-sharding, or promoting a new node to master. Once a node joins a Riak cluster, it takes over its share of the load. As long as you have the network throughput on your local network (which you probably do), this operation can be fairly quick and painless.

Cassandra

Cassandra, like Riak, is based on Amazon’s Dynamo. It’s a massively distributed data store. I suspect that if Foursquare had used Cassandra, they would have run into similar problems.

Cassandra makes use of range partitioning to distribute data within the cluster. The Foursquare database was keyed off of the user name, and some ranges of the key space saw abnormal growth because some groups of users were more active than others. Names also tend to clump around certain letters, especially when you’re limited to a Latin character set. I know a lot more Daves than I know Zachariahs, and I have a lot of friends whose names start with the letter L. This kind of distribution causes data within the cluster to be overly allocated to one node. This would, ultimately, lead to the same problems that happened at Foursquare with their MongoDB installation.
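
Here’s a small, purely illustrative Python sketch of that skew. The token ranges and the sample usernames are made up; the point is only that keys ordered by name pile up on some nodes.

```python
# Illustrative only -- not Cassandra code. Token ranges and names are invented.
from collections import Counter

# Three nodes, each responsible for a contiguous range of row keys (usernames).
token_ranges = {
    "cass1": ("a", "i"),   # keys starting a..h
    "cass2": ("i", "r"),   # keys starting i..q
    "cass3": ("r", "{"),   # keys starting r..z ('{' follows 'z' in ASCII)
}

def node_for(row_key):
    """With an order-preserving partitioner, the key itself picks the node."""
    first = row_key[0].lower()
    for node, (lo, hi) in token_ranges.items():
        if lo <= first < hi:
            return node
    return "cass1"

usernames = ["dave", "daniel", "dominic", "dan", "diana",
             "laura", "lee", "liam", "lucy",
             "mark", "mike",
             "zachariah"]

print(Counter(node_for(name) for name in usernames))
# cass2 gets 6 rows, cass1 gets 5, cass3 gets 1 -- one node sits nearly idle
# while the others fill up, which is the imbalance described above.
```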

That being said, it’s possible to use a random partitioner for the data in Cassandra. The random partitioner makes it very easy to add additional nodes and distribute data across them, but it comes with a price: it makes it impossible to do quick range slice queries in Cassandra – you can no longer say “I want to see all of the data for January 3rd, 2010 through January 8th, 2010”. Instead, you would need to build custom indexes to support your querying and build batch processes to load the indexes. The tradeoffs between the random partitioner and the order-preserving partitioner are covered very well in Dominic Williams’s article Cassandra: RandomPartitioner vs OrderPreservingPartitioner.
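
And here’s a sketch of that tradeoff, again in plain Python rather than a real Cassandra client. The row layout and the record_checkin helper are hypothetical: hashing the row key spreads writes evenly, but the date-range question now has to be answered by an index your own code builds and maintains.

```python
# Illustrative only -- a real deployment would use a Cassandra client library.
import hashlib
from collections import defaultdict

def random_partition(row_key, num_nodes):
    """RandomPartitioner-style placement: the MD5 of the key, not the key
    itself, decides which node gets the row -- even spread, no ordering."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % num_nodes

# Hand-rolled "index": one row per day, whose columns are checkin ids.
checkins_by_day = defaultdict(list)

def record_checkin(checkin_id, user, day):
    # Write the checkin itself (placement comes from the hash)...
    node = random_partition(f"checkin:{checkin_id}", num_nodes=3)
    # ...and also append it to the day's index row, because a range slice over
    # hashed keys can no longer answer "all checkins from Jan 3 to Jan 8".
    checkins_by_day[day].append(checkin_id)
    return node

record_checkin("c1", "dave", "2010-01-03")
record_checkin("c2", "laura", "2010-01-05")
record_checkin("c3", "mark", "2010-01-09")

wanted_days = [f"2010-01-0{d}" for d in range(3, 9)]  # Jan 3 through Jan 8
results = [cid for day in wanted_days for cid in checkins_by_day[day]]
print(results)   # ['c1', 'c2'] -- c3 falls outside the requested range
```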

Careful use of the random partitioner and supporting batch operations could have prevented the outage that Foursquare saw, but this would have led to different design challenges, some of which may have been difficult to overcome without resorting to a great deal of custom code.

HBase/HDFS/Hadoop

HBase is a distributed, column-oriented database built on top of HDFS. It is based on Google’s Bigtable database as described in “Bigtable: A Distributed Storage System for Structured Data”. As an implementation of Bigtable, HBase has a number of advantages over MongoDB – write-ahead logging, rich ad hoc querying, and redundancy.

HBase is not going to suffer from the same node redistribution problems that caused the Foursquare outage. When it comes time to add a new node to the cluster, data will be migrated in much larger chunks, one data file at a time. This makes it much easier to add a new data node and redistribute data across the network.

Just like a relational database, HBase is designed so that not all data needs to reside in memory for good performance. The internal data structures are built in a way that makes it very easy to find data and return it to the client. Keys are held in memory, and two levels of caching make sure that frequently used data will be in memory when a client requests it.
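
A simplified model of that read path, in Python, looks something like this. It’s not HBase’s internals – the class and the dictionaries are stand-ins – but it shows why a read only pays for a disk access when both in-memory layers miss.

```python
# A toy model of the two-level caching described above; not HBase code.
class RegionSketch:
    def __init__(self, on_disk):
        self.memstore = {}        # recent writes, still in memory
        self.block_cache = {}     # recently read values, also in memory
        self.on_disk = on_disk    # stand-in for the data files on HDFS

    def put(self, key, value):
        self.memstore[key] = value        # buffered in RAM (and, in the real
                                          # system, logged -- see the WAL below)

    def get(self, key):
        if key in self.memstore:          # 1. freshest data, already in RAM
            return self.memstore[key]
        if key in self.block_cache:       # 2. recently read, already in RAM
            return self.block_cache[key]
        value = self.on_disk.get(key)     # 3. only now do we pay for a disk read
        if value is not None:
            self.block_cache[key] = value # keep it warm for the next reader
        return value

region = RegionSketch(on_disk={"user:42": "old checkin history"})
region.put("user:43", "new checkin")
print(region.get("user:43"))   # served from the memstore
print(region.get("user:42"))   # first read hits "disk"...
print(region.get("user:42"))   # ...the second is served from the block cache
```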

HBase also has the advantage of using a write-ahead log. In the event of a drive failure, it is possible to recover from a backup of your HBase database and play back the log to make sure data is correct and consistent.
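
Here’s a minimal write-ahead-log sketch – again illustrative Python, not HBase code, and the log file name is invented – showing why logging the edit before applying it makes replay-based recovery possible.

```python
# Minimal WAL sketch under stated assumptions; real HBase logs edits to HDFS.
import json, os

LOG_PATH = "wal.log"   # hypothetical log file for the example

def logged_put(store, key, value):
    # 1. Append the edit to the log and force it to stable storage first...
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # 2. ...then apply it to the in-memory store.
    store[key] = value

def recover(store):
    """Replay the log after a crash so the store is correct and consistent."""
    if not os.path.exists(LOG_PATH):
        return store
    with open(LOG_PATH) as log:
        for line in log:
            edit = json.loads(line)
            store[edit["key"]] = edit["value"]
    return store

live = {}
logged_put(live, "user:42", "checkin at the coffee shop")
rebuilt = recover({})          # rebuild state from the log alone
print(rebuilt["user:42"])      # the write survives the "crash"
```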

If all of this sounds like HBase is designed to be a replacement for an RDBMS, you would be close. HBase is a massively distributed database, just like Bigtable. As a result, data needs to be logged because there is a chance that a hardware node will fail and will need to be recovered. Because HBase is a column-oriented database, though, we need to be careful not to treat it exactly like a relational database.

A Relational Database

To be honest, Foursquare’s data load would be trivial in any relational database. SQL Server, Oracle, MySQL, and PostgreSQL can all handle orders of magnitude more data than the 132GB that Foursquare was storing at the time of the outage. This raises the question: how would we handle the constant write load? Foursquare is a write-intensive application.
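
For a sense of what that write path might look like in a relational database, here’s a sketch using SQLite (only because it ships with Python – the schema and index names are invented, and any of the databases above would do). A single checkins table, a couple of indexes, and batched inserts inside a transaction go a long way for a write-heavy workload.

```python
# Hypothetical schema; SQLite stands in for a full RDBMS for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE checkins (
        checkin_id  INTEGER PRIMARY KEY,
        user_name   TEXT NOT NULL,
        venue_id    INTEGER NOT NULL,
        checked_at  TEXT NOT NULL
    )
""")
# Indexes to serve the two obvious questions: a user's history and a venue's traffic.
conn.execute("CREATE INDEX ix_checkins_user  ON checkins (user_name, checked_at)")
conn.execute("CREATE INDEX ix_checkins_venue ON checkins (venue_id, checked_at)")

# Write-intensive workloads batch inserts into transactions so the database
# flushes its log once per batch instead of once per row.
rows = [(None, f"user{i % 1000}", i % 50, "2010-10-04T12:00:00") for i in range(10_000)]
with conn:
    conn.executemany("INSERT INTO checkins VALUES (?, ?, ?, ?)", rows)

(count,) = conn.execute("SELECT COUNT(*) FROM checkins").fetchone()
print(count, "checkins written")
```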

Typically, in the relational database world, when you need to scale read and write loads you add more disks. There is a finite amount of space in a server chassis, and local disks don’t provide the redundancy necessary for data security and performance; software RAID is also CPU intensive and slow. A better solution is to purchase a dedicated storage device – a SAN, NAS, or DAS. All of these devices offer read/write caching and can be configured with a variety of RAID levels for performance and redundancy.

RDBMSes are known quantities – they are easy to scale up to a certain point. Judging by the amount of data that Foursquare reports having, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time. The downside to this approach is that an RDBMS is much costlier per TB of storage (up to ten times more expensive) than MongoDB, but if your business is your data, then it’s important to keep the data safe.

Conclusion

It’s difficult to say if a different database solution would have prevented the Foursquare outage. But it is a good opportunity to highlight how different data storage systems would respond in the same situation.

Comments

  1. Thanks for the article. It’s really neat seeing the differences in all the flavors and choices we have.

  2. Nitin,

    Awesome write-up of a comparison from someone who has seen/experienced several NoSQL solutions.

  3. Daniel Einspanjer,

    Nice write up. One important thing that I’d note with both Riak and Cassandra is that having a node be down for an extended period of time can cause a build-up on other nodes due to the hinted-handoff mechanism. If unchecked, this could cause a cascading failure of the cluster.

  4. Pete Austin,

    When a RDBMS gets very large, performance is not limited by writes but by the work needed to defrag/rebuild indexes so they do not get too inefficient. This is a challenge if you cannot allow for any downtime. I think there will be similar considerations for the noSQL approaches.

    • Absolutely! RDBMSes do run into problems when you scale your indexes to large sizes. Most of my experience has been with SQL Server. SQL Server Enterprise Edition (which is a pricey monster) gives you the ability to perform index rebuilds online – the old index remains available for queries until the index is rebuilt. I would guess that Oracle has something similar.

      One of the biggest problems of scaling an RDBMS comes from the cost of scaling – licensing gets very expensive at scale.

  5. Mark,

    Hey Jeremiah

    Good write up. One thing I wanted to point out: “ring buffer” is both the wrong term and concept to use when talking about Riak’s architecture. The use of the consistent hash ring is how Riak distributes data and replicas around a cluster.

    See -> http://wiki.basho.com/display/RIAK/An+Introduction+to+Riak#AnIntroductiontoRiak
    and -> http://wiki.basho.com/display/RIAK/Riak+Glossary#RiakGlossary-ConsistentHashing

    Thanks!
    Mark

    • How right you are, thanks for the correction. I suspect I confused myself because of the diagram including a ring on the first link. I’ll fix that up in the main article.

  6. Sam Granger,

    What about the usage of Solid State drives & read/writing? Sure, it’s pricey but must make a huge difference!

    • SSDs make a huge difference and I think that in the long run SSDs will make having to go to disk a non-issue across the board. The upside, right now, for NoSQL databases is that they run really well on cheap commodity hardware. In some cases, that SSD might be 25% extra on top of the cost of your server.

      I added a 256GB SSD to my laptop and it’s made a huge difference to startup times, but the $500+ cost was a bit steep and would be for most (sane) people.

  7. Anonymouse,

    > Judging by the amount of data that Foursquare reports to have, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time.

    Citation needed. Show me a Web2.0 site (with multi-millions of writing users) that “uses a relational database”. They are hard to find.

    “But Facebook uses MySQL and E-Bay uses Oracle” you say. Sadly, no. They both shard massively. That means no Joins (table fragments on different servers), No Transactions, no ACID (for many multi-table updates), and even Eventual Consistency for some updates. They are NOT “using a database”. They are using a sharding layer over _many_ databases. This sharding layer is using a tiny subset of features of the underlying database. It would be easy to replace the database with something simpler.

    At least MongoDB tries to put sharding “in the database” instead of “over the database”.

  8. Dan Lynch,

    This is a little bit misleading as not all of those engines actually solve the problems. For example if you have to query checkins by user and location, you can’t do that in riak or cassandra…. So that would mean more application code, which could have bugs.

    • You should probably go let SimpleGeo know, since they’ve built exactly that on top of Cassandra.

    • While Cassandra and Riak don’t directly solve the problem, they indirectly solve it. With Cassandra or Riak we need to structure our data differently. Every guide I’ve read about querying with Cassandra has made the comment that we should store our data the way we want to query it: if we need to run 15 different queries, then we need to store our data in 15 different ways.

      With Riak we need to build our own indexes. If we want to query for checkins by location, we should build a locations index as a separate bucket and maintain that bucket with post-commit hooks.

      Every solution solves a different set of problems.

  9. What about Apache’s CouchDB? It is similar to MongoDB and Riak in ways, but has its own positives and negatives. As far as clustering is concerned, Cloudant’s BigCouch is really the best option there.

    How do you think CouchDB/BigCouch would fare in place of MongoDB at 4sq?

    • I wish I knew enough about CouchDB to answer that question. I suspect that CouchDB would have a similar set of issues to Riak, but that’s just a guess based on both CouchDB and Riak being key-value stores (which is almost certainly a wildly incorrect guess). A RESTful interface could prove to be very chatty and overwhelm your NIC (if you’ve done a REALLY bad job of writing your interfaces).

      In summation – I could only offer uneducated guesses about CouchDB :)

  10. Shimon Amit,

    The one certain thing which would have prevented the Foursquare incident is proactive monitoring. Any data store will eventually exhaust your resources unless preemptive measures are taken.

  11. Nice to see a writeup reminding everyone that relational DB’s actually still work!

    @Shimon – you are so right about proactive monitoring and with an application the size of FourSquare, you would think they would have a team devoted to such.

  12. There is a fundamental point that is not touched upon in the write-up.
    132 GB would not be an absolute number across database types. Had Foursquare used a relational database, I am sure this data would be much less. This would be normalized data, which is by design not redundant.

    When you choose a document store such as MongoDB, you denormalize extensively. With an RDBMS, writes are byte-poor and the storage is lean. With a document store, you try to save a lot of redundant data right into the document, in a non-relational way.
    In a basic sense, you have tweaked your system to be slow at write time (eventually consistent/deferred writes etc.) but blazing fast at read time.

    With RDBMS, you need to perform your computations (JOINS) at the time of reads. With a document store, you keep a document ready to serve at blazing speed at read time.

    • You’re correct, 132GB would not be an absolute number across all databases, but I suspect that the difference wouldn’t be much between each of the databases mentioned.

      And, yes, RDBMSes perform the data manipulation at read time, but they also hold data in memory very aggressively to alleviate the computational hit from performing joins and to avoid disk reads. It’s also very easy to denormalize in an RDBMS and, sometimes, that’s the most sane solution. A lot of RDBMSes have disk compression features which make it cheap (from a storage perspective) to denormalize your data when it’s stored in an RDBMS. Denormalization has a place, regardless of the underlying storage mechanism.

