Update: I want to thank Ben Black, Todd Lipcon, and Kelley Reynolds for pointing out the inaccuracies in the original post. I’ve gone through the references they provided and made corrections.
Facebook, the data giant of the internet, recently unveiled a new messaging system. To be fair, I would normally ignore feature rollouts and marketing flimflam, but Facebook’s announcement is worthy of attention not only because of the underlying database but also because of how the move was made.
Long ago, the engineers at Facebook developed a product named Cassandra to meet their needs for inbox search. It was a single-purpose application, and a lot of Facebook’s data continued to live in MySQL clusters. Cassandra solved the needs of the original Facebook Inbox search very well.
Building a New Product
Whenever we solve a new problem, it’s a good idea to investigate all of the options available before settling on a solution. By their own admission, the developers at Facebook evaluated several platforms (including Cassandra, MySQL, and Apache HBase) before deciding to use Apache HBase. What makes this decision interesting is not just the reasons that Apache HBase was chosen, but also the reasons that MySQL and Cassandra were not.
Why Not Use MySQL?
It’s not that MySQL can’t scale. MySQL clearly can – many high-traffic, high-volume sites and applications use it. The problem comes when you scale a relational database to a massive size – Facebook estimates that they will be storing well over 135 billion messages every month (15 billion person-to-person messages plus 120 billion chat messages). Even if these messages were limited in size to a single SMS (160 characters), that’s 21,600,000,000,000 bytes – about 21.6 terabytes – of data per month. Or, as we say where I come from, a buttload of data.
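The arithmetic is easy to check (assuming, as above, one byte per character and a 160-character SMS cap per message):

```python
# Sanity-checking the storage estimate above.
# Assumption: 160 bytes per message (one single-byte-encoded SMS).
person_to_person = 15_000_000_000    # person-to-person messages per month
chat_messages = 120_000_000_000      # chat messages per month
sms_bytes = 160

total_messages = person_to_person + chat_messages
bytes_per_month = total_messages * sms_bytes

print(f"{total_messages:,} messages")      # 135,000,000,000 messages
print(f"{bytes_per_month:,} bytes/month")  # 21,600,000,000,000 bytes/month
```

And that’s a floor, not a ceiling – real messages carry metadata, indexes, and replicas on top of the raw payload.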
When you start storing that amount of data, strange things start to happen: updates to indexes take a long time, and table statistics are refreshed so rarely that they go stale. With a use case like email and messaging, prolific users of the platform could cause massive data skew, which in turn would cause queries for other users to perform poorly.
Data sharding is a difficult problem to solve with an RDBMS at any scale, but at the massive scale Facebook is targeting, it becomes an incredibly difficult one. The sharding algorithm needs to be carefully chosen to ensure that data is spread evenly across all of the database servers in production, and it needs to be flexible enough that new servers can be added without disrupting the existing setup.
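One common way to get that kind of flexibility – and I’m sketching a generic technique here, not whatever Facebook actually runs – is consistent hashing, where adding a server only relocates the keys that fall between it and its neighbor on the hash ring:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes for even key spread."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []                     # sorted (hash, node) points
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=64):
        # Virtual nodes smooth out the key distribution across servers.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str):
        # A key belongs to the first ring point at or after its hash.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["db1", "db2", "db3"])
owner = ring.node_for("user:12345")         # deterministic placement
```

Adding a fourth server to this ring moves only roughly a quarter of the keys; a naive `hash(key) % server_count` scheme would move almost all of them.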
As Todd Lipcon points out in the comments, one of the issues with scaling an RDBMS is the write pattern of B+-tree indexes. Writing data to a B+-tree index with a monotonically increasing integer key is fairly trivial – it’s effectively an append to the index. Things get trickier when the index has to support other searches: keying by email address or subject line causes rows to be inserted in the middle of the index, leading to all kinds of undesirable side effects like page splits and index fragmentation. Performing maintenance on large indexes is problematic in terms of time to complete, CPU time, and I/O.
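A toy simulation makes the difference concrete. This is a deliberately simplified model of leaf pages, not a real B+-tree implementation: pages hold a fixed number of sorted keys and split in half when they overflow.

```python
import bisect
import random

def simulate_inserts(keys, capacity=100):
    """Insert keys into fixed-capacity sorted pages; count page splits
    and inserts that land in the middle of the index."""
    pages = [[]]                            # leaf pages, in key order
    splits = mid_inserts = 0
    for key in keys:
        i = 0                               # route to the covering page
        for j in range(len(pages) - 1, 0, -1):
            if pages[j] and pages[j][0] <= key:
                i = j
                break
        page = pages[i]
        pos = bisect.bisect(page, key)
        if i < len(pages) - 1 or pos < len(page):
            mid_inserts += 1                # not a simple tail append
        page.insert(pos, key)
        if len(page) > capacity:            # overflow: split the page
            mid = len(page) // 2
            pages[i:i + 1] = [page[:mid], page[mid:]]
            splits += 1
    return splits, mid_inserts

_, seq_mid = simulate_inserts(range(10_000))                         # sequential
_, rnd_mid = simulate_inserts(random.sample(range(10_000), 10_000))  # shuffled

print(seq_mid)  # 0: monotonic keys only ever append to the last page
print(rnd_mid)  # thousands: arbitrary keys keep landing mid-index
```

Every one of those mid-index inserts is a candidate for a page split somewhere in the middle of the structure, which is exactly where fragmentation comes from.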
Why Not Use Cassandra?
If MySQL can’t handle this kind of load, why not use Cassandra? Cassandra was purpose-built to handle Facebook’s Inbox Search feature, after all.
Cassandra, unfortunately, has its own problems dealing with scale. One of the difficulties in scaling Cassandra is a poorly designed partitioning scheme, but it’s safe to assume that the Facebook engineering team has enough experience with Cassandra to partition their data effectively. If partitioning isn’t the likely problem, what is?
Is replication the problem?
When a record is written to Cassandra, the data is immediately, and asynchronously, written to every other server that holds a replica of the data. Cassandra allows you to tune the replication settings, varying from no write confirmation through full write confirmation. Different consistency levels are appropriate for different workloads, but they all carry different risks. For example, using a consistency level of ConsistencyLevel.ALL would require every replica to respond before the write returns to the client as successful. Needless to say, this isn’t a recommended best practice. Benjamin Black (blog | twitter) informs me that the recommended best practice is to use ConsistencyLevel.QUORUM with a replication factor of 3.
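The reason QUORUM with a replication factor of 3 works so well is simple arithmetic: as long as reads and writes each touch a majority of replicas, every read set overlaps every write set. A quick sketch:

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1

N = 3                 # replication factor from the best practice above
W = quorum(N)         # writers wait for 2 of 3 acknowledgments
R = quorum(N)         # readers consult 2 of 3 replicas

# Any write set of 2 and any read set of 2 must share a replica,
# so every quorum read sees the latest quorum write.
assert R + W > N

# The cluster also keeps working when a single replica is down:
# 2 live replicas are still enough for both reads and writes.
assert W <= N - 1 and R <= N - 1
```

With ALL instead of QUORUM, W equals N, and a single dead replica is enough to fail every write – which is why ALL isn’t the recommended practice.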
Of course, Cassandra’s implementation details could be a non-issue here: the Facebook implementation of Cassandra is proprietary and may bear little resemblance to the current open source Apache Cassandra codebase.
Since Cassandra’s replication can be configured to meet the needs of a variety of workloads and deployment patterns, we can rule it out as the reason not to choose Cassandra. In this post from Facebook Engineering, the only concrete statement is “We found Cassandra’s eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.” Cassandra is designed to be constantly available and eventually consistent – reads will always succeed, but they may not always return the same data; the inconsistencies are resolved through a read repair process. HDFS (the data store underlying HBase) is designed to be strongly consistent, even if that consistency causes availability problems.
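Read repair itself is easy to sketch. The model below is my own simplification, not Cassandra’s actual implementation: read every replica, keep the version with the newest timestamp, and push that version back to any stale replica.

```python
def read_with_repair(replicas, key):
    """Toy read repair. replicas: dicts mapping key -> (value, timestamp)."""
    latest_value, latest_ts = None, -1
    for node in replicas:                       # read every replica
        value, ts = node.get(key, (None, -1))
        if ts > latest_ts:
            latest_value, latest_ts = value, ts
    for node in replicas:                       # repair stale replicas
        if node.get(key, (None, -1))[1] < latest_ts:
            node[key] = (latest_value, latest_ts)
    return latest_value

r1 = {"msg:1": ("hello", 2)}                    # up-to-date replica
r2 = {"msg:1": ("helo", 1)}                     # stale replica
value = read_with_repair([r1, r2], "msg:1")
print(value)        # hello -- and r2 has been repaired to match r1
```

The catch is the window before the repair happens: two reads hitting different replicas can see different answers, which is exactly the pattern Facebook said was hard to reconcile for Messages.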
Cassandra is a known quantity at Facebook. Cassandra was, after all, developed at Facebook. The operations teams have a great deal of experience with supporting and scaling Cassandra. Why make the switch?
HBase’s replication model (actually, it’s the HDFS replication model) differs from Cassandra’s in its use of replication pipelining. Replication pipelining makes it possible to guarantee data consistency – we know that every node will have the same data once a write has completed – and to guarantee write order (which is important for knowing when a message arrived in your inbox).
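A toy model shows why pipelining gives that guarantee. In the sketch below (my simplification, not the real HDFS protocol), each node persists the block before forwarding it down the chain, and the acknowledgment only travels back once the end of the chain is reached:

```python
def pipeline_write(nodes, block):
    """Write a block through a chain of nodes; the ack returns only
    after every node in the chain has stored the block."""
    if not nodes:
        return True                      # end of chain: ack travels back
    head, rest = nodes[0], nodes[1:]
    head.append(block)                   # persist locally first...
    return pipeline_write(rest, block)   # ...then forward downstream

n1, n2, n3 = [], [], []                  # three replica nodes
ok = pipeline_write([n1, n2, n3], b"block-0001")
print(ok)                                 # True
print(n1 == n2 == n3 == [b"block-0001"])  # True: replicas are identical
```

Contrast that with an asynchronous fan-out: there, the write can be acknowledged while some replicas are still catching up, which is where eventual consistency comes from.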
Because of HDFS’s data replication strategy, some fault tolerance comes for free. Writes are pipelined through multiple servers, so if any single server fails, queries for that region of data can be routed to one of the remaining replicas. When a replacement server is added, the data from the failed server can easily be replicated to it.
In the spirit of full disclosure, this replication is not a feature specific to HBase/HDFS – Cassandra is able to do this as well. Todd Lipcon’s comment includes some information on the differences between HDFS and Cassandra replication. It’s important to remember that HBase/HDFS and Cassandra use two different replication methodologies: Cassandra is a highly available, eventually consistent system, while HBase/HDFS is a strongly consistent system. Each is appropriate for certain types of tasks.
One of the criticisms of many NoSQL databases is that they do not log data before writing it. That is clearly not the case here: both HBase and Cassandra use a write-ahead log to make sure data is safely recorded in the log before it is persisted to the data files. This allows either data store, much like a relational database, to recover from a crash without having to flush every newly inserted row to disk.
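A write-ahead log is simple enough to sketch in a few lines. This is a minimal illustration of the idea, not how HBase or Cassandra actually implement it: append and fsync the log entry first, apply the change second, and replay the log on startup.

```python
import os
import tempfile

class TinyStore:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()                       # rebuild state from the log

    def put(self, key, value):
        with open(self.log_path, "a") as log:
            log.write(f"{key}\t{value}\n")
            log.flush()
            os.fsync(log.fileno())           # durable before acknowledging
        self.data[key] = value               # apply only after logging

    def _replay(self):
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    key, value = line.rstrip("\n").split("\t", 1)
                    self.data[key] = value

path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(path)
store.put("msg:1", "hello")

recovered = TinyStore(path)                  # simulate a restart after a crash
print(recovered.data)                        # {'msg:1': 'hello'}
```

The sequential appends to the log are cheap; the expensive random writes to the data files can happen later, in bulk, without risking data loss.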
The technology behind HBase – Hadoop and HDFS – is very well understood and has been used previously at Facebook. During the development of their data warehouse, Facebook opted to use Hive, a set of tools for data analytics built on top of Hadoop and HDFS. While Hive performs very well with analytic queries on largely static data, it does not handle rapidly changing data well, which makes it poorly suited for the new Social Messaging feature. But because Hive shares Hadoop and HDFS with HBase, those underlying technologies are already well understood by Facebook’s operations teams: the same technology that allows Facebook to scale its data warehouse will be the technology that allows it to scale Social Messaging, and the operations team already understands many of the problems they will encounter.
There are, of course, other technologies involved, but the most interesting part to me is the data store. Facebook’s decision to use HBase is a huge milestone: HBase has come of age and is being used by a very prominent adopter. Facebook’s developers are contributing their improvements back to the open source community – they’ve publicly said that they are running the open source version of Hadoop.