MongoDB - Now With Sharding

August 09, 2010

Version 1.6 of MongoDB was released last week while I was at devLink. What makes this an important point release? Two things: sharding and replica sets.

Sharding is a way of partitioning data. Data will be sharded by a key – this could be based on zip code, customer number, SKU, or any other aspect of the data. Sharding data in MongoDB occurs on a collection by collection basis. If you want related data to be on the same server you’ll need to pick a sharding key that is present in every collection. However, this design decision make sense because some collections may grow faster than others. The point of sharding is, in part, to spread load across multiple servers. What makes this important? Previously, sharding has been difficult to set up and administer and required custom application code to be written and maintained. More important, though, sharding gives MongoDB the ability to scale across multiple servers with minimal effort. What would happen if a single node in a set of sharded servers got very busy? Well, MongoDB would detect that one of the nodes is growing faster than the others and it would start balancing the load across the other servers. This might seem like it would violate my earlier statement about how we set MongoDB up to using a sharding key that we define. Here’s the catch: MongoDB only uses that sharding key when we set things up and when there are no problems. If things start getting busy, it will make changes to the sharding key. Those changes get reported throughout the entire cluster of servers and everyone knows where their data is, although nobody outside of the cluster really needs to care.

Replica sets are a new and improved way to perform data replication in MongoDB. Basically, we set up replication in a cluster of servers. If any single server fails, another server in the replica set will pick up the load. Once we’re able to get the dead server back up and running, the replica set will automatically start up a recovery process and our users will never know that there was an outage. There can be only one master server at any given time, so this protects us from master server failures. Through the magic of network heartbeats, we can be aware of all of the servers in the replica set. Interestingly, the master server is determined by a priority setting that is assigned to each server. This way, we could use older hardware to serve as a backup (or read-only server) to the master and use faster machines in the replica set to take over from the master in the event of any kind of hardware failure.

How It Works

[caption id=“attachment_1764” align=“alignnone” width=“572”] MongoDB Sharding Diagram[/caption]

Basically, here’s what happens (if you want more details, please see theSharding Introduction):

The mongos server is a router that makes our complicated MongoDB set up look like a single server to the application.
The mongod config servers maintain the shards. They know where data is stored and will attempt to balance the shards if any single node gets out of whack.
Replica sets provide localized redundancy for each shard key.

Gotchas

There are a few things to be aware when you’re considering sharding with MongoDB:

If a configuration server goes down, you can no longer reallocate data if any shards become write hot spots. This meta-data must be writeable for data to be repartitioned. You can still read and write data, but load will not be distributed.
Choose sharding keys wisely. An overly broad sharding key will do you no good: all data can end up on one node and you will be unable to split the data onto multiple nodes.
Some queries will use multiple shards – make sure you understand data distribution, querying patterns, and potential sharding keys.

Photo Credits

glass litter by psyberartist – Creative Commons Licensed I thought I saw a puddy cat… by Keven Law – Creative Commons Licensed