Dec 20, 2012

mongoDB Sharding

If I should have made some safe bets on the near future, I would choose two: Hadoop and mongoDB. 

There is a huge demand for both technologies and many players consider these technologies as a foundation for their future products.

Sharding?
MySQL Sharding was a major issue for large scale installations and it is the same for mongoDB large installations.

Back to Basics
mongo is pretty similar to a regular database, but it has two main advantages: 1) Software engineers love it as it can easily be used for object persist-ency and 2) it support unstructured objects (documents) that can easily store different objects based on the same virtual class.

mongoDB terms

  1. Database: database
  2. Collections: very similar to tables.
  3. Documents: very similar to rows. Yet, a document can be as flexible as a JSON document can be. For example, it may include 1 to many fields in the document itself.
  4. mongod: a mongoDB instance or shard.
  5. Chunk: a 64MB storage unit that stores documents.
  6. Config database: Chunks to mongos mapping directory.
Why use sharding?
  1. Support large dataset using commodity servers.
  2. Support high IO requirements using commodity disks.
What are mongoDB sharding features?


  1. Range-based Data Partitioning: a very similar method to MySQL partitioning. You should choose one or more fields (shard key) that sharding will be based on. You should choose a shard key according to the business logic, like splitting according to account id in a SaaS application.
  2. Automatic Data Volume Distribution: mongoDB will take care of the shards balancing by itself according to the chosen shard key.
  3. Transparent Query Routing: mongoDB takes care of queries map reduce to multiple shared by itself when a query does not match the shard key (very much like Hadoop).
Key Recommendations for mongoDB Sharding
  1. Sufficient Carnality: choose a shard key that can be split later to more shards if a database size is getting too large (exceeds chunk size).
  2. Uniform Distribution: choose a sharding key that will spread a in uniform distribution to avoid unbalanced design.
  3. Distribute Write Operations: if you have a billing system, prefer to shard according to account id rather than shard according to billing month. Otherwise, in a given day, probably only a single shard will be used.
  4. Query according to the shard key: if any of your queries will include the shard key, each of your queries will result in a single shard query. Otherwise, it will generate N queries (one per shard).
Technical Aspects for mongoDB Sharding
  1. Every sharded collection must have an index that its first fields are the shard key (use shardCollection for that).
  2. Chunk size default limit is 64MB
  3. When a chunk reaches this limit, mongoDB will split it to two.
  4. If chunks are not distributed uniformly, mongoDB will start migrating chunks between different mongos.
  5. Cluster Balancer is taking care of this process.
  6. Balancing can cause performance issues and therefore can be restricted to off peak hours (nights and weekends for example) using balancing windows.
  7. The shards mapping to mongos is saved at the config database.
  8. Replication should be considered as well  a complementary method.
Bottom Line
mongoDB brings to the table an out of the box sharding solution that can scale your operations. Now, you only need to analyze your needs and select the right solution for them.

Keep Performing,

ShareThis

Intense Debate Comments

Ratings and Recommendations