Mammatus Blog

MongoDB Database: Replica Set, Autosharding, Journaling, Architecture Part 2

posted Sep 30, 2012, 8:00 PM by Rick Hightower   [ updated Sep 30, 2012, 8:06 PM ]

See part 1 of MongoDB Architecture...

Journaling: Is durability overvalued if RAM is the new Disk? Data Safety versus durability

It may seem strange to some that journaling was added to MongoDB as late as version 1.8, and that it only became the default on 64-bit operating systems in MongoDB 2.0. Prior to that, if the data was very important, you typically used replication to make sure write operations were copied to a replica before proceeding. The thought was that one server might go down, but two servers are very unlikely to go down at the same time. Unless somebody backs a truck into a high-voltage utility pole, causing all of your air conditioning equipment to stop working long enough for all of your servers to overheat at once, but that never happens (it happened to Rackspace and Amazon). And if you were worried about this, you would have replication across availability zones, but I digress.

At one point MongoDB did not have single-server durability; now it does with the addition of journaling. But this is far from a moot point. The general thought from the MongoDB community was, and maybe still is, that to achieve Web Scale, durability was a thing of the past. After all, memory is the new disk. If you could get the data onto a second server or two, then the chances of them all going down at once are very, very low. How often do servers go down these days? What are the chances of two servers going down at once? In short, durability was considered overvalued and just not Web Scale. Whether this is a valid point or not, there was much fun made about this at MongoDB's expense (rated R, Mature 17+).

As you recall, MongoDB uses memory-mapped files for its storage engine, so it could be a while before the data in memory gets synced to disk by the operating system. Thus if you did have several machines go down at once (which should be very rare), complete recoverability would be impossible. There were workarounds with tradeoffs. For example, to get around this (now non-issue) or minimize it, you could force MongoDB to do an fsync of the data in memory to the file system, but as you guessed, even with RAID level four and a really awesome server, that can get slow quick. The moral of the story is that MongoDB has journaling as well as many other options, so you can decide on the best engineering tradeoff between data safety, raw speed, and scalability. You get to pick. Choose wisely.

The reality is that no solution offers "complete" reliability, and if you are willing to allow for some loss (which you can with some data), you can get enormous improvements in speed and scale. Let's face it, your virtual farm game data is just not as important as Wells Fargo's bank transactions. I know your mom will get upset when she loses the virtual tractor she bought for her virtual farm with her virtual money, but unless she pays real money she will likely get over it. I've lost a few posts on Twitter over the years, and I have not sued once. If your servers each have an uptime of 99 percent and you block/replicate to three servers, then, assuming failures are independent, the probability of all three being down at once is 0.01 × 0.01 × 0.01 = 0.000001, or 1 in 1,000,000. Of course the uptime of modern operating systems (Linux) is much higher than this, so one in 100,000,000 or more is possible with just three servers. Amazon EC2 offers discounts if they can't maintain an SLA of 99.95% (other cloud providers have even higher SLAs). If you were worried about geographic problems, you could replicate to another availability zone or geographic area connected with a high-speed WAN. How much speed and reliability do you need? How much money do you have?
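To make the back-of-the-envelope math above easy to rerun with your own numbers, here is a minimal Python sketch. The function name is made up for illustration, and it assumes server outages are independent of each other, which a shared power feed or data center only approximates.

```python
def simultaneous_outage_probability(uptime, replicas):
    """Probability that every replica is down at the same moment.

    Assumes outages are independent, which real data centers only approximate.
    """
    downtime = 1.0 - uptime
    return downtime ** replicas


# Compare the 99% figure from the text with a 99.95% SLA-style figure.
for uptime in (0.99, 0.9995):
    p = simultaneous_outage_probability(uptime, replicas=3)
    print("uptime %.2f%%: about 1 in %.0f" % (uptime * 100, 1 / p))
```

Changing replicas to 2 shows how much of the safety comes from the third server.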

An article on when to use MongoDB journaling versus the older recommendations would be a welcome addition. Generally it seems journaling is mostly a requirement for very sensitive financial data and single-server solutions. Your results may vary, and don't trust my math: it has been a few years since I got a B+ in statistics, and I am no expert on the SLAs of modern commodity servers (the above was just spitballing).

If you have ever used a single non-clustered RDBMS for a production system that relied on frequent backups and a transaction log (journaling) for data safety, raise your hand. OK, if you raised your hand, then you just may not need autosharding or replica sets. To start with MongoDB, just use a single server with journaling turned on. If you require speed, you can configure MongoDB journaling to batch writes to the journal (which is the default). This is a good model to start out with and probably very much like quite a few applications you have already worked on (assuming that most applications don't need high availability). The difference is, of course, that if your application is later deemed to need high availability, read scalability, or write scalability, MongoDB has you covered. Also, setting up high availability seems easier on MongoDB than on other more established solutions.
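Why does batching journal writes help speed? The toy Python sketch below illustrates the general idea of syncing to disk once per batch instead of once per write. It is not MongoDB's journal format or its commit logic: the class, the JSON-lines file, and batching by count are all assumptions made for illustration.

```python
import json
import os
import tempfile


class BatchedJournal(object):
    """Toy append-only journal that fsyncs once per batch, not once per write."""

    def __init__(self, path, batch_size=3):
        self.path = path
        self.batch_size = batch_size
        self.pending = []      # writes accepted but not yet durable
        self.fsync_count = 0   # how many times we paid for a disk sync

    def write(self, document):
        self.pending.append(document)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with open(self.path, "a") as journal_file:
            for document in self.pending:
                journal_file.write(json.dumps(document) + "\n")
            journal_file.flush()
            os.fsync(journal_file.fileno())  # one disk sync for the whole batch
        self.fsync_count += 1
        self.pending = []


journal_path = os.path.join(tempfile.mkdtemp(), "journal.log")
journal = BatchedJournal(journal_path, batch_size=3)
for age in range(7):
    journal.write({"name": "employee-%d" % age, "age": age})
journal.flush()  # make the final partial batch durable
print("writes: 7, fsyncs: %d" % journal.fsync_count)
```

The tradeoff is visible in the pending list: anything still sitting there when the process dies is lost, which is the window you accept in exchange for fewer syncs.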

Figure 3: Simple setup with journaling and single server ok for a lot of applications


Simple Non-Sharded, non replicated installation


If you can afford two other servers and your app reads more than it writes, you can get improved high availability and increased read scalability with replica sets. If your application is write intensive then you might need autosharding. The point is you don't have to be Facebook or Twitter to use MongoDB. You can even be working on a one-off dinky application. MongoDB scales down as well as up.


Replica sets are good for failover and speeding up reads, but to speed up writes, you need autosharding. According to a talk by Roger Bodamer on Scaling with MongoDB, 90% of projects do not need autosharding. Conversely almost all projects will benefit from replication and high availability provided by replica sets. Also once MongoDB improves its concurrency in version 2.2 and beyond, it may be the case that 97% of projects don't need autosharding. 

Sharding allows MongoDB to scale horizontally. Sharding is also called partitioning. You assign each of your servers a portion of the data to hold, or the system does this for you. MongoDB can automatically change partitions for optimal data distribution and load balancing, and it allows you to elastically add new nodes (MongoDB instances). How to set up autosharding is beyond the scope of this introductory article. Autosharding can support automatic failover (along with replica sets). There is no single point of failure. Remember, 90% of deployments don't need sharding, but if you do need scalable writes (apps like Foursquare, Twitter, etc.), then autosharding was designed to work with minimal impact on your client code.

There are three main process actors for autosharding: mongod (the database daemon), mongos, and the client driver library. Each mongod instance gets a shard. Mongod is the process that manages databases and collections. Mongos is a router: it routes writes to the correct mongod instance for autosharding. Mongos also works out which shards will have data for a query. To the client driver, mongos looks more or less like a mongod process (autosharding is transparent to the client drivers).

Figure 4: MongoDB Autosharding

Sharding with MongoDB

Autosharding increases write and read throughput, and helps with scale out. Replica sets are for high availability and read throughput. You can combine them as shown in figure 5.

Figure 5: MongoDB Autosharding plus Replica Sets for scalable reads, scalable writes, and high availability

MongoDB Autosharding for Scalable Reads, Scalable Writes and High Availability

You shard on an indexed field in a document. Mongos collaborates with config servers (mongod instances acting as config servers), which hold the shard topology (where the key ranges live). Shards are just normal mongod instances. Config servers hold metadata about the cluster and are also mongod instances.
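As a rough picture of what "where the key ranges live" means for routing, here is a small Python sketch. The split points, shard names, and function names are invented for illustration, and the real mongos and config server protocol is far more involved than a sorted list lookup.

```python
import bisect

# Hypothetical topology: two split points divide the shard key space into
# three ranges: [min, "H"), ["H", "P"), ["P", max). A real cluster keeps
# this mapping on the config servers.
SPLIT_POINTS = ["H", "P"]
RANGE_SHARDS = ["shard0", "shard1", "shard2"]


def shard_for_key(shard_key_value):
    """Return the shard whose key range contains shard_key_value."""
    index = bisect.bisect_right(SPLIT_POINTS, shard_key_value)
    return RANGE_SHARDS[index]


def shards_for_query(query, shard_key="name"):
    """Target one shard when the query names the shard key, else ask them all."""
    if shard_key in query:
        return [shard_for_key(query[shard_key])]
    return list(RANGE_SHARDS)


print(shards_for_query({"name": "Lucas Hightower"}))  # targeted at one shard
print(shards_for_query({"age": {"$lt": 35}}))         # no shard key: all shards
```

The second call shows why the choice of shard key matters: queries that do not include it have to be sent to every shard.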

Shards are further broken down into 64 MB pieces called chunks. A chunk is 64 MB worth of documents for a collection. Config servers record which shard each chunk lives in. The autosharding happens by moving these chunks around and distributing them among the individual shards. The mongos processes have a balancer routine that wakes up every so often and checks how many chunks a particular shard has. If a particular shard has too many chunks (nine more chunks than another shard), then mongos starts to move data from one shard to another to balance the data capacity amongst the shards. Once the data is moved, the config servers are updated in a two-phase commit (updates to the shard topology are only allowed if all three config servers are up).
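The balancing idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: the threshold of nine comes from the paragraph above, while the function name, the rule of moving one chunk at a time, and stopping as soon as the gap drops below the threshold are choices made here for illustration and not a description of the real balancer.

```python
MIGRATION_THRESHOLD = 9  # chunk-count gap that triggers balancing, per the text


def balance_once(chunks_by_shard, threshold=MIGRATION_THRESHOLD):
    """Move one chunk from the fullest shard to the emptiest if the gap is too large.

    Returns True when a chunk was moved, False when the cluster is balanced enough.
    """
    fullest = max(chunks_by_shard, key=lambda shard: len(chunks_by_shard[shard]))
    emptiest = min(chunks_by_shard, key=lambda shard: len(chunks_by_shard[shard]))
    gap = len(chunks_by_shard[fullest]) - len(chunks_by_shard[emptiest])
    if gap < threshold:
        return False
    chunk = chunks_by_shard[fullest].pop()
    chunks_by_shard[emptiest].append(chunk)
    return True


# Hypothetical cluster: one crowded shard and one freshly added shard.
cluster = {
    "shard0": ["chunk-%d" % n for n in range(20)],
    "shard1": ["chunk-%d" % n for n in range(20, 24)],
}
moves = 0
while balance_once(cluster):
    moves += 1
print("moves: %d, shard0: %d, shard1: %d"
      % (moves, len(cluster["shard0"]), len(cluster["shard1"])))
```

Note that no chunk is ever created or destroyed, only reassigned, which is why the config servers only need to update which shard owns which chunk.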

The config servers contain a versioned shard topology and are the gatekeepers for autosharding balancing. This topology maps which shard has which keys. The config servers are like DNS servers for shards. The mongos process uses the config servers to find where shard keys live. Mongod instances are shards that can be replicated using replica sets for high availability. Mongos and config server processes do not need to be on their own server and can, for example, live on a primary box of a replica set. For sharding you need at least three config servers, and shard topologies cannot change unless all three are up at the same time. This ensures consistency of the shard topology. The full autosharding topology is shown in figure 6. An excellent talk on the internals of MongoDB sharding was given by Kristina Chodorow, author of Scaling MongoDB, at OSCON 2011 if you would like to know more.

If you would like to learn more about MongoDB consider the following resources:


    Figure 6: MongoDB Autosharding full topology for large deployment including Replica Sets, Mongos routers, Mongod Instance, and Config Servers 

    MongoDB Autosharding full topology for large deployment including Replica Sets, mongos routers, mongod instance, client drivers and config servers


    Installing MongoDB Database to work with PHP and Apache (tutorial)

    posted Sep 30, 2012, 2:24 PM by Rick Hightower   [ updated Sep 30, 2012, 3:15 PM ]

    Installing and setting up MongoDB to work with PHP

    See installing MongoDB to install MongoDB.

    Node.js, Ruby, and Python, in that order, are the trendsetter crowd in our industry circa 2012. Java is the corporate crowd, and PHP is the workhorse of the Internet: the "get it done" crowd. You can't have a decent NoSQL solution without having good PHP support. Installing MongoDB to work with PHP is simple.

    To install MongoDB support with PHP use pecl as follows:

    $ sudo pecl install mongo

    Add the module to php.ini (for the pecl mongo extension on Linux or Mac OS X this is typically the line extension=mongo.so).

    Then assuming you are running it on apache, restart as follows:

    $ apachectl stop
    $ apachectl start

    Figure 1 shows our roughly equivalent code listing in PHP.

    Figure 1 PHP code listing

    PHP MongoDB

    The output for figure 1 is as follows:

    array ( '_id' => MongoId::__set_state(array( '$id' => '4f964d3000b5874e7a163895', )), 'name' => 'Rick Hightower', 
    'gender' => 'm', 'phone' => '520-555-1212', 'age' => 42, )
    array ( '_id' => MongoId::__set_state(array( '$id' => '4f984cae72329d0ecd8716c8', )), 'name' => 'Diana Hightower', 'gender' => 'f', 
    'phone' => '520-555-1212', 'age' => 30, )
    array ( '_id' => MongoId::__set_state(array( '$id' => '4f9e170580cbd54f27000000', )), 'gender' => 'm', 'age' => 8, 'name' => 'Lucas Hightower', 
    'phone' => '520-555-1212', )

    The other half of the equation is in figure 2.

    Figure 2 PHP code listing

    PHP, MongoDB

    The output for figure 2 is as follows:

    array ( '_id' => MongoId..., 'name' => 'Rick Hightower', 'gender' => 'm', 
    'phone' => '520-555-1212', 'age' => 42, )
    array ( '_id' => MongoId::..., 'name' => 'Diana Hightower', 'gender' => 'f', 
    'phone' => '520-555-1212', 'age' => 30, )
    Diana by id? 
    array ( '_id' => MongoId::..., 'name' => 'Diana Hightower', 'gender' => 'f', 
    'phone' => '520-555-1212', 'age' => 30, )

    Here is the complete PHP listing.

    PHP complete listing

    <?php
    // Connect to the local server and select the tutorial database
    $m = new Mongo();
    $db = $m->selectDB("tutorial");
    $employees = $db->selectCollection("employees");
    // List every employee
    $cursor = $employees->find();
    foreach ($cursor as $employee) {
      echo var_export ($employee, true) . "<br />";
    }
    // Find by name
    $cursor=$employees->find( array( "name" => "Rick Hightower"));
    echo "Rick? <br /> " . var_export($cursor->getNext(), true);
    // Find the first employee younger than 35
    $cursor=$employees->find(array("age" => array('$lt' => 35)));
    echo "Diana? <br /> " . var_export($cursor->getNext(), true);
    // Find by object id
    $cursor=$employees->find(array("_id" => new MongoId("4f984cce72320612f8f432bb")));
    echo "Diana by id? <br /> " . var_export($cursor->getNext(), true);

    If you like Object mapping to documents you should try the poorly named MongoLoid for PHP.

      Mongo DB is wrong. It is MongoDB. Always put the DB next to Mongo. 

      Installing and Setting up MongoDB Database with Python (Mongo DB tutorial for Python)

      posted Sep 30, 2012, 2:20 PM by Rick Hightower   [ updated Oct 7, 2012, 11:27 PM ]

      Python MongoDB Setup

      See setup guide to see how to install MongoDB.

      Setting up Python and MongoDB are quite easy since Python has its own package manager.

      To install the MongoDB library for Python on Mac OS X, you would do the following:

      $ sudo env ARCHFLAGS='-arch i386 -arch x86_64' python -m easy_install pymongo

      To install Python MongoDB on Linux or Windows do the following:

      $ easy_install pymongo

      or

      $ pip install pymongo

      If you don't have easy_install on your Linux box you may have to do some sudo apt-get install python-setuptools or sudo yum install python-setuptools iterations, although it seems to be usually installed with most Linux distributions these days. If easy_install or pip is not installed on Windows, try reformatting your hard disk and installing a real OS, or if that is too inconvenient go here. The key here is to install pymongo.

      Once you have it all set up, you can create some code that is equivalent to the first console examples, as shown in figure 1.

      Figure 1: Python code listing part 1

      Python, MongoDB, Pymongo

      Python does have literals for maps, so working with Python is much closer to the JavaScript/Console from earlier than Java is. Like Java, there are libraries for Python that work with MongoDB (MongoEngine, MongoKit, and more). Even executing queries is very close to the JavaScript experience, as shown in figure 2.

      Figure 2: Python code listing part 2

      Python, MongoDB, Pymongo

      Here is the complete listing to make the cut and paste crowd (like me), happy.

      Listing: Complete Python listing

      import pymongo
      from bson.objectid import ObjectId

      # Connect to the local server and pick the tutorial database
      connection = pymongo.Connection()
      db = connection["tutorial"]
      employees = db["employees"]
      employees.insert({"name": "Lucas Hightower", 'gender':'m', 'phone':'520-555-1212', 'age':8})
      # List every employee
      cursor = employees.find()
      for employee in cursor:
          print employee
      # Look up by name, by age range, and by _id
      print employees.find({"name":"Rick Hightower"})[0]
      cursor = employees.find({"age": {"$lt": 35}})
      for employee in cursor:
          print "under 35: %s" % employee
      diana = employees.find_one({"_id":ObjectId("4f984cce72320612f8f432bb")})
      print "Diana %s" % diana

        The output for the Python example is as follows:

        {u'gender': u'm', u'age': 42.0, u'_id': ObjectId('4f964d3000b5874e7a163895'), u'name': u'Rick Hightower', u'phone':
        {u'gender': u'f', u'age': 30, u'_id': ObjectId('4f984cae72329d0ecd8716c8'), u'name': u'Diana Hightower', u'phone':
        {u'gender': u'm', u'age': 8, u'_id': ObjectId('4f9e111980cbd54eea000000'), u'name': u'Lucas Hightower', u'phone':

        Installing MongoDB Database to Work with Java (MongoDB Java Tutorial)

        posted Sep 30, 2012, 2:16 PM by Rick Hightower   [ updated Sep 30, 2012, 3:14 PM ]

        Java and MongoDB

        See install guide for a quick MongoDB Tutorial.

        Pssst! Here is a dirty little secret. Don't tell your Node.js friends or Ruby friends this. More Java developers use MongoDB than Ruby and Node.js developers. They just are not as loud about it. Using MongoDB with Java is very easy. 

        The language driver for Java seems to be a straight port of something written with JavaScript in mind, and the usability suffers a bit because Java does not have literals for maps/objects like JavaScript does. Thus an API written for a dynamic language does not quite fit Java. There is room for a lot of usability improvement in the MongoDB Java language driver (hint, hint). There are alternatives to using just the straight MongoDB language driver, but I have not picked a clear winner (mjorm, morphia, and Spring Data MongoDB support). I'd love just some usability improvements in the core driver without the typical Java annotation fetish, perhaps a nice Java DAO DSL (see the section on criteria DSL if you follow the link). 

        Setting up Java and MongoDB

        Let's go ahead and get started then with Java and MongoDB.

        Download the latest mongo driver from GitHub, put it somewhere, and then add it to your classpath as follows:

        $ mkdir -p tools/mongodb/lib
        $ cp mongo-2.7.3.jar tools/mongodb/lib

        The steps below assume you are using Eclipse, but if you are not, by now you know how to translate these instructions to your IDE anyway. The short story is: put the mongo jar file on your classpath. You can put the jar file anywhere, but I like to keep mine in ~/tools/.

        If you are using Eclipse it is best to create a classpath variable so other projects can use the same variable and not go through the trouble. Create new Eclipse Java project in a new Workspace. Now right click your new project, open the project properties, go to the Java Build Path->Libraries->Add Variable->Configure Variable shown in figure 7.

        Figure 7: Adding Mongo jar file as a classpath variable in Eclipse

        Eclipse, Java, MongoDB

        For Eclipse from the "Project Properties->Java Build Path->Libraries", click "Add Variable", select "MONGO", click "Extend…", select the jar file you just downloaded.

        Figure 8: Adding Mongo jar file to your project

        Eclipse, MongoDB, Java

        Once you have it all setup, working with Java and MongoDB is quite easy as shown in figure 9.

        Figure 9 Using MongoDB from Eclipse

        Java and MongoDB

        The above is roughly equivalent to the console/JavaScript code that we were doing earlier. The BasicDBObject is a type of Map with some convenience methods added. The DBCursor is like a JDBC ResultSet. You execute queries with DBCollection. There is no query syntax, just finder methods on the collection object. The output from the above is:

        { "_id" : { "$oid" : "4f964d3000b5874e7a163895"} , "name" : "Rick
        Hightower" , "gender" : "m" , "phone" : "520-555-1212" ,
        "age" : 42.0}
        { "_id" : { "$oid" : "4f984cce72320612f8f432bb"} , "name" : "Diana
        Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
        "age" : 30}


        Once you create some documents, querying for them is quite simple, as shown in figure 10.

        Figure 10: Using Java to query MongoDB 

        Java and MongoDB

        The output from figure 10 is as follows:

        { "_id" : { "$oid" : "4f964d3000b5874e7a163895"} , "name" : "Rick
        Hightower" , "gender" : "m" , "phone" : "520-555-1212" ,
        "age" : 42.0}
        { "_id" : { "$oid" : "4f984cae72329d0ecd8716c8"} , "name" : "Diana
        Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
        "age" : 30}
        Diana by object id?
        { "_id" : { "$oid" : "4f984cce72320612f8f432bb"} , "name" : "Diana
        Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
        "age" : 30}

        Just in case anybody wants to cut and paste any of the above, here it is again all in one go in the following listing.

        Listing: Complete Java Listing

        package com.mammatustech.mongo.tutorial;

        import org.bson.types.ObjectId;
        import com.mongodb.BasicDBObject;
        import com.mongodb.DBCollection;
        import com.mongodb.DBCursor;
        import com.mongodb.DBObject;
        import com.mongodb.Mongo;
        import com.mongodb.DB;

        public class Mongo1Main {
            public static void main (String [] args) throws Exception {
                // Connect to the local server and pick the tutorial database
                Mongo mongo = new Mongo();
                DB db = mongo.getDB("tutorial");
                DBCollection employees = db.getCollection("employees");
                employees.insert(new BasicDBObject().append("name", "Diana Hightower")
                  .append("gender", "f").append("phone", "520-555-1212").append("age", 30));
                // List every employee
                DBCursor cursor = employees.find();
                while (cursor.hasNext()) {
                    DBObject object = cursor.next();
                    System.out.println(object);
                }
                //> db.employees.find({name:"Rick Hightower"})
                cursor=employees.find(new BasicDBObject().append("name", "Rick Hightower"));
                System.out.println(cursor.next());
                //> db.employees.find({age:{$lt:35}})
                BasicDBObject query = new BasicDBObject();
                query.put("age", new BasicDBObject("$lt", 35));
                cursor = employees.find(query);
                System.out.println(cursor.next());
                //> db.employees.findOne({_id : ObjectId("4f984cce72320612f8f432bb")})
                DBObject dbObject = employees.findOne(new BasicDBObject().append("_id", 
                        new ObjectId("4f984cce72320612f8f432bb")));
                System.out.printf("Diana by object id?\n%s\n", dbObject);
            }
        }

        Please note that the above is completely missing any error checking or resource cleanup. You will need to do some of course (try/catch/finally, close the connection, you know, that sort of thing).

          Mongo DB is wrong. It is MongoDB. Always put the DB next to Mongo. 

          Installing MongoDB Database

          posted Sep 30, 2012, 2:15 PM by Rick Hightower   [ updated Sep 30, 2012, 3:13 PM ]

          Installing MongoDB Database: Guide to getting started with MongoDB Database and install guide 

          Let's mix in some code samples to try out along with the concepts.

          To install MongoDB, go to their download page, then download and untar/unzip the download to ~/mongodb-platform-version/. Next you want to create the directory that will hold the data and create a mongodb.config file (/etc/mongodb/mongodb.config) that points to said directory as follows:

          Listing: Installing MongoDB

          $ sudo mkdir -p /etc/mongodb/data
          $ cat /etc/mongodb/mongodb.config 
          dbpath=/etc/mongodb/data

          The /etc/mongodb/mongodb.config has one line dbpath=/etc/mongodb/data that tells mongo where to put the data. Next, you need to link mongodb to /usr/local/mongodb and then add it to the path environment variable as follows:

          Listing: Setting up MongoDB on your path

          $ sudo ln -s  ~/mongodb-platform-version/  /usr/local/mongodb
          $ export PATH=$PATH:/usr/local/mongodb/bin

          Run the server passing the configuration file that we created earlier.

          Listing: Running the MongoDB server

          $ mongod --config /etc/mongodb/mongodb.config

          Short tutorial on using MongoDB

          Mongo comes with a nice console application called mongo that lets you execute commands and JavaScript. JavaScript to Mongo is what PL/SQL is to Oracle's database. Let's fire up the console app and poke around.

          Firing up the mongo console application

          $ mongo
          MongoDB shell version: 2.0.4
          connecting to: test
          > db.version()

          One of the nice things about MongoDB is the self-describing console. It is easy to see what commands a MongoDB database supports with db.help(), as follows:

          Client: mongo

          DB methods:
          db.addUser(username, password[, readOnly=false])
          db.auth(username, password)
          db.commandHelp(name) returns the help for the command
          db.copyDatabase(fromdb, todb, fromhost)
          db.createCollection(name, { size : ..., capped : ..., max : ... } )
          db.currentOp() displays the current operation in the db
          db.eval(func, args) run code server-side
          db.getCollection(cname) same as db['cname'] or db.cname
          db.getLastError() - just returns the err msg string
          db.getLastErrorObj() - return full status object
          db.getMongo() get the server connection object
          db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
          db.getProfilingStatus() - returns if profiling is on and slow threshold 
          db.getSiblingDB(name) get the db at the same server as this one
          db.isMaster() check replica primary status
          db.killOp(opid) kills the current operation in the db
          db.listCommands() lists all the db commands
          db.runCommand(cmdObj) run a database command.  if cmdObj is a string, turns it into { cmdObj : 1 }
          db.setProfilingLevel(level,{slowms}) 0=off 1=slow 2=all
          db.version() current version of the server
          db.getMongo().setSlaveOk() allow queries on a replication slave server
          db.fsyncLock() flush data to disk and lock server for backups
          db.fsyncUnlock() unlocks server following a db.fsyncLock() 

          You can see some of the commands refer to concepts we discussed earlier. Now let's create an employee collection, and do some CRUD operations on it.

          Create Employee Collection

          > use tutorial;
          switched to db tutorial
          > db.getCollectionNames();
          [ ]
          > db.employees.insert({name:'Rick Hightower', gender:'m', phone:'520-555-1212', age:42});
          Mon Apr 23 23:50:24 [FileAllocator] allocating new datafile /etc/mongodb/data/tutorial.ns, ...

          The use command switches to a database. If that database does not exist, it will be lazily created the first time we access it (write to it). The db object refers to the current database. The current database does not have any document collections to start with (this is why db.getCollectionNames() returns an empty list). To create a document collection, just insert a new document. Collections, like databases, are lazily created when they are actually used. You can see that two collections were created when we inserted our first document into the employees collection, as follows:


          > db.getCollectionNames();
          [ "employees", "system.indexes" ]

          The first collection is our employees collection and the second collection is used to hold onto indexes we create.

          To list all employees you just call the find method on the employees collection.

          > db.employees.find()
          { "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
              "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

          The above is the query syntax for MongoDB. There is not a separate SQL like language. You just execute JavaScript code, passing documents, which are just JavaScript associative arrays, err, I mean JavaScript objects. To find a particular employee, you do this:

          > db.employees.find({name:"Bob"})

          Bob quit so to find another employee, you would do this:

          > db.employees.find({name:"Rick Hightower"})
          { "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

          The console application just prints out the document right to the screen. I don't feel 42. At least I am not 100 as shown by this query:

          > db.employees.find({age:{$lt:100}})
          { "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

          Notice that to get employees younger than 100, you pass a document with a subdocument: the key is the operator ($lt), and the value is the operand (100). Mongo supports all of the operators you would expect, like $lt for less than, $gt for greater than, etc. If you know JavaScript, it is easy to inspect fields of a document, as follows:

          > db.employees.find({age:{$lt:100}})[0].name
          Rick Hightower
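To make the shape of these query documents concrete, here is a toy matcher in pure Python. It is only a sketch of the convention described above (a plain value means equality, a subdocument means operator and operand); the function name, the operator table, and the sample data are made up for illustration, and this is not how the server evaluates queries.

```python
import operator

# The comparison operators named in the text plus their inclusive siblings.
OPERATORS = {
    "$lt": operator.lt,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$gte": operator.ge,
}


def matches(document, query):
    """Return True if document satisfies every condition in a Mongo-style query."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # Subdocument form: {field: {"$lt": 100}}
            for op_name, operand in condition.items():
                if not OPERATORS[op_name](value, operand):
                    return False
        elif value != condition:
            # Plain form: {field: value} means equality
            return False
    return True


# Hypothetical in-memory stand-in for the employees collection.
employees = [
    {"name": "Rick Hightower", "age": 42},
    {"name": "Diana Hightower", "age": 30},
    {"name": "Lucas Hightower", "age": 8},
]
under_35 = [e["name"] for e in employees if matches(e, {"age": {"$lt": 35}})]
print(under_35)
```

Because the query is just data, you can build it up programmatically, for example {"age": {"$gt": 5, "$lt": 35}} for a range.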

          If we were going to query, sort, or shard on a field such as name, then we would need to create an index as follows:

          > db.employees.ensureIndex({name:1}); //ascending index, descending would be -1

          Indexing by default is a blocking operation, so if you are indexing a large collection, it could take several minutes and perhaps much longer. This is not something you want to do casually on a production system. There are options to build indexes as a background task and to set up a unique index, there are complications around indexing on replica sets, and much more. If you are running queries that rely on certain indexes to be performant, you can check to see if an index exists with db.employees.getIndexes(). You can also see a list of indexes as follows:

          > db.system.indexes.find()
          { "v" : 1, "key" : { "_id" : 1 }, "ns" : "tutorial.employees", "name" : "_id_" }

            By default all documents get an object id. If you do not give an object an _id, it will be assigned one by the system (like a criminal suspect gets a lawyer). You can use that _id to look up an object with findOne as follows:

            > db.employees.findOne({_id : ObjectId("4f964d3000b5874e7a163895")})
            { "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
               "gender" : "m", "phone" : "520-555-1212", "age" : 42 }
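A side note on those system-assigned ids: the first four bytes of an ObjectId are a creation timestamp in seconds since the Unix epoch, so you can recover roughly when a document was inserted from the id alone. The helper name below is made up for illustration, and the sketch deliberately uses only the standard library instead of a driver.

```python
from datetime import datetime, timedelta


def object_id_created_at(object_id_hex):
    """Decode the leading 4-byte timestamp of an ObjectId as a naive UTC datetime."""
    seconds = int(object_id_hex[:8], 16)  # first 8 hex digits = 4 bytes
    return datetime(1970, 1, 1) + timedelta(seconds=seconds)


created = object_id_created_at("4f964d3000b5874e7a163895")
print(created.isoformat())  # 2012-04-24T06:50:24
```

That is consistent with the "Mon Apr 23 23:50:24" allocation log line from the insert, if that machine's clock was seven hours behind UTC.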

            MongoDB Database: Architecture Replica Sets, Autosharding (part 1)

            posted Sep 27, 2012, 1:45 AM by Rick Hightower   [ updated Sep 30, 2012, 3:15 PM ]

            The model of MongoDB is such that you can start basic and use features as your growth/needs change without too much trouble or change in design. MongoDB uses replica sets to provide read scalability, and high availability. Autosharding is used to scale writes (and reads). Replica sets and autosharding go hand in hand if you need mass scale out. With MongoDB scaling out seems easier than traditional approaches as many things seem to come built-in and happen automatically. Less operation/administration and a lower TCO than other solutions seems likely. However you still need capacity planning (good guess), monitoring (test your guess), and the ability to determine your current needs (adjust your guess).

            Replica Sets

            The major advantages of replica sets are business continuity through high availability, data safety through data redundancy, and read scalability through load sharing (reads). Replica sets use a shared-nothing architecture. A fair bit of the brains of replica sets is in the client libraries, which are replica set aware. A language driver is a library for a particular programming language; think JDBC driver or ODBC driver, but for MongoDB. With replica sets, MongoDB language drivers know the current primary. All write operations go to the primary (also called the master), which then replicates the data to the secondaries (slaves) after the write. The primary is not fixed: it is elected. If the primary goes down, the drivers know how to get to the newly elected primary; this is auto failover for high availability.

            Typically you have at least three MongoDB instances in a replica set on different server machines (see figure 2). You can add more replicas of the primary if you like for read scalability, but you only need three for high availability failover. There is a way to sort of get down to two, but let's leave that out for this article, except for this small tidbit: there are advantages to having three versus two in general. If you have two instances and one goes down, the remaining instance carries 200% of its previous load (a 100% increase). If you have three instances and one goes down, the load on each remaining instance only goes up by 50%. If you typically run your boxes at 50% capacity and you have an outage, your remaining boxes will run at 75% capacity until you get the failed box repaired or replaced. If business continuity is your thing or important for your application, then having at least three instances in a replica set sounds like a good plan anyway (not all applications need it).
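The failover arithmetic above can be checked with a few lines of JavaScript. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes the load is spread evenly across all instances and is redistributed evenly to the survivors, which is a simplification.

```javascript
// Percentage increase in load on each surviving instance when `failed`
// of `total` equally loaded instances go down (assumes even redistribution).
function loadIncreasePercent(total, failed) {
  if (failed >= total) {
    throw new Error('no instances left to take the load');
  }
  var survivors = total - failed;
  return (total / survivors - 1) * 100;
}

// Utilization of each surviving instance, given the utilization
// (0.0 to 1.0) that every instance had before the outage.
function utilizationAfterFailure(total, failed, utilizationBefore) {
  return utilizationBefore * total / (total - failed);
}

console.log(loadIncreasePercent(2, 1));          // 100
console.log(loadIncreasePercent(3, 1));          // 50
console.log(utilizationAfterFailure(3, 1, 0.5)); // 0.75
```

With two instances the survivor's load doubles; with three, each survivor picks up half again as much, which is where the 50% to 75% capacity figure comes from.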


            Figure 2: Replica Sets

            MongoDB Replica Sets


            In general, once replication is set up it just works. However, in your monitoring, which is easy to do in MongoDB, you want to see how fast data is getting replicated from the primary (master) to replicas (slaves). The slower the replication is, the dirtier your reads are. The replication is by default async (non-blocking). Slave data and primary data can be out of sync for however long it takes to do the replication. There are already whole books written just on making Mongo scalable. If you work at a Foursquare-like company, or a company where high availability is very important, and you use Mongo, I suggest reading such a book.

            By default replication is non-blocking/async. This might be acceptable for some data (category descriptions in an online store), but not other data (a shopping cart's credit card transaction data). For important data, the client can block until data is replicated on all servers or written to the journal (journaling is optional). The client can force the master to sync to slaves before continuing. This sync blocking is slower. Async/non-blocking is faster and is often described as eventual consistency. Waiting for a master to sync is a form of data safety. There are several forms of data safety and options available in MongoDB, from syncing to at least one other server to waiting for the data to be written to a journal (durability). Here is a list of some data safety options for MongoDB:

            1. Wait until write has happened on all replicas
            2. Wait until write is on two servers (primary and one other)
            3. Wait until write has occurred on majority of replicas
            4. Wait until write operation has been written to journal

            (The above is not an exhaustive list of options.) The key word of each option above is wait. The more syncing and durability, the more waiting, and the harder it is to scale cost effectively.
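As a rough sketch, the four options listed above map onto the fields of the getLastError command that MongoDB versions of this era use to make a client wait. The helper below and its level names are hypothetical; it only builds the command document. The `w` and `j` fields are the real command fields as I understand them (`w` counts servers including the primary), but check the documentation for your server and driver version before relying on them.

```javascript
// Hypothetical helper: builds a getLastError command document for one of
// the four "wait" levels listed above. It does not talk to a server.
function getLastErrorCommand(level, replicaCount) {
  var cmd = { getlasterror: 1 };
  switch (level) {
    case 'all':      // 1. wait until the write is on every member
      cmd.w = replicaCount;
      break;
    case 'two':      // 2. wait for the primary and one other server
      cmd.w = 2;
      break;
    case 'majority': // 3. wait for a majority of the members
      cmd.w = 'majority';
      break;
    case 'journal':  // 4. wait until the write is in the journal
      cmd.j = true;
      break;
    default:
      throw new Error('unknown level: ' + level);
  }
  return cmd;
}

console.log(getLastErrorCommand('majority', 3));
```

In the mongo shell the resulting document would be passed to db.runCommand() right after the write it should wait for. Most language drivers expose the same idea as a write concern setting, so you rarely build the command by hand.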


              MongoDB versus MySQL and Oracle

              posted Sep 23, 2012, 3:05 PM by Rick Hightower   [ updated Sep 27, 2012, 1:49 AM ]

              MongoDB is document oriented but has many comparable concepts to traditional SQL/RDBMS solutions.

              1. Oracle: Schema, Tables, Rows, Columns
              2. MySQL: Database, Tables, Rows, Columns
              3. MongoDB: Database, Collections, Document, Fields
              4. MySQL/Oracle: Indexes
              5. MongoDB: Indexes
              6. MySQL/Oracle: Stored Procedures
              7. MongoDB: Stored JavaScript
              8. Oracle/MySQL: Database Schema
              9. MongoDB: Schema free!
              10. Oracle/MySQL: Foreign keys, and joins
              11. MongoDB: DBRefs, but mostly handled by client code
              12. Oracle/MySQL: Primary key
              13. MongoDB: ObjectID

              If you have used MySQL or Oracle here is a good guide to similar processes in MongoDB:

              Database Process Type   Oracle    MySQL         MongoDB
              Console Client          sqlplus   mysql         mongo
              Backup utility          sqlplus   mysqldump     mongodump
              Import utility          sqlplus   mysqlimport   mongoimport

              You can see a trend here. Where possible, MongoDB tries to follow the terminology of MySQL. They do this with console commands as well. If you are used to using MySQL, where possible, MongoDB tries to make the transition a bit less painful.


              SQL operations versus MongoDB operations

              MongoDB queries are similar in concept to SQL queries and use a lot of the same terminology. There is no special language or syntax to execute MongoDB queries; you simply assemble a JSON object. The MongoDB site has a complete set of example queries done in both SQL and MongoDB JSON docs to highlight the conceptual similarities. What follows are several small listings that compare MongoDB operations to SQL.



              INSERT INTO contacts (name, phone_number) VALUES ('Rick Hightower', '520-555-1212')
              db.contacts.insert({name:'Rick Hightower',phoneNumber:'520-555-1212'})


              SELECT name, phone_number FROM contacts WHERE age=30 ORDER BY name DESC
              db.contacts.find({age:30}, {name:1,phoneNumber:1}).sort({name:-1})
              SELECT name, phone_number FROM contacts WHERE age>30 ORDER BY name DESC
              db.contacts.find({age:{$gt:30}}, {name:1,phoneNumber:1}).sort({name:-1})

              Creating indexes

              CREATE INDEX contact_name_idx ON contacts(name DESC)
              db.contacts.ensureIndex({name:-1})


              UPDATE contacts SET phone_number='415-555-1212' WHERE name='Rick Hightower'
              db.contacts.update({name:'Rick Hightower'}, {$set:{phoneNumber:'415-555-1212'}}, false, true)
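To make the "you simply assemble a JSON object" idea concrete, here is a tiny in-memory sketch of how a query document, a projection document and a sort document could be interpreted. It is an illustration only and not how MongoDB is implemented: the functions are made up, only plain equality and $gt are handled, and the sample contacts are invented. (Real MongoDB also returns _id by default, which this sketch ignores.)

```javascript
// matches() supports only plain equality and the $gt operator.
function matches(doc, query) {
  return Object.keys(query).every(function (field) {
    var cond = query[field];
    if (cond !== null && typeof cond === 'object' && '$gt' in cond) {
      return doc[field] > cond.$gt;
    }
    return doc[field] === cond;
  });
}

// project() keeps only the fields flagged with 1 in the projection document.
function project(doc, fields) {
  var out = {};
  Object.keys(fields).forEach(function (field) {
    if (fields[field] === 1 && field in doc) {
      out[field] = doc[field];
    }
  });
  return out;
}

// find() filters, sorts on one field (1 ascending, -1 descending), then projects.
function find(docs, query, fields, sortSpec) {
  var sortField = Object.keys(sortSpec)[0];
  var direction = sortSpec[sortField];
  return docs
    .filter(function (doc) { return matches(doc, query); })
    .sort(function (a, b) {
      if (a[sortField] === b[sortField]) return 0;
      return (a[sortField] < b[sortField] ? -1 : 1) * direction;
    })
    .map(function (doc) { return project(doc, fields); });
}

var contacts = [
  { name: 'Ann', phoneNumber: '520-555-0001', age: 30 },
  { name: 'Bob', phoneNumber: '520-555-0002', age: 45 },
  { name: 'Cid', phoneNumber: '520-555-0003', age: 31 }
];

// Same shape as: db.contacts.find({age:{$gt:30}}, {name:1,phoneNumber:1}).sort({name:-1})
var older = find(contacts, { age: { $gt: 30 } }, { name: 1, phoneNumber: 1 }, { name: -1 });
console.log(older);
```

The point is that the WHERE clause, the column list and the ORDER BY clause each become an ordinary object, so queries can be built and passed around like any other data.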


              Additional features of note

              MongoDB has many useful features like Geo Indexing (How close am I to X?), distributed file storage, capped collection (older documents auto-deleted), aggregation framework (like SQL projections for distributed nodes without the complexities of MapReduce for basic operations on distributed nodes), load sharing for reads via replication, auto sharding for scaling writes, high availability, and your choice of durability (journaling) and/or data safety (make sure a copy exists on other servers).


              MongoDB Caveats and Warnings

              posted Sep 23, 2012, 3:00 PM by Rick Hightower   [ updated Sep 23, 2012, 3:06 PM ]


              MongoDB indexes may not be as flexible as those of Oracle/MySQL/Postgres or even other NoSQL solutions. The order of the fields in an index matters, as it uses B-Trees. Realtime queries might not be as fast as in Oracle/MySQL and other NoSQL solutions, especially when dealing with array fields, and the query plan optimization work is not as far along as in more mature solutions. You can make sure MongoDB is using the indexes you set up quite easily with an explain function. Not to scare you: MongoDB is good enough if queries are simple and you do a little homework, and it is always improving.

              MongoDB does not have or integrate a full text search engine like many other NoSQL solutions do (many use Lucene under the covers for indexing), although it seems to support basic text search better than most traditional databases.

              Every version of MongoDB seems to add more features and address shortcomings of the previous releases. MongoDB added journaling a while back so it can have single server durability; prior to this you really needed a replica or two to ensure some level of data safety. 10gen improved Mongo's replication and high availability with Replica Sets.

              Another issue with current versions of MongoDB (2.0.5) is lack of concurrency due to MongoDB having a global read/write lock, which allows reads to happen concurrently while write operations happen one at a time. There are workarounds for this involving shards and/or replica sets, but they are not always ideal and do not fit the "it should just work" mantra of MongoDB. Recently at the MongoSF conference, Dwight Merriman, co-founder and CEO of 10gen, discussed the concurrency internals of MongoDB v2.2 (future release). Dwight explained that MongoDB 2.2 went through a major refactor to add database level concurrency, and will soon have collection level concurrency now that the hard part of the concurrency refactoring is done. Also keep in mind that writes are in RAM and eventually get synced to disk, since MongoDB uses memory mapped files. Writes are not as expensive as if you were always waiting to sync to disk. Speed can mitigate concurrency issues.

              This is not to say that MongoDB will never have some shortcomings and engineering tradeoffs. Also, you can, will and sometimes should combine MongoDB with a relational database or a full text search engine like Solr/Lucene for some applications. For example, if you run into issues with effectively building indexes to speed up some queries, you might need to combine MongoDB with Memcached. None of this is completely foreign though, as it is not uncommon to pair an RDBMS with Memcached or Lucene/Solr. When to use MongoDB and when to develop a hybrid solution is beyond the scope of this article. In fact, when to use a SQL/RDBMS or another NoSQL solution is beyond the scope of this article, but it would be nice to hear more discussion on this topic.

              The price you pay for MongoDB, one of the youngest but perhaps best managed NoSQL solutions, is lack of maturity. It does not have a code base going back three decades like RDBMS systems. It does not have tons and tons of third party management and development tools. There have been issues, there are issues and there will be issues, but MongoDB seems to work well for a broad class of applications, and is rapidly addressing many issues.

              Also, finding skilled developers and ops people (admins, devops, etc.) who are familiar with MongoDB or other NoSQL solutions might be tough. Somehow MongoDB seems to be the most approachable, or perhaps just the best marketed. On projects I have worked on that used large NoSQL deployments, few people on the team really understood the product (limitations, etc.), which leads to trouble.

              In short if you are kicking the tires of NoSQL, starting with MongoDB makes a lot of sense.


              MongoDB Database as a Gateway Drug to NoSQL

              posted May 25, 2012, 2:21 PM by Rick Hightower   [ updated Sep 30, 2012, 3:19 PM ]

              MongoDB's combination of features, simplicity, community, and documentation makes it successful. The product itself has high availability, journaling (which is not always a given with NoSQL solutions), replication, auto-sharding, map reduce, and an aggregation framework (so you don't have to use map-reduce directly for simple aggregations). MongoDB can scale reads as well as writes.

              MongoDB is a good way to answer that question that has been nagging you, namely, What is NoSQL?

              NoSQL, in general, has been reported to be more agile than full RDBMS/SQL due to problems with schema migration of SQL based systems. Having been on large RDBMS systems and witnessing the trouble and toil of doing SQL schema migrations, I can tell you that this is a real pain to deal with. RDBMS/SQL often requires a lot of upfront design or a lot of schema migration later. In this way, NoSQL is viewed as more agile in that it allows the application to worry about differences in versioning instead of forcing schema migration and larger upfront designs. The MongoDB crowd says that MongoDB has a dynamic schema, not no schema (sort of like the dynamic language versus untyped language argument from Ruby, Python, etc. developers).
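As a small illustration of the application worrying about versioning, suppose (hypothetically) that older contact documents stored the number under "phone" while newer ones use "phoneNumber". Instead of migrating every stored document, the application can normalize on read. The function below is a made-up sketch of that idea, not a MongoDB API.

```javascript
// Hypothetical reader that accepts both document shapes, so old and new
// documents can live side by side in the same collection.
function normalizeContact(doc) {
  return {
    name: doc.name,
    // Assumed older shape used "phone"; assumed newer shape uses "phoneNumber".
    phoneNumber: doc.phoneNumber !== undefined ? doc.phoneNumber : doc.phone
  };
}

var oldStyle = { name: 'Rick Hightower', phone: '520-555-1212', age: 42 };
var newStyle = { name: 'Rick Hightower', phoneNumber: '415-555-1212' };

console.log(normalizeContact(oldStyle).phoneNumber); // 520-555-1212
console.log(normalizeContact(newStyle).phoneNumber); // 415-555-1212
```

The tradeoff is that this versioning logic now lives in application code and has to be kept in every reader, rather than being done once in a schema migration.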

              MongoDB does not seem to require a lot of ramp up time. Its early success may be attributed to the quality and ease-of-use of its client drivers, which were more of an afterthought for other NoSQL solutions ("Hey here is our REST or XYZ wire protocol, deal with it yourself"). Compared to other NoSQL solutions, it has been said that MongoDB is easier to get started with. Also, with MongoDB many DevOps things come cheaply or free. This is not to say that there are never any problems or that one should not do capacity planning. MongoDB has become for many an easy on ramp for NoSQL, a gateway drug if you will.

              MongoDB was built to be fast. Speed is a good reason to pick MongoDB. Raw speed shaped the architecture of MongoDB. Data is stored in memory using memory mapped files. This means that the virtual memory manager, a very highly optimized system function of modern operating systems, does the paging/caching. MongoDB also pads areas around documents so that they can be modified in place, making updates less expensive. MongoDB uses a binary protocol instead of REST like some other implementations. Also, data is stored in a binary format instead of text (JSON, XML), which could speed up writes and reads.

              Another reason MongoDB may do well is that it is easy to scale out reads and writes with replica sets and autosharding. You might expect that if MongoDB is so great there would be a lot of big names using it, and there are: MTV, Craigslist, Disney, Shutterfly, Foursquare, The New York Times, Barclays, The Guardian, SAP, Forbes, National Archives UK, Intuit, GitHub, LexisNexis and many more.


              Introduction to NoSQL Architecture with MongoDB Database (What is NoSQL?)

              posted May 25, 2012, 12:04 PM by Rick Hightower   [ updated Sep 30, 2012, 3:17 PM ]


              Using MongoDB is a good way to get started with NoSQL and find out what NoSQL is. Learning MongoDB introduces concepts that are common in other NoSQL solutions.

              What is NoSQL? From no NoSQL to sure why not

              The first time I heard of something that actually could be classified as NoSQL was from Warner Onstine, who is currently working on some CouchDB articles for InfoQ. Warner was going on and on about how great CouchDB was. This was before the term NoSQL was coined. I was skeptical, and had just been on a project that was converted from an XML Document Database back to Oracle due to issues with the XML Database implementation. I did the conversion. I did not pick the XML Database solution, or decide to convert it to Oracle. I was just the consultant guy on the project (circa 2005) who did the work after the guy who picked the XML Database moved on and the production issues started to happen.

              This was my first document database. The experience fed my skepticism and distrust of databases that were not established RDBMS (Oracle, MySQL, etc.), but this incident alone did not create the skepticism. Let me explain.

              First there were all of the Object Oriented Database (OODB) folks for years preaching how it was going to be the next big thing. It has not happened yet. I hear 2013 will be the year of the OODB, just like 1997 was going to be. Then there were the XML Database people preaching something very similar, which did not seem to happen either, at least not at the pervasive scale at which NoSQL is happening.

              My take was: ignore this document oriented approach and NoSQL, and see if it goes away. To be successful, it needs some community behind it, some clear use case wins, and some corporate muscle/marketing, and I will wait until then. Sure, the big guys need something like Dynamo and BigTable, but it is a niche, I assumed. Then there was BigTable, MapReduce, Google App Engine, and Dynamo in the news with white papers. Then Hadoop, Cassandra, MongoDB, Membase, HBase, and the constant small but growing drum beat of change and innovation. Even skeptics have limits.

              Then in 2009, Eric Evans coined the term NoSQL to describe the growing list of open-source distributed databases. Now there is this NoSQL movement, three years in and counting. Like Ajax, giving something a name seems to inspire its growth, or perhaps we don't name movements until there is already a groundswell. Either way, having a name like NoSQL with a common vision is important to changing the world, and you can see the community, use case wins, and corporate marketing muscle behind NoSQL. It has gone beyond the buzz stage. 2009 was also the year of the first project I worked on that had mass scale out requirements and used something that is classified as part of NoSQL.

              MongoDB was released by 10gen in 2009, when the NoSQL movement was in full swing. Somehow MongoDB managed to move to the front of the pack in terms of mindshare, followed closely by Cassandra and others (see figure 1). MongoDB is listed as the #2 top job trend (behind HTML 5 and ahead of iOS), which is fairly impressive given MongoDB was a relatively late comer to the NoSQL party.


              Figure 1: What is NoSQL? : MongoDB leads the NoSQL pack

              MongoDB takes early lead in NoSQL adoption race.

              MongoDB is a distributed, document-oriented, schema-less storage solution. Its closest cousins seem to be CouchDB and Couchbase. MongoDB uses JSON-style documents to represent, query and modify data. Internally data is stored in BSON (binary JSON). MongoDB supports many clients/languages, namely, Python, PHP, Java, Ruby, C++, etc. This article is going to introduce key MongoDB concepts and then show basic CRUD/Query examples in JavaScript (part of MongoDB console application), Java, PHP and Python.

              Disclaimer: I have no ties with the MongoDB community and no vested interests in their success or failure. I am not an advocate. I merely started to write about MongoDB because they seem to be the most successful, seem to have the most momentum for now, and in many ways typify the very diverse NoSQL market. MongoDB's success is largely due to having easy-to-use, familiar tools. I'd love to write about CouchDB, Cassandra, Couchbase, Redis, HBase or any number of NoSQL solutions if there were just more hours in the day, or stronger coffee, or if coffee somehow extended how much time I had. Redis seems truly fascinating.

              MongoDB seems to have the right mix of features and ease-of-use, and has become a prototypical example of what a NoSQL solution should look like. MongoDB can be used as a sort of base of knowledge to understand other solutions (compare/contrast). This article is not an endorsement. That said, if you want to get started with NoSQL, MongoDB is a great choice.

