Necessity is the mother of innovation!
This is an old proverb, but it still holds good!
The last decade has pushed the boundaries of data generation, storage and analysis to an entirely new level. This push towards a digital, data-driven economy has created its own needs and problems. These problems and their solutions are typically combined under the umbrella of Big Data.
Imagine this: Facebook and Google combined generate more data today than the entire world would have generated a few years back. With this increase in data generation comes the problem of data storage and scaling. All of us want our Facebook feeds to load instantaneously and hate the waiting time, but imagine the architecture you need to deliver that experience. Millions of users making simultaneous queries into your database in real time…phew! Add to this the unstructured nature of the data and the need for a system where you can add new features quickly, and this starts to look like a Herculean task.
Traditional databases find it hard to cope with these requirements, and the cost of scaling up becomes prohibitive! In this article, we’ll focus on one such innovation in data storage systems, popularly known as MongoDB. It provides schema-less design, high performance, high availability, and automatic scaling, qualities which have now become a need and cannot be satisfactorily met by traditional RDBMS systems.
According to Wikipedia:
MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.
MongoDB is used across several companies in multiple domains (some of them shown below):
What can you learn from this guide?
In this guide, we’ll start by understanding the basic structural aspects powering MongoDB. The idea is to understand how MongoDB works. Specifically, we will look at these aspects:
- Data Model
- GridFS
- Sharding
- Data partitioning
- Aggregations
- Indexes
- Replication
We will also compare Traditional RDBMS vs NoSQL Databases to give you a better understanding of which works better, followed by the advantages and limitations of MongoDB.
Once we have a fair understanding of how MongoDB works, we will provide a step-by-step guide to its installation. In the second part of this series, we will connect MongoDB to our analytics tools to provide a demo. For now, let’s start by understanding how MongoDB works.
Structural aspects of MongoDB
Let’s now understand the structural aspects of MongoDB in the order stated above:
1. Data Model
MongoDB stores data in the form of BSON (binary-encoded JSON) documents, which support a rich collection of types. Fields in BSON documents may hold arrays of values or embedded documents. In MongoDB, the database construct is a group of related collections. Each database has a distinct set of data files and can contain a large number of collections. A single MongoDB deployment may have many databases.
What is a ‘document’ in MongoDB?
A record in MongoDB is a document (shown below), which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents. This is an important difference from RDBMS systems, where each field must contain only one value.
What are ‘collections’ in MongoDB?
MongoDB stores documents in collections (shown below). Collections are analogous to tables in relational databases. In an RDBMS, every row in a table must follow the same schema, but in MongoDB the documents in a collection need not share a schema. This schema-less design is one of the reasons MongoDB is so widely used among NoSQL databases. However, every document stored in a collection must have a unique _id field that acts as a primary key.
Documents in a collection can either be stored in normalized form or be embedded into another document. Let’s understand the difference in detail:
a) Normalized Data Models
The relationships between data are stored as links (references) from one document to another (shown below). These references are resolved by the application to fetch the related data.
b) Embedded Data Models
Embedded documents store relationships between data by storing related data in a single document structure (shown below). These denormalized data models allow applications to retrieve and manipulate related data in a single database operation.
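To make the contrast concrete, here is a minimal sketch in plain Python that uses dictionaries as stand-ins for documents. It does not use a MongoDB driver, and the collection names, field names and values are all hypothetical. It only illustrates that a normalized model needs an extra lookup by the application, while an embedded model returns the related data in one read.

```python
# Plain-Python stand-ins for two collections; no MongoDB driver is used.
# All names and values here are hypothetical.

# a) Normalized: the order refers to the customer by its _id.
customers = {
    101: {"_id": 101, "name": "Asha", "city": "Pune"},
}
orders_normalized = [
    {"_id": 1, "customer_id": 101, "item": "keyboard"},
]


def resolve_customer(order, customers):
    """Application-side lookup: a second read follows the reference."""
    return customers[order["customer_id"]]


# b) Embedded: the related data lives inside the same document.
orders_embedded = [
    {"_id": 1, "item": "keyboard",
     "customer": {"name": "Asha", "city": "Pune"}},
]

normalized_city = resolve_customer(orders_normalized[0], customers)["city"]
embedded_city = orders_embedded[0]["customer"]["city"]
print(normalized_city, embedded_city)
```

The trade-off in this sketch is the usual one: the embedded form is read in a single step, but the customer data is copied into every order that embeds it.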
2. GridFS
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
Instead of storing a file in a single document, GridFS divides a file into parts, and stores each part as a separate document. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata (shown below).
When we query a GridFS store for a file, the client reassembles the chunks as needed. Information can also be accessed from any random section/s of files. This feature is what basically allows for “skipping” into the middle of a video or audio file.
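The chunking idea can be sketched in a few lines of plain Python. This is not the GridFS implementation: the two return values merely play the role of the metadata collection and the chunk collection, the field names only loosely follow GridFS, and the 4-byte chunk size is an assumption chosen to keep the example readable.

```python
CHUNK_SIZE = 4  # bytes; tiny on purpose, real chunk sizes are far larger


def store_file(file_id, filename, data, chunk_size=CHUNK_SIZE):
    """Split data into chunk documents plus one metadata document."""
    files_doc = {"_id": file_id, "filename": filename,
                 "length": len(data), "chunkSize": chunk_size}
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def read_range(files_doc, chunk_docs, start, end):
    """Return bytes [start, end) using only the chunks that overlap the range."""
    size = files_doc["chunkSize"]
    first, last = start // size, (end - 1) // size
    needed = sorted((c for c in chunk_docs if first <= c["n"] <= last),
                    key=lambda c: c["n"])
    joined = b"".join(c["data"] for c in needed)
    offset = start - first * size
    return joined[offset:offset + (end - start)]


files_doc, chunk_docs = store_file(1, "clip.bin", b"abcdefghijkl")
middle = read_range(files_doc, chunk_docs, 5, 9)
print(len(chunk_docs), middle)
```

Reading bytes 5 to 8 touches only the second and third chunks, which is the same mechanism that lets a player skip into the middle of a media file.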
3. Sharding
Database systems with large data sets and high throughput applications can challenge the capacity of a single server in multiple ways such as:
- High query rates put stress on the CPU capacity of the server.
- Larger data sets exceed the storage capacity of a single machine.
- Dataset sizes larger than the system’s RAM stress the I/O capacity of disk drives.
To address these issues of scale, database systems have two basic approaches:
- Vertical Scaling
- Sharding or Horizontal Scaling
a) Vertical scaling: It adds more CPU and storage resources to increase capacity. But such arrangements are disproportionately expensive. As a result there is a practical maximum capability for vertical scaling.
b) Sharding or Horizontal Scaling: By contrast, it divides the data set and distributes the data over multiple servers, called shards. Each shard is an independent database, and collectively the shards make up a single logical database.
MongoDB supports sharding through the configuration of sharded clusters. The process of sharding is explained in the image below, where:
- Shards are used to store the data.
- Query Routers, or mongos instances, interface with client applications, direct operations to the appropriate shard or shards, and then return results to the clients.
- Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards.
4. Data partitioning
MongoDB distributes data at the collection level. Sharding partitions a collection’s data by the shard key.
What is a shard key?
A shard key is either an indexed field or an indexed compound field that exists in every document in the collection. MongoDB divides the shard key values into chunks and distributes the chunks evenly across the shards. To divide the shard key values into chunks, MongoDB uses either range based partitioning or hash based partitioning.
a) Range Based Sharding
Consider a numeric shard key: If you visualize a number line that goes from negative infinity to positive infinity, each value of the shard key falls at some point on that line. MongoDB partitions this line into smaller, non-overlapping ranges called chunks. Each chunk is a range of values from some minimum value to some maximum value (shown below).
In a range based partitioning system, documents with “close” shard key values are most probably in the same chunk, and thus on the same shard.
b) Hash Based Sharding:
For hash based partitioning, MongoDB computes a hash of a field’s value and then uses these hashes to create chunks (shown below). A hash value is a numeric value of a fixed length derived from the data, so it can represent a large amount of data as a much smaller numeric value.
With hash based partitioning, two documents with “close” shard key values are unlikely to be part of the same chunk. This ensures a more random distribution of a collection in the cluster.
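The difference between the two strategies can be sketched in plain Python. The split points, the number of chunks and the use of MD5 are assumptions made for illustration; this is not how MongoDB itself assigns chunks, only a model of the idea.

```python
import bisect
import hashlib

# Hypothetical split points giving three range chunks:
# (-inf, 100), [100, 200), [200, +inf)
RANGE_SPLIT_POINTS = [100, 200]
NUM_HASH_CHUNKS = 4


def range_chunk(key):
    """Index of the range chunk that contains key."""
    return bisect.bisect_right(RANGE_SPLIT_POINTS, key)


def hash_chunk(key, num_chunks=NUM_HASH_CHUNKS):
    """Chunk chosen from a hash of the key (MD5 is only a stand-in here)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


keys = [101, 102, 103, 250]
range_assignment = {k: range_chunk(k) for k in keys}
hash_assignment = {k: hash_chunk(k) for k in keys}
print(range_assignment)
print(hash_assignment)
```

With the range strategy, the neighbouring keys 101, 102 and 103 all land in the same chunk, which suits range queries but can concentrate load. With the hash strategy, the chunk depends only on the hash, so neighbouring keys are typically scattered.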
5. Aggregations
Aggregations are operations that process data records and return computed results. Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents. MongoDB offers three ways to aggregate data: the aggregation pipeline, MapReduce, and single purpose aggregation operations.
What is an Aggregation Pipeline?
An aggregation pipeline is a series of document transformations which are executed in stages. The original input is a collection, whereas the output can be a document, a cursor or a collection (shown below).
The most basic pipeline stages provide filters that operate like queries and document transformations that modify the form of the output document.
Other pipeline operations provide tools for grouping and sorting documents by specific field or fields as well as tools for aggregating the contents of arrays, including arrays of documents. In addition, pipeline stages can use operators for tasks such as calculating the average or concatenating a string.
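The staged nature of a pipeline can be illustrated with ordinary Python functions, each taking a list of documents and returning a new list. The function names, the sample collection and its fields are hypothetical, and the sketch models only the idea of chaining a filter stage, a grouping stage and a sorting stage, not MongoDB's own stage syntax.

```python
from collections import defaultdict

# Hypothetical input collection.
orders = [
    {"customer": "A", "status": "paid", "amount": 50},
    {"customer": "B", "status": "paid", "amount": 30},
    {"customer": "A", "status": "paid", "amount": 20},
    {"customer": "B", "status": "open", "amount": 99},
]


def match(docs, field, value):
    """Filter stage: keep documents whose field equals value."""
    return [d for d in docs if d.get(field) == value]


def group_sum(docs, key_field, sum_field):
    """Group stage: one output document per key, with a total."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return [{"_id": k, "total": v} for k, v in totals.items()]


def sort_desc(docs, field):
    """Sort stage: order documents by field, largest first."""
    return sorted(docs, key=lambda d: d[field], reverse=True)


# Stages run in order; each consumes the previous stage's output.
paid = match(orders, "status", "paid")
per_customer = group_sum(paid, "customer", "amount")
result = sort_desc(per_customer, "total")
print(result)
```

Filtering first keeps the later stages small, which is why a filter stage is usually placed as early in a pipeline as possible.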
MapReduce is a powerful and flexible tool for aggregating data. It can solve some problems that are too complex to express using the aggregation framework’s query language.
It splits up a problem, sends chunks of it to different machines, and lets each machine solve its part of the problem. When all the machines are finished, all the pieces of the solution are merged back into a full solution.
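The split, solve and merge steps can be sketched in plain Python as follows. The partitions, the tag-counting task and the function names are hypothetical, and the partitions are processed in a simple loop here rather than on separate machines.

```python
from collections import defaultdict


def map_phase(doc):
    """Emit (key, value) pairs for one document: one pair per tag."""
    return [(tag, 1) for tag in doc["tags"]]


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return sum(values)


def map_reduce(partitions):
    """Map every partition, group emitted pairs by key, then reduce."""
    grouped = defaultdict(list)
    for partition in partitions:  # each could live on a different machine
        for doc in partition:
            for key, value in map_phase(doc):
                grouped[key].append(value)
    return {key: reduce_phase(key, values) for key, values in grouped.items()}


# Two hypothetical partitions of the same collection.
partitions = [
    [{"tags": ["db", "nosql"]}, {"tags": ["db"]}],
    [{"tags": ["nosql", "json"]}],
]
tag_counts = map_reduce(partitions)
print(tag_counts)
```

Because the map step looks at one document at a time and the reduce step at one key at a time, both can be distributed without the pieces needing to know about each other.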
b) Single Purpose Aggregation Operations
MongoDB provides special purpose database commands for a number of common single purpose aggregation operations, such as returning a count of matching documents, returning the distinct values for a field, and grouping data based on the values of a field.
All of these operations aggregate documents from a single collection. Though these operations provide simple access to common aggregation processes, they lack the flexibility and capabilities of the aggregation pipeline and MapReduce.
6. Indexes
Indexes are special data structures that store a small portion of the collection’s data set in an easy to traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field.
The ordering of the index entries supports efficient equality matches and range-based query operations. In addition, MongoDB can return sorted results by using the ordering in the index. The following diagram illustrates a query that selects and orders the matching documents using an index:
Indexes are used for better query performance. They are created on fields which appear often in queries and for operations that return sorted results. MongoDB automatically creates a unique index on the _id field. Indexes have the following properties in MongoDB:
- Each index requires at least 8KB of data space.
- Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
- Collections with high read-to-write ratio often benefit from additional indexes.
- When active, each index consumes disk space and memory. This usage grows over time and can become significant. For better server space and performance management, it is good practice to track the growth of indexes.
Indexes support the efficient execution of queries. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
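A rough way to picture this is a sorted list of (value, position) pairs searched with binary search. The sketch below is plain Python with a hypothetical collection and field; MongoDB's real index structures are more sophisticated, so treat this only as a model of why an ordered index narrows the search and returns results already sorted.

```python
import bisect

# Hypothetical collection; list position stands in for a document's location.
docs = [
    {"_id": 1, "score": 70},
    {"_id": 2, "score": 30},
    {"_id": 3, "score": 90},
    {"_id": 4, "score": 55},
]


def build_index(docs, field):
    """Sorted list of (field value, position) pairs."""
    return sorted((d[field], pos) for pos, d in enumerate(docs))


def range_query(docs, index, low, high):
    """Documents with low <= value <= high, in index order.

    Also returns how many index entries were inspected. The key list is
    rebuilt on every call only to keep the sketch short.
    """
    keys = [value for value, _ in index]
    start = bisect.bisect_left(keys, low)
    stop = bisect.bisect_right(keys, high)
    hits = [docs[pos] for _, pos in index[start:stop]]
    return hits, stop - start


score_index = build_index(docs, "score")
hits, inspected = range_query(docs, score_index, 50, 80)
print([d["_id"] for d in hits], inspected)
```

Only two of the four entries are inspected, and the matches come back ordered by score without a separate sort step.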
7. Replication
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server and allows for recovery from hardware failure and service interruptions.
What is a replica set?
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set (shown below).
The primary accepts all write operations from clients. A replica set can have only one primary. To support replication, the primary records all changes to its data sets in its oplog (operations log).
The secondaries replicate the primary’s oplog and apply the operations to their data sets such that the secondaries’ data sets reflect the primary’s data set. If the primary is unavailable, the replica set will elect a secondary to be primary. When a primary does not communicate with the other members of the set for more than 10 seconds, the replica set will attempt to select another member to become the new primary. The first secondary that receives a majority of votes becomes the primary (shown below).
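The oplog mechanism and the majority rule can be modelled in a few lines of plain Python. The class, function names and member names are hypothetical, and the sketch leaves out networking, heartbeats and timeouts entirely; it shows only that secondaries catch up by replaying the primary's log and that a candidate needs a strict majority of votes.

```python
class Member:
    """One replica set member: a data set plus the oplog position applied."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0


def write(primary, oplog, key, value):
    """The primary applies the write and records it in its oplog."""
    primary.data[key] = value
    oplog.append((key, value))
    primary.applied = len(oplog)


def replicate(secondary, oplog):
    """A secondary applies the oplog entries it has not seen yet."""
    for key, value in oplog[secondary.applied:]:
        secondary.data[key] = value
    secondary.applied = len(oplog)


def wins_election(votes_received, voting_members):
    """A candidate becomes primary only with a strict majority of votes."""
    return votes_received > voting_members // 2


oplog = []
primary, secondary = Member("m1"), Member("m2")
write(primary, oplog, "x", 1)
write(primary, oplog, "y", 2)
replicate(secondary, oplog)
print(secondary.data, wins_election(2, 3))
```

The strict majority requirement is what prevents two halves of a partitioned set from each electing their own primary.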
COMPARISON: Traditional RDBMS vs NoSQL Databases
Comparing SQL and MongoDB is like comparing a tiger with a lion. Both are predators, but one hunts alone and the other in packs.
SQL (tiger) has a rigid data model which needs data to conform to the design of the schema. It is useful for organizing structured data like sales statistics. On the other hand, MongoDB (lion) is a document oriented database, which stores data in the form of documents. Though their approaches are different, both are required for data storage and the selection of the database type depends rather on the organizational need.
What are the advantages of using MongoDB ?
As you can see from the above representation, when the number of queries hitting the server increases, MongoDB is a clear winner. MongoDB is typically used for real-time analytics, where low latency and very high availability are required.
MongoDB has come to the forefront because of the need of organizations to analyze semi-structured, unstructured and geo-spatial data and because the structure of data is rapidly changing in today’s world. Traditional RDBMS systems are unable to cope with these demands fully as their inherent structure does not allow them to do so.
Though changes are being made in RDBMS systems too, to cope with the explosion of data, databases like MongoDB with their document structure are best suited for dealing with today’s data.
What are the limitations of MongoDB?
MongoDB has some limitations which are listed below.
- Max document size is 16 MB.
- Max document nesting level: 100 (documents inside documents inside documents).
- Indexed field can’t contain more than 1024 bytes.
- Max 64 indexes per collection.
- Max 31 fields can be used to create a compound index.
- Full-text search and geo indexes are mutually exclusive.
- Limit of documents in a capped collection can’t be more than 2**32. Otherwise, number of documents is unlimited.
- On Windows, MongoDB can’t store more than 4 TB of data (8 TB without journal).
- Max 12 nodes in a replica set.
- Max 7 voting nodes in a replica set.
- To roll back more than 300 MB of data, manual intervention is needed.
- Group command doesn’t work in sharded cluster.
- $isolated, $snapshot, geoSearch don’t work in a sharded cluster.
- You can’t refer to db object in $where
- For sharding a collection it must be less than 256 GB.
- Individual (not multi) updates/removes in a sharded cluster must include shard key. Multi versions of these commands may not include shard key.
- Max 512 bytes for shard key values.
- Shard key values of a collection cannot be changed once sharding is done.
Apart from these, the protection against accidental deletion of records that constraints provide in RDBMS systems cannot be implemented in MongoDB or other NoSQL systems. There can also be other problems, like the one shown below, when storing multi-layered data without normalization:
A user has friends who might be users themselves. People who have liked or commented, or both, can again be users themselves. This type of duplication makes it much harder to denormalize an activity stream into a single document.
MongoDB has its fair share of limitations and disadvantages and, just like any other technology, these will hopefully be removed with improvements.
Installation of Mongo & its admin GUI:
Follow the 7 steps below and complete the installation process of MongoDB:
Step 1: Download MongoDB from MongoDBDownload. Click Download and save it on your machine. You can also select the version according to the OS you use.
Step 2: In case of Windows, locate the downloaded MongoDB .msi file, which typically is located in the default Downloads folder. Double-click the .msi file. A set of screens will appear to guide you through the installation process.
Setup the MongoDB environment:
Step 3: MongoDB requires a data directory to store all data. Its default data directory path is \data\db. Create this folder using the following command from a Command Prompt: md \data\db
By default, this folder gets created in the C: drive.
Step 4: Navigate to the bin folder where the mongod.exe file is located and run the following command in the cmd “C:\Program Files\MongoDB\Server\3.0\bin\mongod.exe”. This should give an output as shown below:
The waiting for connections message indicates MongoDB is running successfully.
Notice the part highlighted in white; if you do not get this message, it means you haven’t downloaded and installed the hotfix prior to running MongoDB.
Connect to MongoDB:
Step 5: To connect to MongoDB, open another command prompt window and type:
Note: The path is the location of mongo.exe file.
This should give the following message in the cmd window(mongo shell):
Step 6: Download NoSQL Manager for MongoDB from MongoDBManager. This is much like SQL server management studio and I will use this for the purpose of illustration in the article.
Step 7: Click on localhost. This should establish a connection with the instance of MongoDB, and the interface will look as shown below:
More admin GUIs can be found at: MongoDB admin GUI
With this we complete the installation of MongoDB and its admin GUI.
The structural components of MongoDB, like data storage in the form of documents and collections, sharding, replication etc., make it one of the most widely used NoSQL databases today. MongoDB also has APIs for connecting with programming languages like Perl, Ruby, Python and R, which further makes it attractive to developers and analysts alike. We will be sharing some of these details in one of the future posts.
If you want to learn more about MongoDB you can consider Data Wrangling With MongoDB from Udacity. This will require knowledge of Python.
References : MongoDB manual.
MongoDB Commands Cheat Sheet
In this article, we briefly discuss the commands available in MongoDB v3.4.7 (the current stable version as of this write-up). This reference might not hold completely for earlier or future versions of MongoDB. If any of the commands do not seem to work, consult the official MongoDB documentation or a cheat sheet specific to the version of MongoDB that you are working with.
MongoDB Commands Listing (Command Line Options):
There are numerous options provided by MongoDB but in this section, let us discuss the most common options of MongoDB when used in conjunction with the MongoDB shell startup, to work with MongoDB database server seamlessly.
|S.No.||Command Line Option||Description|
|1||--help||Lists all the available options that can be used while starting up the MongoDB shell.|
|2||--nodb||Starts the MongoDB shell without connecting to any database.|
|3||--shell||Starts the shell after running any specified *.js files.|
|4||--version||Prints the version information of the MongoDB shell.|
|5||--quiet||Starts the MongoDB shell with less chatty output.|
MongoDB Commands Listing (Command Helpers):
There are various command helpers available for the MongoDB shell (mongo). The table below talks about the most commonly used help commands.
|S.No.||Help Commands and Commands||Description|
|1||help||Shows help related information on the MongoDB shell.|
|2||db.help()||Shows help related information on the database methods.|
|3||db.<collection>.help()||Shows help related information on collection methods; <collection> can be an existing collection or not.|
|4||show dbs||Lists all the databases available for use on the connected MongoDB instance.|
|5||show databases||Lists all the databases available for use on the connected MongoDB instance.|
|6||use <db>||Switches the MongoDB shell to the database given as the parameter and sets the shell variable db to it.|
|7||show collections||Lists all the collections available for use on the current database.|
|8||show users||Lists all the users available on the current database.|
|9||show roles||Lists all the roles (both built-in roles and user-defined roles) on the current database.|
|10||show profile||Lists the five most recent operations that took 1 millisecond or more.|
|11||load()||Executes a specified JavaScript file.|
MongoDB Commands Listing (Administrative Command Helpers):
The sections above provide helpful commands that get you started with MongoDB and cover some basic tasks. This section focuses on administrative tasks on databases created within a MongoDB database server. The most commonly used commands are presented in the table below:
|S.No.||JS Database Administration Method||Description|
|1||db.cloneDatabase(<host>)||This command clones the current database from the <host> mentioned in the command. The only requirement for this to work is that the database instance runs in noauth mode.|
|2||db.copyDatabase(<fromdb>, <todb>, <fromhost>)||This command copies the database named in <fromdb> from the server named in <fromhost> to the <todb> database on the current server. The only requirement for this to work is that the database instance runs in noauth mode.|
|3||db.fromCollection.renameCollection(toCollection)||As suggested from the method name, the collection named ‘fromCollection’ is being renamed to ‘toCollection’.|
|4||db.repairDatabase()||This command repairs and compacts the current database. This operation can be very slow if it is performed on a huge database.|
|5||db.getCollectionNames()||Provides a list of the collections available in the current database.|
|6||db.dropDatabase()||This command drops the current database.|
MongoDB Commands Listing (Basic Shell JS Operations):
Apart from the above help options and command line options, MongoDB provides a rich JavaScript API for database related operations. db is the mongo shell variable that holds the current database that we are pointing to. The variable is reset to <db> when the command use <db> is run; until then it points to the test database.
Some of the commonly used JavaScript operations are discussed in the table below.
|S.No.||JS Database Operations||Description|
|1||db.auth()||Authenticates the user, if the MongoDB shell is running in the secure mode.|
|2||myCollectionVariable = db.<collection>||Assigns a specific collection to a variable, for example myCollectionVariable = db.myCollection; you can then perform operations on myCollection through the variable myCollectionVariable instead.|
|3||db.collection.find()||Finds all the documents in a specific collection and returns a cursor. With the variable from the previous row: myCollectionVariable.find()|
|4||db.collection.insertOne()||Inserts one document into the collection specified via the command.|
|5||db.collection.insertMany()||Inserts multiple documents into the collection specified via the command.|
|6||db.collection.updateOne()||Updates one single document in the collection specified via the command.|
|7||db.collection.updateMany()||Updates multiple documents in the collection specified via the command.|
|8||db.collection.save()||Inserts a new document into the collection if it doesn’t exist, and updates the document in the collection if it does.|
|9||db.collection.deleteOne()||Deletes one document from the collection specified via the command.|
|10||db.collection.deleteMany()||Deletes multiple documents from the collection specified via the command.|
|11||db.collection.drop()||Drop or remove the collection entirely from the database|
|12||db.collection.createIndex()||This command creates a new index on the collection specified via the command, if it doesn’t exist. If the index already exists, then there is no effect of this command over the collection specified by the command.|
This article provides a one-stop shop for administration related queries, help related commands and the basic CRUD operations that can be done on collections in a MongoDB database. As mentioned in the introduction, it is written for the latest stable release of MongoDB at the time of writing (v3.4.7). Commands provided here may not work on other versions of MongoDB, since many things change from one version to the next.
Hope this article helps you in getting the basic gist of the much-needed commands in using MongoDB in your day to day tasks.