Learning NOSQL and MongoDB / by Robert Walker

Given the rather stagnant advancements in database technologies over the last couple of decades, I was super excited to start learning NOSQL and the various implementations of big data stores.  Here I describe how and what I learned, along with some tips to get you started.  This is by no means a be-all/end-all post on NOSQL, and it doesn't touch on NewSQL (e.g., Google Cloud Spanner) at all.

Learning

First I had to learn.  What is NOSQL?  Why does NOSQL exist?  What are the applications of NOSQL?  Relational vs Non-Relational?  Why are there so many "kinds" of NOSQL?

So let's go through these:

What is NOSQL?  NOSQL (short for Not-Only SQL) is best described as a list of data - sometimes with nested lists.  Generally, these "lists" take on some form of array or dictionary structure.
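For a concrete picture, here's a tiny sketch of what one of those nested "lists" might look like as a plain JavaScript object - the article and comment fields are made up purely for illustration:

// A hypothetical NOSQL "record": a dictionary with nested lists inside it
var article = {
  title: "Wells running dry across the county",
  author: "jsmith",
  tags: ["water", "drought"],                      // nested list
  comments: [                                      // nested list of dictionaries
    { user: "rwalker", text: "Great piece!", posted: new Date() },
    { user: "anon42", text: "Cite your sources.", posted: new Date() }
  ]
};
console.log(article.comments.length); // 2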

Why does NOSQL exist?  NOSQL exists because lists are generally much faster to navigate, parse, and manipulate than SQL tables.  Most of the speed advantage comes from not waiting on a bunch of joins to match up.  Also, with properly set index columns or keys, items can be found from memory - making the lookup near instant.  So, in a world of ever-increasing data, where the consumers and analysts of that data expect "instant" access, NOSQL became the solution.
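To make that concrete, here's a toy illustration in plain JavaScript (not MongoDB itself, and the readings are made up) of why a keyed lookup beats scanning a list:

// Toy illustration: scanning a list vs. looking up by key.
var readings = [
  { ReadingID: "R-0001", Pressure: 14.2 },
  { ReadingID: "R-0002", Pressure: 14.6 },
  { ReadingID: "R-0003", Pressure: 13.9 }
];

// Without an index: check every item until the key matches (slow for millions of rows)
var hit = readings.find(function (r) { return r.ReadingID === "R-0003"; });

// With an "index": build a dictionary keyed on ReadingID once, then jump straight to the item
var byID = {};
readings.forEach(function (r) { byID[r.ReadingID] = r; });
console.log(byID["R-0003"].Pressure); // 13.9 - a near-instant lookup by key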

What are the applications of NOSQL?  Website comments, time-series data, real-time streaming data, and more.  NOSQL can/should be used whenever "instant" access is a higher priority than transactional guarantees.  Generally, this applies to web applications.

Relational vs Non-Relational?  So for decades we used almost exclusively relational database management systems (RDBMS) like SQL Server, MySQL, Access, etc.  These are great for data storage and transaction management, and they follow the ACID properties.  However, if the data are large or there are many joins to conduct, even with proper indexing, the time to return a result can be too long for the analyst or consumer of the data.  So along came non-relational databases, aka NOSQL (though some NOSQL variations do support some joins).  These are quick because there are no joins - just find the data, maybe grab a child value, and done.  The downside?  No ACID - you could have duplicates, typos, asynchronous transactions, etc.  Depending on your application, you may or may not care.

For example, have you ever seen the same comment posted twice by the same person on a news article?  An RDBMS could prevent that using primary keys - but you'd have a list of users, a list of comments, and possibly another list of news articles to join together first.  That takes time, so comments get posted slower - especially if hundreds or thousands of users are trying to post at the same time.  NOSQL doesn't care - post away.  It'll just append your comment, duplicate or not, to the news article and move on (a moderator or bot will likely come along and clean those up later - but for a website it's more important to capture the comment now and worry about cleanup later than to lose it because the RDBMS is backed up).  So choose your database wisely depending on the purpose.
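To put the comment example in code, here's a rough sketch of the NOSQL side in mongo shell syntax - the articles collection and its fields are hypothetical, and the RDBMS side is shown only as a comment for contrast:

// Sketch only: posting a comment the NOSQL way, assuming a hypothetical
// "articles" collection where each article document carries its own comments.
//
// The RDBMS route might first have to join users, articles, and comments
// to enforce uniqueness before inserting, e.g.:
//   SELECT ... FROM comments c
//     JOIN users u ON u.user_id = c.user_id
//     JOIN articles a ON a.article_id = c.article_id
//   WHERE ...;
//
// MongoDB just appends the comment to the article document and moves on
// (articleId is assumed to already be in hand):
db.articles.updateOne(
  { _id: articleId },
  { $push: { comments: { user: "rwalker", text: "First!", posted: new Date() } } }
);
// Duplicates are possible - cleanup, if it matters, happens later.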

Why are there so many "kinds" of NOSQL?  Like all technologies, NOSQL has diverged into a number of niche types, each designed to support a specific kind of data.  The "kinds" are: Key-Value, Column-Oriented, Document-Oriented, and Graph.  Depending on the goals of your application and the data you are planning to store and retrieve, you may use one or more kinds of NOSQL databases.  Each kind has several "brands" behind it, with names like Cassandra, HBase, MongoDB, and others.
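As a quick sketch (with made-up fields), here's the same well reading represented two of those ways:

// Key-value store: one opaque value per key - great for caches and simple lookups
//   "reading:MW-1:2015-06-01T00:15" -> '{"Pressure":14.2,"Temp":12.1}'

// Document store (e.g., MongoDB): a richer, queryable structure per record
var reading = {
  LocationID: "MW-1",
  DateTime: new Date("2015-06-01T00:15:00Z"),
  Pressure: 14.2,
  Temp: 12.1
};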

So how did I spend time learning NOSQL?  Online courses, Google, Wikipedia, technical articles, YouTube, help documentation, etc.  Use whatever you can find - a lot of information is free on the internet.

Implementation

So after I had learned enough about NOSQL to put that knowledge into practice, I had to 1) have some data, 2) build the database, and 3) have a host for my database.

Number 1 was easy - a client had just given me about three years of transducer data, recorded every 15 minutes at several wells.  That is over 880,000 records.

Number 2 took some research, and because I knew I was building a web application with dynamic charting, I turned to MongooseJS and MongoDB.  After reading the documentation and setting up the server, I was able to create a model schema for my database that looks like this:

// data model
var mongoose = require('mongoose'),
    Schema = mongoose.Schema;

// Schema for our transducer data
var TransducerData = new Schema({
  LocationID: String,
  LocAlias: [String],
  LocCoords: {
    X: Number,
    Y: Number,
    Z: Number,
    SurveySource: String,
    SurveyDate: Date
  },
  ReadingID: String,
  DateTime: Date,
  Pressure: Number,
  Temp: Number,
  BaroPressure: Number,
  SensorDepth: Number,
  CouplerDetached: String,
  CouplerAttached: String,
  HostConnected: String,
  Stopped: String,
  EndOfFile: String,
  DateModified: Date
});

// Secondary index for lookups by reading ID
TransducerData.index({ ReadingID: 1 });
// Compound unique index: one reading per location per timestamp
TransducerData.index({ LocationID: 1, DateTime: 1 }, { unique: true });

// Create collection (model) - no data added yet
module.exports = mongoose.model('transducerdata', TransducerData);

Finally, I needed a host for the database that I could rely on to retrieve the data.  I chose mLab because they offer a free 500 MB tier for MongoDB NOSQL stores.
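Connecting the app to the hosted database is then a one-liner in Mongoose - the URI below is just a placeholder, since mLab hands you the real connection string when you create the database:

// Connect once at app startup.  Replace the placeholder URI with the
// connection string mLab provides for your own database and user.
var mongoose = require('mongoose');
mongoose.connect('mongodb://<dbuser>:<dbpassword>@<host>:<port>/<dbname>');

mongoose.connection.on('error', function (err) {
  console.error('MongoDB connection error:', err);
});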

So that's about it - the rest is JavaScript to query the data, process the JSON response, and build the interactive charts!  Not too bad!
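For the curious, here's a rough sketch of the kind of query I mean, using the TransducerData model from above (the require path and location ID are just illustrative):

// Sketch: pull one well's readings for a date range, sorted for charting.
// This rides on the compound { LocationID, DateTime } index defined on the schema.
var TransducerData = require('./models/transducerdata'); // path is illustrative

TransducerData.find({
  LocationID: 'MW-1',
  DateTime: { $gte: new Date('2014-01-01'), $lt: new Date('2015-01-01') }
})
  .sort({ DateTime: 1 })
  .lean()  // plain JSON objects, ready to hand to the charting code
  .exec(function (err, readings) {
    if (err) return console.error(err);
    // readings is an array of { DateTime, Pressure, Temp, ... } objects
    console.log('Got ' + readings.length + ' readings');
  });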