MongoDB is a popular NoSQL database. Its loose structure makes it well suited for capturing unstructured data, such as that encountered in web scraping. This sprint will focus on getting up and running with this system. This is intended to be an individual sprint.
You should already have a local Mongo Docker container. Let's practice our AWS skills by spinning up a micro instance and installing MongoDB on that remote machine.
-
To install MongoDB, use your operating system's package manager:
- Ubuntu Linux:
sudo apt-get install mongodb
-
Much like Postgres, you will need to launch the server before using Mongo for the first time.
- Ubuntu Linux:
sudo /etc/init.d/mongodb start
-
Check your installation by opening the MongoDB Client:
- Open a new terminal and type mongo to open a Mongo shell.
- Type show dbs; to list the databases you have.
- Type exit to leave the shell.
-
Resources and quick references to Mongo commands:
You don't need to install MongoDB locally. However, if you prefer not to use Docker, here are the Mac steps.
-
Install Mongo:
- Mac OS X:
brew install mongodb
-
Launch the server (note: make sure Docker is not running first):
- Mac OS X:
brew services start mongodb
To get familiar with MongoDB, we are going to load in some click-log data from a government website and do some basic queries on it. Write your queries in a text file. Paste and run the queries in the Mongo shell.
-
In your terminal, navigate to the data directory in the web-scraping repository and load in the data with
mongoimport --db clicks --collection log < click_log.json
-
In the Mongo shell, run show dbs; to make sure the clicks database has been created. Run use clicks; to use the clicks database for your queries.
-
Inspect the log collection in your database. How many entries are in the log collection?
If you are not sure which command to use, you can access the help section with:
help
db.help()
db.<collection_name>.help()
The Mongo shell also supports tab completion, so you can tab-complete some of your commands for convenience.
-
Print out all of the clicks you have stored using .find(). Now, using .limit(), return 10 entries. You can also use .findOne() to quickly view the first entry and examine the available fields.
-
Use .find() to find all the clicks where cy (city) is San Francisco. How many are there?
-
Use .distinct() to find all the distinct types of web browsers (under the field a) people use to visit the sites. Count the number of distinct web browsers (use .length on your distinct list).
-
Select and count the records where the users have visited a website from either a Mozilla or an Opera web browser. Search the a field using a regex in Mongo.
-
Find the type of the t (timestamp) field. You can access the type of a field in an entry by using
typeof db.log.findOne({'t': {$exists: true}}).t
The field should currently be a number.
Convert the timestamp field to the date type. You will need to multiply the number by 1000 (JavaScript dates are measured in milliseconds) and then make it a Date object (you can create a Date object by using new Date()). You can loop over each record using .forEach() and then .update() the record (using the _id field) with the created Date object. When you're done, confirm that the data type has been converted. Below is some template code.
db.log.find({'t': {$exists: true}}).forEach(function(entry) {
    // your code to update an entry by _id and set the t field as a new
    // Date() object
})
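The conversion at the heart of this step can be tried outside the database first. The following is a minimal sketch in plain JavaScript; the helper name secondsToDate and the sample timestamp are illustrative assumptions, not part of the data set.

```javascript
// Hypothetical helper: convert a Unix timestamp in seconds to a Date.
// JavaScript's Date constructor expects milliseconds, hence the * 1000.
function secondsToDate(seconds) {
  return new Date(seconds * 1000);
}

// Made-up sample value, for illustration only.
const sample = 1331923247;
const converted = secondsToDate(sample);

console.log(typeof sample);             // "number"
console.log(converted instanceof Date); // true
console.log(converted.toISOString());
```

Inside the .forEach() loop, the same expression could supply the new value of the t field for each entry.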
-
Sort the clicks by the timestamp and find when the first click occurred. How many clicks occurred in the first hour? To answer this, assign the earliest timestamp and the timestamp at the one-hour bound to separate variables before writing the query.
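One way to set up the two bounds is sketched below in plain JavaScript. The helper names and the sample date are hypothetical, and treating the first hour as start <= t < start + 1 hour is an assumption about how the bound should be handled.

```javascript
// Milliseconds in one hour; Date arithmetic in JavaScript is done in ms.
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: given the earliest click time, return both bounds.
function firstHourWindow(firstClick) {
  const end = new Date(firstClick.getTime() + ONE_HOUR_MS);
  return { start: firstClick, end: end };
}

// Build a range filter on the t field: start <= t < end.
function firstHourQuery(firstClick) {
  const w = firstHourWindow(firstClick);
  return { t: { $gte: w.start, $lt: w.end } };
}

// Made-up earliest timestamp, for illustration only.
const earliest = new Date("2012-03-16T00:00:00Z");
const query = firstHourQuery(earliest);
console.log(query.t.$lt.toISOString()); // 2012-03-16T01:00:00.000Z
```

In the Mongo shell, a filter of this shape could be passed to db.log.find() once the t field holds dates rather than numbers.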
-
Using Mongo's aggregation functionality, can you find what the most popular link clicked is? You will need to use $group, $sum, and $sort.
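To see what a $group, $sum, and $sort pipeline computes, here is a rough sketch in plain JavaScript. It builds the pipeline as an ordinary array and then mimics the same count-and-sort logic on a few made-up documents. The field name u for the clicked link, the helper names, and the sample URLs are all assumptions; check the actual field with .findOne() first. The $limit stage is an optional extra not mentioned above.

```javascript
// Build an aggregation pipeline that counts documents per value of `field`
// and sorts the groups from most to least frequent.
function mostPopularPipeline(field) {
  return [
    { $group: { _id: "$" + field, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 1 } // optional: keep only the top group
  ];
}

// Plain-JavaScript imitation of the $group and $sort stages, for intuition.
function countAndSort(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const key = doc[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

// Made-up documents; "u" is an assumed field name for the clicked link.
const sampleClicks = [
  { u: "http://example.com/a" },
  { u: "http://example.com/b" },
  { u: "http://example.com/a" }
];

console.log(JSON.stringify(mostPopularPipeline("u")));
console.log(countAndSort(sampleClicks, "u")[0]);
```

In the Mongo shell, an array like the one returned by mostPopularPipeline could be passed to db.log.aggregate().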
MongoDB also has some geospatial facilities (don't worry, PostgreSQL has even better ones). Using geospatial indexes and Mongo queries, find the following:
- All clicks within 50 miles of San Francisco
- All clicks that came from New England
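For the radius search, one possible approach is a $geoWithin query with $centerSphere, which takes its radius in radians rather than miles. The sketch below, in plain JavaScript, shows the unit conversion and the shape of such a filter. The field name loc, the helper names, and the rounded San Francisco coordinates are assumptions; MongoDB expects coordinate pairs in longitude, latitude order, so the stored data may need reordering.

```javascript
// Approximate equatorial radius of the Earth in miles.
const EARTH_RADIUS_MILES = 3963.2;

// $centerSphere expects the radius in radians: distance / Earth radius.
function milesToRadians(miles) {
  return miles / EARTH_RADIUS_MILES;
}

// Hypothetical helper: build a filter matching points within `miles`
// of (lng, lat). `field` must hold [longitude, latitude] pairs.
function withinMilesQuery(field, lng, lat, miles) {
  return {
    [field]: {
      $geoWithin: { $centerSphere: [[lng, lat], milesToRadians(miles)] }
    }
  };
}

// Rounded coordinates for San Francisco; "loc" is an assumed field name.
const sf = { lng: -122.42, lat: 37.77 };
const nearSf = withinMilesQuery("loc", sf.lng, sf.lat, 50);
console.log(JSON.stringify(nearSf));
```

A filter of this shape could then be passed to db.log.find() in the Mongo shell. The New England question needs a region rather than a circle, so it would call for a different shape operator.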
CartoDB happens to be one of my favorite tools for geospatial analysis (with built-in PostGIS querying). Map the clicks across the globe, then visualize clicks over time with a Torque map.
Here are some additional GUI clients if you want to try one (my favorite is Robomongo):
- Robomongo (Multiplatform)
- MongoHub (Mac OS X), with a downloadable binary
- Humongous (web based)