MongoDB

MongoDB is a popular NoSQL database. Its loose structure makes it well suited for capturing unstructured data, such as that encountered in web scraping. This sprint will focus on getting up and running with this system. This is intended to be an individual sprint.

AWS MongoDB Installation

You should already have a local Mongo Docker container. Let's practice our AWS skills by spinning up a micro instance and installing MongoDB on a remote machine.

  1. To install MongoDB, use your operating system's package manager:

    • Ubuntu Linux: sudo apt-get install mongodb
  2. Much like Postgres, you will need to launch the server before using Mongo for the first time.

    • Ubuntu Linux: sudo /etc/init.d/mongodb start
  3. Check your installation by opening the MongoDB Client:

    • Open a new terminal and type mongo to open up a Mongo shell
    • Type show dbs; to show the databases you have
    • You can exit by typing exit
  4. Resources and quick references to Mongo commands:

Mac Install (Optional)

You don't need to install locally. However, if you prefer not to use Docker, here are the Mac steps.

  1. Install Mongo:

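    • Mac OS X: brew install mongodb (current Homebrew no longer ships this formula; there, run brew tap mongodb/brew and then brew install mongodb-community)
  2. Launch the server (note: make sure Docker is not running at the same time):

    • Mac OS X: brew services start mongodb (or brew services start mongodb-community if you installed from the tap)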

Practicing Mongo Queries

To get familiar with MongoDB, we are going to load in some click-log data from a government website and do some basic queries on it. Write your queries in a text file, then paste and run them in the Mongo shell.

  1. In your terminal, navigate to the data directory in the web-scraping repository and load in the data with mongoimport --db clicks --collection log < click_log.json

  2. In the Mongo shell, run show dbs; to make sure the clicks database has been created. Run use clicks; to use the clicks database for your queries.

  3. Inspect the log collection in your database. How many entries are in the log collection?

    If you are not sure which command to use, you can bring up the built-in help with any of:

    • help
    • db.help()
    • db.<collection_name>.help()

    The Mongo shell also supports tab completion, which saves typing on longer commands.

  4. Print out all of the clicks you have stored using .find(). Now, using .limit(), return 10 entries. You can also use .findOne() to quickly view the first document and examine the available fields.

  5. Use .find() to find all the clicks where cy (city) is San Francisco. How many are there?

  6. Use .distinct() to find all the distinct types of web browsers (under the field a) people use to visit the sites. Count the number of distinct web browsers (use .length on your distinct list).

  7. Select and count the records where users visited a website from either a Mozilla or an Opera web browser. Search the a field using a regular expression.

  8. Find the type of the t (timestamp) field. You can access the type of a field in an entry by using typeof db.log.findOne({'t': {$exists: true}}).t. The field should be a number now.

    Convert the timestamp field to the date type. You will need to multiply the number by 1000 and then make it a Date object (you can create a Date object by using new Date()). You can loop over each record using .forEach() and then .update() the record (using the _id field) with the created Date object. When you're done, confirm that the data type has been converted. Below is some template code.

    db.log.find({'t': {$exists: true}}).forEach(function(entry) {
       // your code to update an entry by _id and set the t field as a new 
       //  Date() object
    })
  9. Sort the clicks by the timestamp and find when the first click occurred. How many clicks occurred in the first hour? To answer this, assign the earliest timestamp and the timestamp at the one-hour bound to separate variables before writing the query.

  10. Using Mongo's aggregation functionality, find the most popular link clicked. You will need to use $group, $sum, and $sort.
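
If the timestamp handling in steps 8 and 9 feels unfamiliar, it can be rehearsed outside the database first. The sketch below is plain JavaScript (the language the Mongo shell uses) running on a small in-memory array. The sample records and the helper names toDate and countInFirstHour are hypothetical and made up for illustration; only the t field name comes from the exercise. It assumes t holds Unix time in seconds, which is why it is multiplied by 1000 before being passed to new Date().

```javascript
// Hypothetical sample records; the real documents live in db.log.
const sampleClicks = [
  { _id: 1, t: 1331923247 },
  { _id: 2, t: 1331923249 },
  { _id: 3, t: 1331926850 }, // 3603 seconds after the earliest click
  { _id: 4, t: 1331925000 },
];

// JavaScript Dates count milliseconds since the epoch,
// so a timestamp in seconds is scaled by 1000.
function toDate(epochSeconds) {
  return new Date(epochSeconds * 1000);
}

// Convert each record's t field to a Date. This mirrors, per record,
// what the forEach/update loop in the template does per document.
const converted = sampleClicks.map((entry) => ({ ...entry, t: toDate(entry.t) }));

// Count the records whose t falls in [earliest, earliest + 1 hour).
function countInFirstHour(records) {
  const sorted = [...records].sort((a, b) => a.t - b.t);
  const earliest = sorted[0].t;
  const oneHourLater = new Date(earliest.getTime() + 60 * 60 * 1000);
  return records.filter((r) => r.t >= earliest && r.t < oneHourLater).length;
}

const firstHourCount = countInFirstHour(converted);
console.log(typeof sampleClicks[0].t, converted[0].t instanceof Date, firstHourCount);
// prints: number true 3
```

In the shell, the same two bounds could be stored in variables and used in a range filter with $gte and $lt. Whether the one-hour bound itself is included is a choice; this sketch excludes it.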

Extra Credit

MongoDB actually has some geospatial facilities (don't worry, PostgreSQL has even better ones). Using geospatial indexes and Mongo queries, find the following:

  1. All clicks within 50 miles of San Francisco
  2. All clicks that came from New England
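
For the 50-mile question, one possible shape is a $geoWithin filter with $centerSphere, whose radius is given in radians (the distance divided by the Earth's radius, roughly 3963.2 miles). The sketch below only builds the filter document as a plain JavaScript object. The field name loc, the helper withinMilesFilter, and the San Francisco coordinates are assumptions for illustration, so check which field in the log collection actually holds coordinates and in what order they are stored. MongoDB expects longitude first.

```javascript
const EARTH_RADIUS_MILES = 3963.2; // approximate equatorial radius of the Earth

// Build a filter matching documents whose `field` lies within
// `miles` of the point (lng, lat). The helper name is hypothetical.
function withinMilesFilter(field, lng, lat, miles) {
  return {
    [field]: {
      $geoWithin: {
        // $centerSphere takes [[longitude, latitude], radius in radians]
        $centerSphere: [[lng, lat], miles / EARTH_RADIUS_MILES],
      },
    },
  };
}

// Approximate coordinates for San Francisco (assumed values),
// and "loc" as a stand-in for the real coordinate field.
const sfFilter = withinMilesFilter("loc", -122.4194, 37.7749, 50);
console.log(JSON.stringify(sfFilter, null, 2));
```

A filter like this would be passed to db.log.find() in the shell. The New England question needs a region rather than a circle, so it would likely call for a polygon-based query instead.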

CartoDB

CartoDB happens to be one of my favorite tools for geospatial analysis (with built-in PostGIS querying). Map the clicks across the globe. Visualize clicks over time with a Torque map.

Additional GUI clients

Here are some additional GUI clients if you want to try them (my favorite is RoboMongo):