- Sparkify startup requested to analyze user activities using song and user data.
- The project optimizes queries on song play analysis.
- It creates database schema and the Extract, Transform and Load (ETL) pipeline.
Data for song and user activities is in two directories data/log_data and data/song_data as JSON files.
- subset of real data from the Million Song Dataset.
- each file is in JSON format
- contains metadata about a song and the artist of that song.
- files are partitioned by the first three letters of each song's track ID.
- example of file paths to two files in song dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
- example of single song file, TRAABJL12903CDCF1A.json:
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
- log files in JSON format generated by event simulator based on the songs in the song dataset.
- these simulate activity logs from a music streaming app based on specified configurations.
- log files are partitioned by year and month.
- example, here are filepaths to two files in log dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
- example of single log file, 2018-11-12-events.json:
{
"artist":null,
"auth":"Logged In",
"firstName":"Celeste",
"gender":"F",
"itemInSession":0,
"lastName":"Williams",
"length":null,
"level":"free",
"location":"Klamath Falls, OR",
"method":"GET",
"page":"Home",
"registration":1541077528796.0,
"sessionId":438,
"song":null,
"status":200,
"ts":1541990217796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/37.0.2062.103 Safari\/537.36\"",
"userId":"53"
}
Relational database model with star schema that contains 1 fact table (songplays) and 4 dimension tables (users, songs, artists and time)
- records in log data associated with song plays i.e. records with page: 'NextSong'
Column | Type | Nullable |
---|---|---|
songplay_id (PK) | SERIAL | NOT NULL |
start_time | timestamp | NOT NULL |
user_id | int | NOT NULL |
level | varchar | |
song_id | varchar | |
artist_id | varchar | |
session_id | int | |
location | varchar | |
user_agent | varchar |
- users in the app
Column | Type | Nullable |
---|---|---|
user_id (PK) | int | NOT NULL |
first_name | varchar | |
last_name | varchar | |
gender | char(1) | |
level | varchar |
- songs in music database
Column | Type | Nullable |
---|---|---|
song_id (PK) | varchar | NOT NULL |
title | varchar | NOT NULL |
artist_id | varchar | NOT NULL |
year | int | |
duration | numeric | NOT NULL |
- artists in music database
Column | Type | Nullable |
---|---|---|
artist_id (PK) | varchar | NOT NULL |
name | varchar | NOT NULL |
location | varchar | |
latitude | float | |
longitude | float |
- timestamps of records in songplays broken down into specific units
Column | Type | Nullable |
---|---|---|
start_time (PK) | timestamp | NOT NULL |
hour | int | |
day | int | |
week | int | |
month | int | |
year | int | |
weekday | int |
- contains all the sql queries
- drops and creates the tables.
- run this file to reset the tables before each time you run the ETL scripts.
- reads and processes a single file from song_data and log_data
- loads the data into tables.
- contains detailed instructions on the ETL process for each of the tables.
- displays the first few rows of each table to check the database.
- reads and processes files from song_data and log_data
- loads data into the tables.
-
open terminal
-
run create_tables.py to drop and create tables:
python create_tables.py
- run etl.py to process files and load data:
python etl.py