Dana M. Brannon, The University of Texas at Austin
This repository shows the process of ingesting COVID-19 data from the Johns Hopkins dataset into Google Cloud Platform's BigQuery, modeling the data, and finally performing an analysis on the data and generating a report.
This was an exercise we did in my Elements of Databases class, taught by Professor Shirley Cohen, who is a Solutions Architect at Google.
- Ingest the data
  - Download data from the Johns Hopkins repo into a custom GCP bucket
  - Create a BigQuery dataset and load the files into tables
  - Inspect the data and merge the tables
- Model the data
  - Create a modeling dataset
  - Implement a location-based primary key
  - Split the table into Location and Event tables
  - Standardize the Event table using SQL
  - Standardize the Location table using Beam
- Analyze the data
  - Explore the modeled tables
  - Create views for the data we want to visualize
  - Access those views inside Data Studio to create some cool charts!
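
To make the modeling step more concrete, here is a minimal plain-Python sketch of what "implement a location-based primary key" and "split the table into Location and Event tables" could look like. It is not the code from the notebooks, which do this work in BigQuery SQL and Beam. The column names (`country_region`, `province_state`, `last_update`, `confirmed`, `deaths`) and the integer `location_id` scheme are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def split_rows(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split merged report rows into a Location table and an Event table.

    Assumption: each row has country_region, province_state, last_update,
    confirmed and deaths keys. The location_id is a hypothetical surrogate
    key assigned in order of first appearance.
    """
    location_ids: Dict[Tuple[str, str], int] = {}
    locations: List[dict] = []
    events: List[dict] = []

    for row in rows:
        # A location is identified by its (country, province) pair.
        key = (row["country_region"].strip(), row["province_state"].strip())

        if key not in location_ids:
            location_ids[key] = len(location_ids) + 1
            locations.append({
                "location_id": location_ids[key],
                "country_region": key[0],
                "province_state": key[1],
            })

        # Each event row keeps only the key plus the measurements.
        events.append({
            "location_id": location_ids[key],
            "last_update": row["last_update"],
            "confirmed": row["confirmed"],
            "deaths": row["deaths"],
        })

    return locations, events


sample_rows = [
    {"country_region": "US", "province_state": "Texas",
     "last_update": "2020-04-01", "confirmed": 100, "deaths": 2},
    {"country_region": "Italy", "province_state": "",
     "last_update": "2020-04-01", "confirmed": 500, "deaths": 40},
    {"country_region": "US", "province_state": "Texas",
     "last_update": "2020-04-02", "confirmed": 130, "deaths": 3},
]
locations, events = split_rows(sample_rows)
```

With the sample rows above, the two Texas reports share one `location_id`, so the Location table holds each place once while the Event table holds one row per report.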
If you would like to perform your own analysis, feel free to clone this repository. First, you must have access to the Google Cloud Platform console. You also need to create a bucket in GCP's Storage Browser. Mine is called covid-19-johnshopkins; you can change that name to match your own bucket in the Ingestion notebook.
You should be comfortable with SQL as well as Apache Beam in order to use these notebooks. After modeling the data in appropriate tables, you can then use GCP's Data Studio to create custom visualizations and reports.
Happy coding!
Here are the results of the analysis, charted using GCP's Data Studio. Portions of the data are missing in late March and parts of April. I wanted to represent the original data as-is, but in some scenarios it may make sense to fill in the missing data with an average of the points closest to it, so that readers are less likely to be confused by the gaps or misinterpret the data. Whether to manipulate the data this way depends entirely on the context in which it is presented.
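If you do decide to fill the gaps, one simple reading of "an average of the points closest to it" is to replace each missing value with the mean of the nearest known value on either side. The sketch below is a plain-Python illustration of that idea, not something the notebooks do. The function name `fill_gaps` and the use of `None` to mark a missing day are assumptions.

```python
from typing import List, Optional


def fill_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value with the mean of its nearest known neighbors.

    Missing values are marked with None. Gaps at the very start or end of
    the series have only one known neighbor, so they are left as None.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue

        # Nearest known value before and after position i in the original data.
        before = next(
            (values[j] for j in range(i - 1, -1, -1) if values[j] is not None),
            None,
        )
        after = next(
            (values[j] for j in range(i + 1, len(values)) if values[j] is not None),
            None,
        )

        if before is not None and after is not None:
            filled[i] = (before + after) / 2

    return filled


print(fill_gaps([10, None, 20]))  # [10, 15.0, 20]
```

Every day inside a multi-day gap gets the same averaged value, which produces a flat segment on the chart. If you fill values this way, it is worth marking them as estimated in the report so readers can tell them apart from the original data.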
Here are the steps for setting up your Cloud Platform environment with Apache Beam and Dataflow.