
COVID-19 Ingestion, Modeling, and Analysis

Dana M. Brannon, The University of Texas at Austin

This repository shows the process of ingesting COVID-19 data from the Johns Hopkins dataset into Google Cloud Platform's BigQuery, modeling it, and finally analyzing it and generating a report.

This was an exercise we did in my Elements of Databases class, taught by Professor Shirley Cohen, who is a Solutions Architect at Google.

Summary of the notebooks

  1. Ingest the data
    1. Download data from Johns Hopkins repo into a custom GCP bucket
    2. Create a BigQuery dataset and load the files into tables (see the ingestion sketch after this list)
    3. Inspect the data and merge the tables
  2. Model the data
    1. Create a modeling dataset
    2. Implement a location-based primary key (see the modeling sketch after this list)
    3. Split the table into Location and Event tables
    4. Standardize Event table using SQL
    5. Standardize Location table using Beam
  3. Analyze the data
    1. Explore the modeled tables
    2. Create views for the data we want to visualize
    3. Access those views inside Data Studio to create some cool charts!
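
As a concrete illustration of the ingestion step, here is a minimal sketch using the google-cloud-storage and google-cloud-bigquery clients. The dataset name, table name, and the specific daily-report file are placeholders, not the notebooks' exact code:

```python
# Hypothetical sketch: stage one JHU daily-report CSV in a GCS bucket,
# then load it into a BigQuery table. Adjust names to your own project.
import requests
from google.cloud import bigquery, storage

BUCKET = "covid-19-johnshopkins"   # your GCS bucket
DATASET = "covid19"                # your BigQuery dataset (placeholder)
JHU_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/04-01-2020.csv"
)

# 1. Download the raw CSV and stage it in the bucket.
csv_bytes = requests.get(JHU_URL).content
blob = storage.Client().bucket(BUCKET).blob("daily_reports/04-01-2020.csv")
blob.upload_from_string(csv_bytes, content_type="text/csv")

# 2. Load the staged file into BigQuery, autodetecting the schema.
bq = bigquery.Client()
job = bq.load_table_from_uri(
    f"gs://{BUCKET}/daily_reports/04-01-2020.csv",
    f"{bq.project}.{DATASET}.daily_reports",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
job.result()  # wait for the load job to finish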
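And here is a rough sketch of the core modeling steps: deriving a location-based primary key and splitting the merged data into Location and Event tables. The table and column names are illustrative, not the repo's actual schema:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Build a surrogate key from the location columns, then split the merged
# table into a Location dimension and an Event fact table keyed on it.
# (covid19.merged and the column names below are assumptions.)
bq.query("""
CREATE OR REPLACE TABLE covid19_modeled.location AS
SELECT DISTINCT
  FARM_FINGERPRINT(CONCAT(IFNULL(province_state, ''), country_region)) AS location_id,
  province_state,
  country_region
FROM covid19.merged
""").result()

bq.query("""
CREATE OR REPLACE TABLE covid19_modeled.event AS
SELECT
  FARM_FINGERPRINT(CONCAT(IFNULL(province_state, ''), country_region)) AS location_id,
  date,
  confirmed,
  deaths,
  recovered
FROM covid19.merged
""").result()
```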

Using these notebooks

If you would like to perform your own analysis, feel free to clone this repository. First, you must have access to the Google Cloud Platform console. You also need to create a bucket in GCP's Storage browser (mine is called covid-19-johnshopkins; update the bucket name in the Ingestion notebook to match yours).

You should be comfortable with SQL as well as Apache Beam in order to use these notebooks. After modeling the data into appropriate tables, you can use GCP's Data Studio to create custom visualizations and reports.
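
As an example of the kind of view Data Studio can chart directly, something like this could be created in the Analysis notebook (the dataset, view, and column names here are assumptions, not the repo's exact SQL):

```python
from google.cloud import bigquery

# Hypothetical view giving Data Studio a ready-made per-day chart source.
bq = bigquery.Client()
bq.query("""
CREATE OR REPLACE VIEW covid19_modeled.v_italy_daily AS
SELECT e.date, SUM(e.confirmed) AS confirmed, SUM(e.deaths) AS deaths
FROM covid19_modeled.event AS e
JOIN covid19_modeled.location AS l USING (location_id)
WHERE l.country_region = 'Italy'
GROUP BY e.date
""").result()
```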

Happy coding!

Data Studio Visual Report

Here are the results of the analysis, charted using GCP's Data Studio. You can see that portions of the data are missing in late March and parts of April. I wanted to represent the original data as-is, but in some scenarios it may make sense to fill in the missing points with an average of their nearest neighbors so that readers are less likely to misinterpret the charts. Whether that kind of manipulation is appropriate depends entirely on how the data will be presented.
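
For readers who want to try that gap-filling approach, here is a toy sketch using pandas (the numbers are made up, not the JHU data):

```python
import pandas as pd

# Toy illustration of the gap-filling idea described above: replace each
# missing value with values interpolated from its neighbors. For a
# single-point gap, linear interpolation is exactly the mean of the two
# surrounding points.
cases = pd.Series(
    [100, 120, None, 180, None, None, 260],
    index=pd.date_range("2020-03-25", periods=7),
)
filled = cases.interpolate(method="linear")
print(filled)  # 2020-03-27 becomes (120 + 180) / 2 = 150
```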

China Data

(screenshot: Data Studio chart of the China data)

Italy Data

(screenshot: Data Studio chart of the Italy data)

US Data

(screenshot: Data Studio chart of the US data)

Setting up Beam and Dataflow within GCP

Here are the steps for getting your Cloud Platform environment set up with Apache Beam and Dataflow.
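
In short, install the Beam SDK with the GCP extras (`pip install "apache-beam[gcp]"`) and point your pipeline at the DataflowRunner. A minimal sketch to verify the environment, with placeholder project, region, and output paths:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # swap for "DirectRunner" to test locally
    project="your-gcp-project",       # placeholder project id
    region="us-central1",
    temp_location="gs://covid-19-johnshopkins/tmp",  # staging bucket
)

# Trivial pipeline just to confirm Dataflow works end to end: read the
# staged CSVs, apply a stand-in transform, and write back to the bucket.
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText(
            "gs://covid-19-johnshopkins/daily_reports/*.csv",
            skip_header_lines=1,
        )
        | "Standardize" >> beam.Map(lambda line: line.upper())
        | "Write" >> beam.io.WriteToText(
            "gs://covid-19-johnshopkins/standardized/out"
        )
    )
```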
