This project demonstrates the design and implementation of a basic Data LakeHouse Architecture Pipeline using Google Cloud Platform (GCP) services. The pipeline is structured into seven key layers: Orchestration, Ingestion, Storage, Processing, Analytics, Data Governance, and Visualization. Each layer plays a critical role in transforming raw data into actionable insights, ensuring scalability, reliability, and automation at every stage of the data workflow.
We would like to invite you to explore our documentation guide, which outlines the setup steps for creating a simple Data Lakehouse in Google Cloud. Our team utilized tools like Cloud Functions, Google Cloud Storage, Dataproc, BigQuery, and Looker Studio to build this streamlined solution.
- Part 1 - Data Ingestion Layer with Cloud Function
- Part 2a - Data Processing Layer with Dataproc
- Part 2b - Data Processing Layer with BigQuery SQL
- Part 3 - Analytics Layer with BigQuery
- Part 4 - Orchestration Layer with Composer
- Part 5 - Data Governance Layer with Dataplex
- Part 6 - Visualization Layer with Looker Studio
And here is the simple dataset about CO2 emissions in Canada that we use for data processing, analytics, modeling, and visualization with tools in Google Cloud!
Data is collected from various sources, including Kaggle, through automated Python scripts and Cloud Functions. This data is then stored securely in Google Cloud Storage, which serves as a scalable and durable data lake. The data undergoes ETL (Extract, Transform, Load) processing using Dataproc (for distributed data processing) and BigQuery SQL. Once the data is prepared, it is loaded into BigQuery for fast querying and analytics. Finally, insights are visualized through Looker Studio, providing stakeholders with interactive dashboards and reports to drive data-driven decisions.
The entire data pipeline is orchestrated using Google Cloud Composer, which is based on Apache Airflow. Composer automates and coordinates the key tasks of the pipeline (sketched in the example after this list):
- Data Ingestion Layer: Automates the ingestion of CO2 emissions data from Kaggle using the Kaggle API.
- Data Processing Layer: Manages the coordination between Dataproc (PySpark) and BigQuery SQL for data transformation and loading into BigQuery.
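Here is a minimal Airflow DAG sketch of how Composer could chain these tasks. The project ID, bucket, function name, cluster name, and SQL below are illustrative placeholders, not the project's actual configuration:

```python
# Hypothetical Composer (Airflow) DAG wiring the pipeline together.
# All names (project, bucket, function, cluster, SQL) are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.functions import (
    CloudFunctionInvokeFunctionOperator,
)

with DAG(
    dag_id="co2_lakehouse_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # 1. Ingestion: call the Cloud Function that pulls the Kaggle dataset into GCS.
    ingest = CloudFunctionInvokeFunctionOperator(
        task_id="ingest_co2_data",
        function_id="ingest-co2-data",   # placeholder function name
        location="us-central1",
        project_id="my-gcp-project",     # placeholder project
        input_data={},
    )

    # 2. Processing: run the PySpark cleaning job on Dataproc.
    process = DataprocSubmitJobOperator(
        task_id="process_with_dataproc",
        project_id="my-gcp-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "co2-cluster"},  # placeholder cluster
            "pyspark_job": {
                "main_python_file_uri": "gs://my-bucket/jobs/clean_co2.py"
            },
        },
    )

    # 3. Loading: run a BigQuery SQL transformation on the cleaned data.
    transform = BigQueryInsertJobOperator(
        task_id="transform_in_bigquery",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-gcp-project.co2.emissions`",  # placeholder SQL
                "useLegacySql": False,
            }
        },
    )

    ingest >> process >> transform
```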
The ingestion layer collects raw data from various sources, including APIs and pre-existing datasets. For this project, the dataset on Carbon Dioxide (CO2) emissions by vehicles is fetched from Kaggle using the Kaggle API, and the process is automated using Google Cloud Functions. This ensures seamless, repeatable data collection and ingestion into Google Cloud Storage.
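A minimal sketch of such a Cloud Function is shown below. The dataset slug, bucket environment variable, and GCS paths are assumptions, and the Kaggle client expects KAGGLE_USERNAME and KAGGLE_KEY to be set in the function's environment:

```python
# Hypothetical HTTP-triggered Cloud Function: download the Kaggle dataset and
# land it in the raw zone of the data lake. The dataset slug, bucket env var,
# and paths are placeholders.
import os
import tempfile

from google.cloud import storage


def ingest_co2_data(request):
    # Imported inside the handler so Kaggle reads its credentials at call time.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Download and unzip the CO2 emissions dataset (placeholder slug).
        api.dataset_download_files(
            "debajyotipodder/co2-emission-by-vehicles", path=tmp_dir, unzip=True
        )

        # Upload every extracted file into the raw zone of the bucket.
        bucket = storage.Client().bucket(os.environ["RAW_BUCKET"])
        for name in os.listdir(tmp_dir):
            blob = bucket.blob(f"raw/co2/{name}")
            blob.upload_from_filename(os.path.join(tmp_dir, name))

    return "ingestion complete", 200
```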
In the Data Storage Layer, raw data is securely stored in Google Cloud Storage, a cloud-based data lake solution. This provides a scalable and cost-effective way to store large datasets, ensuring they are easily accessible for further processing.
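For example, once ingestion has run, the raw zone can be inspected with the Cloud Storage client; the bucket name and prefix below follow the assumed layout from the sketch above:

```python
# List the raw CO2 files in the data lake (bucket and prefix are placeholders).
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
for blob in client.list_blobs("my-bucket", prefix="raw/co2/"):
    print(blob.name, blob.size)
```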
In the Data Processing Layer, the raw data undergoes cleansing, transformation, and preparation. Using Dataproc (PySpark), missing values are handled, duplicates are removed, and the data is transformed into a usable format. Afterward, the processed data is loaded into BigQuery for advanced analytics. The entire ETL workflow is automated and orchestrated by Google Cloud Composer.
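The sketch below shows the shape of such a PySpark job as it might run on Dataproc; the paths, table names, and cleaning steps are simplified placeholders rather than the project's exact job:

```python
# Illustrative PySpark job for Dataproc: read raw CSVs from GCS, drop
# duplicates and rows with missing values, and write the result to BigQuery.
# Bucket and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-co2-emissions").getOrCreate()

# Read the raw CO2 emissions CSVs from the data lake.
df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("gs://my-bucket/raw/co2/*.csv")
)

# Basic cleansing: remove exact duplicates and rows containing nulls.
clean = df.dropDuplicates().na.drop()

# Write to BigQuery via the Spark-BigQuery connector available on Dataproc.
(
    clean.write.format("bigquery")
    .option("table", "my-gcp-project.co2.emissions")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```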
The Data Analytics Layer leverages BigQuery for fast querying and data analysis. BigQuery supports large-scale analytics, allowing for in-depth analysis of CO2 emissions data. The analytics layer also supports BigQuery ML for machine learning model development, enabling the prediction of future trends and uncovering insights related to the data, such as the impact of fuel type on CO2 emissions.
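As one illustration, a BigQuery ML linear regression predicting CO2 emissions from fuel type and fuel consumption could be trained and evaluated like this; the dataset, table, and column names are assumed, not the project's actual schema:

```python
# Illustrative BigQuery ML usage via the Python client: train a linear
# regression on the cleaned table, then evaluate it. Dataset, table, and
# column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

create_model_sql = """
CREATE OR REPLACE MODEL `co2.emissions_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['co2_emissions_g_km']) AS
SELECT
  fuel_type,
  fuel_consumption_comb_l_100km,
  co2_emissions_g_km
FROM `co2.emissions`
WHERE co2_emissions_g_km IS NOT NULL
"""
client.query(create_model_sql).result()  # block until training completes

# Report evaluation metrics (R^2, mean absolute error, etc.).
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `co2.emissions_model`)"):
    print(dict(row))
```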
Finally, the Data Visualization Layer uses Looker Studio to create interactive dashboards and reports, helping stakeholders visualize trends and insights derived from the analytics layer. For example, visualizations could display the vehicle models contributing to emissions trends or the impact of fuel consumption on CO2 emissions.
Layer | Tools | Uses |
---|---|---|
Orchestration | Google Cloud Composer | Automates task scheduling and coordinates data flow. |
Ingestion | Cloud Functions | Automates data collection from APIs, including Kaggle. |
Storage | Google Cloud Storage | Provides scalable, reliable, and cost-efficient data storage. |
Processing | Dataproc, BigQuery SQL | Distributed data processing (PySpark) and relational data management. |
Analytics | BigQuery, BigQuery ML | Fast querying, analytics, and machine learning capabilities. |
Data Governance | Dataplex | Automates data cataloging, data lineage tracking, and data quality scans. |
Visualization | Looker Studio | Creates interactive dashboards and reports for data insights. |
This Data LakeHouse Architecture Pipeline is a modern, cloud-native solution that seamlessly integrates data ingestion, processing, analytics, and visualization. By leveraging the power of Google Cloud services, this pipeline ensures a robust, scalable, and efficient data workflow, capable of handling large datasets and providing valuable insights through advanced analytics and interactive visualizations.
Thank you, team, for your hard work, support, and collaboration:
- Bong Hui Xin
- Legawatthi A/P Thiyagarajan
- Nitya A/P Ponnusamy
- Nurfatin Syahirah Binti Sukaimi
- Tan Yong Sheng