youtube-data-etl

The purpose of this project is to efficiently collect, process, and store YouTube data using a combination of Apache Airflow, Apache Spark, and MongoDB.

Prerequisites

Before setting up and running the YouTube Data Pipeline project, make sure you have the following prerequisites in place:

  1. Environment Setup:

    • Install and configure Apache Airflow.
    • Install and configure Apache Spark on the target machine or cluster.
  2. Youtube Developer Account:

    • Obtain YouTube API credentials.
  3. MongoDB:

    • Install MongoDB.
  4. Access and Permissions:

    • Grant the necessary permissions for YouTube API access and your MongoDB instance.
  5. Data Schema Understanding:

    • Familiarize yourself with the structure of YouTube data returned by the API.
  6. Apache Airflow Plugins:

    • Identify and install required Airflow plugins based on project needs.
  7. Spark Job Configuration:

    • Develop Spark jobs and ensure the correct setup of dependencies and configurations.
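With API credentials in hand (prerequisite 2), requests go to the YouTube Data API v3 REST endpoints. As a minimal sketch, the helper below builds a `videos.list` request URL; the `api_key` and video IDs are placeholders you would supply from your own developer account:

```python
# Hedged sketch: constructing a YouTube Data API v3 videos.list request URL.
# The API key and video IDs below are placeholders, not real credentials.
from urllib.parse import urlencode

YOUTUBE_API_BASE = "https://www.googleapis.com/youtube/v3"

def build_videos_url(api_key, video_ids, parts=("snippet", "statistics")):
    """Return the request URL for the videos.list endpoint.

    `parts` selects which resource sections the API should return,
    e.g. snippet (title, publish date) and statistics (view counts).
    """
    params = {
        "part": ",".join(parts),
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return f"{YOUTUBE_API_BASE}/videos?{urlencode(params)}"
```

Fetching the URL (e.g. with `requests.get`) returns a JSON payload whose `items` list holds one object per video; familiarity with that structure is what prerequisite 5 refers to.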

Getting Started

Follow these steps to set up and run the YouTube Data Pipeline:

  1. Clone the repository.
  2. Install the dependencies listed in requirements.txt:
    pip install -r requirements.txt
    
  3. Run the Airflow DAG to initiate the data pipeline.
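Conceptually, the DAG chains extract (call the YouTube API), transform (reshape the raw payload, which the project runs as a Spark job), and load (write to MongoDB). The transform step can be sketched as a plain function; the field names here are illustrative assumptions about the API payload, not the repository's actual Spark code:

```python
# Hedged sketch of the transform stage: flatten raw YouTube API "videos"
# items into flat records suitable for insertion into MongoDB.
# In the real pipeline this reshaping would run inside a Spark job.

def transform_items(api_items):
    """Map each raw API item to a flat record.

    Each item carries nested "snippet" and "statistics" objects;
    we pull out a few commonly used fields and coerce counts to int.
    """
    records = []
    for item in api_items:
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        records.append({
            "video_id": item.get("id"),
            "title": snippet.get("title"),
            "published_at": snippet.get("publishedAt"),
            "view_count": int(stats.get("viewCount", 0)),
        })
    return records
```

An Airflow PythonOperator (or TaskFlow task) would wrap each stage and declare extract >> transform >> load ordering; the exact task names depend on the DAG in this repository.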
