youtube-data-etl

The purpose of this project is to efficiently collect, process, and store YouTube data using a combination of Apache Airflow, Apache Spark, and MongoDB.

Prerequisites

Before setting up and running the YouTube Data Pipeline project, make sure you have the following prerequisites in place:

  1. Environment Setup:

    • Install and configure Apache Airflow.
    • Install and configure Apache Spark on the target machine or cluster.
  2. Youtube Developer Account:

    • Obtain YouTube API credentials.
  3. MongoDB:

    • Install MongoDB.
  4. Access and Permissions:

    • Grant the necessary permissions for YouTube API access and your MongoDB instance.
  5. Data Schema Understanding:

    • Familiarize yourself with the structure of YouTube data returned by the API.
  6. Apache Airflow Plugins:

    • Identify and install required Airflow plugins based on project needs.
  7. Spark Job Configuration:

    • Develop Spark jobs and ensure the correct setup of dependencies and configurations.
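With API credentials in hand (prerequisite 2), requests go to the YouTube Data API v3 REST endpoints. As a minimal sketch, the helper below builds a `videos.list` request URL; the `api_key` and video IDs are placeholders you would supply from your own developer account:

```python
# Hedged sketch: constructing a YouTube Data API v3 videos.list request URL.
# The API key and video IDs below are placeholders, not real credentials.
from urllib.parse import urlencode

YOUTUBE_API_BASE = "https://www.googleapis.com/youtube/v3"

def build_videos_url(api_key, video_ids, parts=("snippet", "statistics")):
    """Return the request URL for the videos.list endpoint.

    `parts` selects which resource sections the API should return,
    e.g. snippet (title, publish date) and statistics (view counts).
    """
    params = {
        "part": ",".join(parts),
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return f"{YOUTUBE_API_BASE}/videos?{urlencode(params)}"
```

Fetching the URL (e.g. with `requests.get`) returns a JSON payload whose `items` list holds one object per video; familiarity with that structure is what prerequisite 5 refers to.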

Getting Started

Follow these steps to set up and run the YouTube Data Pipeline:

  1. Clone the repository.
  2. Install the dependencies listed in requirements.txt:
    pip install -r requirements.txt
    
  3. Run the Airflow DAG to initiate the data pipeline.
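Conceptually, the DAG chains extract (call the YouTube API), transform (reshape the raw payload, which the project runs as a Spark job), and load (write to MongoDB). The transform step can be sketched as a plain function; the field names here are illustrative assumptions about the API payload, not the repository's actual Spark code:

```python
# Hedged sketch of the transform stage: flatten raw YouTube API "videos"
# items into flat records suitable for insertion into MongoDB.
# In the real pipeline this reshaping would run inside a Spark job.

def transform_items(api_items):
    """Map each raw API item to a flat record.

    Each item carries nested "snippet" and "statistics" objects;
    we pull out a few commonly used fields and coerce counts to int.
    """
    records = []
    for item in api_items:
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        records.append({
            "video_id": item.get("id"),
            "title": snippet.get("title"),
            "published_at": snippet.get("publishedAt"),
            "view_count": int(stats.get("viewCount", 0)),
        })
    return records
```

An Airflow PythonOperator (or TaskFlow task) would wrap each stage and declare extract >> transform >> load ordering; the exact task names depend on the DAG in this repository.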
