---
sidebar_position: 4
authors:
  - plaosunthara
  - tklae
minutesToComplete: 40
---

# Data Milky Way: Brief History (Part 3) - Data Processing

<iframe width="560" height="315" src="https://www.youtube.com/embed/Uc-Wtem-lyw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"></iframe>

## Evolution of Data Processing

![map-reduce-processing](map-reduce-processing.png)
![data-processing-vision](data-processing-vision.png)

Evaluate your own project’s needs. We will cover streaming in the second half of the course, so more material will be provided then.

## Orchestration Core Concepts

But how do we make our pipeline flow? 🌊

  • Data Engineering workflows often involve transforming and transferring data from one place to another.

  • We want to combine data from different locations, and we want to do this in a way that is reproducible and scalable when there are updates to the data or to our workflows.

  • Workflows in real-life have multiple steps and stages. We want to be able to orchestrate these steps and stages and monitor the progress of our workflows.

  • Sometimes, everything might work fine with just CRON jobs.

  • But other times, you might want to control the state transitions of these steps:

    • e.g. if Step A doesn’t run properly, don’t run Step B (its input data could be corrupt); run Step C instead.
    • Once again, the concept of Directed Acyclic Graphs (DAGs) can come to our rescue.
  • Bonus Content: Apache Airflow (32-minute video) is one nice way of setting up DAGs to orchestrate jobs 🌈

    • Note: Airflow is primarily designed as a task orchestration tool.
    • You can trigger tasks on the Airflow cluster itself or on remote targets (e.g. AWS Fargate, Databricks, etc.).
    • NOT designed for transferring large amounts of actual data.
    • Reference Documentation
    • Play around with Airflow locally (very optional!)
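The failure-dependent branching described above (skip Step B when Step A fails, run Step C instead) can be sketched in plain Python. This is a minimal illustration of the concept only, not Airflow code; all function names (`step_a`, `run_pipeline`, etc.) are made up for this example.

```python
# Illustrative sketch of failure-dependent branching in a pipeline.
# Not Airflow code — all names here are hypothetical.

def step_a():
    """Simulate an upstream step that fails."""
    raise RuntimeError("upstream data missing")

def step_b():
    """Main transformation; must only run after step_a succeeds."""
    return "transformed data"

def step_c():
    """Fallback/cleanup branch taken when step_a fails."""
    return "ran fallback"

def run_pipeline():
    results = {}
    try:
        results["A"] = step_a()
        results["B"] = step_b()  # only reached if A succeeded
    except Exception as exc:
        results["A"] = f"failed: {exc}"
        results["C"] = step_c()  # B is skipped; fallback branch runs
    return results
```

A real orchestrator such as Airflow expresses the same idea declaratively (task dependencies plus trigger rules) rather than with try/except, and adds scheduling, retries, and monitoring on top.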

## Practical Data Workloads

![big-data-sword](big-data-sword.png)

We’re here to teach you big data skills, but in reality...

### Single-Node vs. Cluster

Not everything is Big Data! You don’t always need Spark! Sometimes Pandas deployed on a single-node function/container is just fine.
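As a tiny illustration of a workload that doesn’t need a cluster (assuming pandas is installed; the data and column names are invented):

```python
import pandas as pd

# A "small data" workload: comfortably handled by pandas on one node.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "APAC"],
    "sales":  [10,   20,   5,    15,   8],
})

# Total sales per region — no Spark cluster required.
totals = df.groupby("region")["sales"].sum()
```

If this dataset were billions of rows instead of five, the same `groupby` logic would be a candidate for Spark; the point is to size the tool to the data.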

### Batch vs. Streaming

Streaming isn’t always the solution! (Optional reading, ~10 minutes)
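The core trade-off can be sketched in a few lines of plain Python (illustrative names and data, not any real framework): batch computes once over the full dataset, while streaming maintains running state as each event arrives.

```python
# Illustrative batch vs. streaming sketch — hypothetical event data.

events = [3, 1, 4, 1, 5]  # e.g. purchase amounts arriving over time

def batch_total(all_events):
    """Batch: wait until all data is available, then compute once."""
    return sum(all_events)

def stream_totals(event_iter):
    """Streaming: update a running total as each event arrives."""
    total = 0
    for e in event_iter:
        total += e
        yield total  # state after each event
```

Both arrive at the same final answer; streaming just delivers intermediate results sooner, at the cost of managing state, ordering, and failure recovery — which is why it isn’t always the solution.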

## Orchestration Options

Here, it’s useful just to know a few of the names; going into detail is not necessary for this course.

**DAG-based approaches:**

**Event-Driven + Declarative:**

**Other triggers:**