Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add helper functions for uploading target directory artifacts to remote cloud storages #1389

Merged
merged 10 commits into from
Dec 17, 2024

Conversation

pankajkoti
Copy link
Contributor

@pankajkoti pankajkoti commented Dec 13, 2024

This PR introduces helper functions that can be passed as callable callbacks for Cosmos tasks to execute post-task execution. These helper functions enable the uploading of artifacts (from the project's target directory) to various cloud storage providers, including AWS S3, Google Cloud Storage (GCS), Azure WASB, and general remote object stores using Airflow’s ObjectStoragePath.

Key Changes

Adds a cosmos/io.py module that includes the following helper functions

  1. upload_artifacts_to_aws_s3

    • Uploads artifact files from a task’s local target directory to an AWS S3 bucket.
    • Supports dynamically appending DAG metadata (e.g., dag_id, task_id, run_id, and try number) to the uploaded file paths.
    • Utilizes S3Hook from the airflow.providers.amazon.aws module.
  2. upload_artifacts_to_gcp_gs

    • Uploads artifact files from a task’s local target directory to a Google Cloud Storage (GCS) bucket.
    • Appends DAG-related context to the GCS object names for better traceability.
    • Leverages GCSHook from airflow.providers.google.cloud.
  3. upload_artifacts_to_azure_wasb

    • Uploads artifact files from a task’s local target directory to an Azure Blob Storage container.
    • Automatically structures blob names with metadata, including dag_id, task_id, and execution details.
    • Utilizes WasbHook from the airflow.providers.microsoft.azure module.
  4. upload_artifacts_to_cloud_storage

    • A generic helper function that uploads artifacts from a task’s local target directory to remote object stores configured via Airflow’s ObjectStoragePath (Airflow 2.8+ feature).
    • Supports custom remote storage configurations such as remote_target_path and remote_target_path_conn_id.
    • Dynamically constructs file paths that include DAG metadata for clear organization.

These helpers functions can be passed as the callback argument to DbtDAG or to your Dag instance as demonstrated in the example DAGs dev/dags/cosmos_callback_dag.py and dev/dags/example_operators.py correspondingly. You can also pass callback_args as shown in the example DAGs. These helper functions are mere examples of how callback functions can be written and passed to your operators/DAGs to be executed after task completions. Taking reference of these helper functions, you can write your own callback function and pass those.

Limitations

  1. This PR has been tested and is currently supported only in ExecutionMode.LOCAL. We encourage the community to contribute by adding callback support for other execution modes as needed, using the implementation for ExecutionMode.LOCAL as a reference.

closes: #1350
closes: #976
closes: #867
closes: #801
closes: #1292
closes: #851
closes: #1351
related: #1293
related: #1349

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

Copy link

cloudflare-workers-and-pages bot commented Dec 13, 2024

Deploying astronomer-cosmos with  Cloudflare Pages  Cloudflare Pages

Latest commit: ae83d6b
Status: ✅  Deploy successful!
Preview URL: https://46387f83.astronomer-cosmos.pages.dev
Branch Preview URL: https://artifacts-upload-helpers.astronomer-cosmos.pages.dev

View logs

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pankajkoti , it's amazing to see the progress here, we'll be able to close almost 10 tickets of our backlog. A minor feedback, while this is in draft mode

cosmos/helpers.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 98.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 96.28%. Comparing base (cbd8622) to head (ae83d6b).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
cosmos/io.py 98.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1389      +/-   ##
==========================================
+ Coverage   96.21%   96.28%   +0.07%     
==========================================
  Files          67       68       +1     
  Lines        4071     4149      +78     
==========================================
+ Hits         3917     3995      +78     
  Misses        154      154              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pankajkoti pankajkoti force-pushed the artifacts-upload-helpers branch from 6138522 to e453233 Compare December 17, 2024 07:24
@pankajkoti pankajkoti force-pushed the artifacts-upload-helpers branch from 2a0a858 to 9d6fa8f Compare December 17, 2024 08:46
@pankajkoti pankajkoti force-pushed the artifacts-upload-helpers branch from 9d6fa8f to e4bb6a4 Compare December 17, 2024 09:01
@pankajkoti pankajkoti marked this pull request as ready for review December 17, 2024 09:46
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 17, 2024
@pankajkoti pankajkoti requested a review from tatiana December 17, 2024 09:46
@pankajkoti pankajkoti force-pushed the artifacts-upload-helpers branch from 1203b73 to ae83d6b Compare December 17, 2024 13:06
@pankajkoti pankajkoti added the execution:callback Tasks related to callback when executing tasks label Dec 17, 2024
Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for addressing all the feedback, @pankajkoti !

This is probably a breaking record for a Cosmos PR: to close seven tickets in one go. Really excited to share this with the community in 1.8

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 17, 2024
@tatiana tatiana merged commit 0000f80 into main Dec 17, 2024
64 checks passed
@tatiana tatiana deleted the artifacts-upload-helpers branch December 17, 2024 13:53
@tatiana tatiana mentioned this pull request Dec 17, 2024
pankajkoti added a commit that referenced this pull request Dec 18, 2024
… in ExecutionMode.VIRTUALENV (#1401)

Following up on PR #1389 wherein we add helper functions to demonstrate
how to
utilise the callback functionality in `ExecutionMode.LOCAL`, I also
tested this functionality
with `ExecutionMode.VIRTUALENV` since the VirtualEnv mode operators
inherit the Local
mode operators and it does seem to support the callback functionality
well. I was well able
to test this by adding the callback arg to the VirtualEnv example DAG we
have and see that
the files do get uploaded to remote store. I have added the argument in
comments in the example
DAG so that users could take a reference of that, however I have left
the argument as commented
so that we do not always keep on uploading the files in our CI runs and
given that we already this
in the `example_operators.py` DAG.

The PR also updates the documentation to reflect that the callback
functionality is supported in
the `ExecutionMode.VIRTUALENV` too.

closes: #1399
tatiana added a commit that referenced this pull request Dec 20, 2024
**New Features**

* Support customizing Airflow operator arguments per dbt node by @wornjs
in #1339. [More
information](https://astronomer.github.io/astronomer-cosmos/getting_started/custom-airflow-properties.html).
* Support uploading dbt artifacts to remote cloud storages via callback
by @pankajkoti in #1389. [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/callbacks.html).
* Add support to ``TestBehavior.BUILD`` by @tatiana in #1377.
[Documentation](https://astronomer.github.io/astronomer-cosmos/configuration/testing-behavior.html).
* Add support for the "at" operator when using ``LoadMode.DBT_MANIFEST``
or ``CUSTOM`` by @benjy44 in #1372
* Add dbt clone operator by @pankajastro in #1326, as documented in
[here](https://astronomer.github.io/astronomer-cosmos/getting_started/operators.html).
* Support rendering tasks with non-ASCII characters by @t0momi219 in
#1278 [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/task-display-name.html)
* Add warning callback on source freshness by @pankajastro in #1400
[Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/source-nodes-rendering.html#on-warning-callback-callback)
* Add Oracle Profile mapping by @slords and @pankajkoti in #1190 and
#1404
* Emit telemetry to Scarf during DAG run by @tatiana in #1397
* Save tasks map as ``DbtToAirflowConverter`` property by
@internetcoffeephone and @hheemskerk in #1362

**Bug Fixes**

* Fix the mock value of port in ``TrinoBaseProfileMapping`` to be an
integer by @dwolfeu #1322
* Fix access to the ``dbt docs`` menu item outside of Astro cloud by
@tatiana in #1312
* Add missing ``DbtSourceGcpCloudRunJobOperator`` in module
``cosmos.operators.gcp_cloud_run_job`` by @anai-s in #1290
* Support building ``DbtDag`` without setting paths in ``ProjectConfig``
by @tatiana in #1307
* Fix parsing dbt ls outputs that contain JSONs that are not dbt nodes
by @tatiana in #1296
* Fix Snowflake Profile mapping when using AWS default region by
@tatiana in #1406
* Fix dag rendering for taskflow + DbtTaskGroup combo by @pankajastro in
#1360

**Enhancements**

* Improve dbt command execution logs to troubleshoot ``None`` values by
@tatiana in #1392
* Add logging of stdout to dbt graph run_command by @KarolGongola in
#1390
* Save tasks map as DbtToAirflowConverter property by
@internetcoffeephone and @hheemskerk in #1362
* Support rendering build operator task-id with non-ASCII characters by
@pankajastro in #1415

**Docs**

* Remove extra ` char from docs by @pankajastro in #1345
* Add limitation about copying target dir files to remote by @pankajkoti
in #1305
* Generalise example from README by @ReadytoRocc in #1311
* Add security policy by @tatiana, @chaosmaw and @lzdanski in # 1385
* Mention in documentation that the callback functionality is supported
in ``ExecutionMode.VIRTUALENV`` by @pankajkoti in #1401

**Others**

* Restore Jaffle Shop so that ``basic_cosmos_dag`` works as documented
by @tatiana in #1374
* Remove Pytest durations from tests scripts by @tatiana in #1383
* Remove typing-extensions as dependency by @pankajastro in #1381
* Pin dbt-databricks version to < 1.9 by @pankajastro in #1376
* Refactor ``dbt-sqlite`` tests to use ``dbt-postgres`` by @pankajastro
in #1366
* Remove 'dbt-core<1.8.9' pin by @tatiana in #1371
* Remove dependency ``eval_type_backport`` by @tatiana in #1370
* Enable kubernetes tests for dbt>=1.8 by @pankajastro #1364
* CI Workaround: Pin dbt-core, Disable SQLite Tests, and Correctly
Ignore Clone Test to Pass CI by @pankajastro in #1337
* Enable Azure task in the remote store manifest example DAG by
@pankajkoti in #1333
* Enable GCP remote manifest task by @pankajastro in #1332
* Add exempt label option in GH action stale job by @pankajastro in
#1328
* Add integration test for source node rendering by @pankajastro in
#1327
* Fix vulnerability issue on docs dependency by @tatiana in #1313
* Add postgres pod status check for k8s tests in CI by @pankajkoti in
#1320
* [CI] Reduce the amount taking to run tests in the CI from 5h to 11min
by @tatiana in #1297
* Enable secret detection precommit check by @pankajastro in #1302
* Fix security vulnerability, by not pinning Airflow 2.10.0 by @tatiana
in #1298
* Fix Netlify build timeouts by @tatiana in #1294
* Add stalebot to label/close stale PRs and issues by @tatiana in #1288
* Unpin dbt-databricks version by @pankajastro in #1409
* Fix source resource type tests by @pankajastro in #1405
* Increase performance tests models by @tatiana in #1403
* Drop running 1000 models in the CI by @pankajkoti in #1411
* Fix releasing package to PyPI by @tatiana in #1396
* Pre-commit hook updates in #1394, #1373, #1358, #1340, #1331, #1314,
#1301

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <pankaj.singh@astronomer.io>

Closes: #1193

---------

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <98807258+pankajastro@users.noreply.github.com>
pankajkoti pushed a commit that referenced this pull request Dec 26, 2024
)

This PR addresses an issue where the `callback` callable parameter's
handling was modified in PR
#1389 leading to a
change in
the way keyword arguments like context were passed to the callable.

However, the Docs Operator has its own implementation of the callback
callable,
which only expects a single parameter, project_dir. As a result, passing
extra
keyword arguments like context is causing a mismatch and resulting in
the error described in #1420.

This PR adds the `kwargs` param in `upload_to_cloud_storage` method of the Docs operators

We have integration tests for these operator but look like CI does have
not required setup and it get ignored.

https://github.com/astronomer/astronomer-cosmos/blob/main/dev/dags/dbt_docs.py

related: #1420
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
execution:callback Tasks related to callback when executing tasks execution:local Related to Local execution environment lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
None yet
2 participants