Configure Airflow tasks using dbt model meta #1339

Merged
merged 17 commits into from
Dec 17, 2024

Conversation

wornjs
Contributor

@wornjs wornjs commented Nov 25, 2024

Description

The various dbt models have unique characteristics, and some may require the use of custom pools, queues, or other specific configurations. To support such cases, this update introduces the ability to add necessary information in the meta section of the dbt model.yaml. This metadata is then passed as kwargs to the corresponding Airflow tasks, enabling model-specific customization and enhanced task configuration.

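Here is a sample.

DbtTaskGroup - default_args for all dbt models

    dbt_task_group = DbtTaskGroup(
        project_config=...,  # ProjectConfig instance (value omitted in the original)
        profile_config=profile_config,  # a ProfileConfig instance
        default_args={"pool": dbt_pool},  # dbt_pool: name of the pool shared by all models
    )

dbt model YAML - operator_kwargs for a single model
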
version: 2

models:
  - name: name
    description: description
    meta:
      owner: 'jaegwon.seo@toss.im'
      cosmos:
        operator_kwargs:
          pool: abcd
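
To make the override order concrete, here is a minimal sketch of how the group-wide `default_args` and a model's `meta.cosmos.operator_kwargs` could be layered. `resolve_task_kwargs` is a hypothetical helper written for illustration, not the code added in this PR, and the dictionaries stand in for the parsed YAML above.

```python
from typing import Any


def resolve_task_kwargs(default_args: dict[str, Any], model_meta: dict[str, Any]) -> dict[str, Any]:
    """Return task kwargs: group-wide defaults overridden by the model's own settings."""
    # Walk cosmos -> operator_kwargs, tolerating missing keys at every level.
    overrides = model_meta.get("cosmos", {}).get("operator_kwargs", {})
    # Later keys win, so model-specific values replace the group defaults.
    return {**default_args, **overrides}


default_args = {"pool": "dbt_pool"}
model_with_override = {
    "owner": "jaegwon.seo@toss.im",
    "cosmos": {"operator_kwargs": {"pool": "abcd"}},
}
model_without_override = {"owner": "jaegwon.seo@toss.im"}

custom = resolve_task_kwargs(default_args, model_with_override)
general = resolve_task_kwargs(default_args, model_without_override)
print(custom)   # {'pool': 'abcd'}
print(general)  # {'pool': 'dbt_pool'}
```

A model without a `cosmos` entry keeps the general pool; a model that sets `operator_kwargs` gets its custom pool.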

result

General pool:
[Screenshot 2024-11-25 10:15:26 PM]
Custom pool:
[Screenshot 2024-11-25 10:15:40 PM]

Related Issue(s)

Closes: #881
Closes: #1325

Breaking Change?

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Nov 25, 2024
@dosubot dosubot bot added the area:config Related to configuration, like YAML files, environment variables, or executer configuration label Nov 25, 2024
@wornjs
Contributor Author

wornjs commented Nov 25, 2024

It would be great to add documentation for the features introduced in this PR. However, looking at the current project, it seems there aren’t any markdown files for documentation apart from the CONTRIBUTING file. Where do you think would be the best place to add this documentation?

@pankajastro
Contributor

pankajastro commented Nov 26, 2024

> It would be great to add documentation for the features introduced in this PR. However, looking at the current project, it seems there aren’t any markdown files for documentation apart from the CONTRIBUTING file. Where do you think would be the best place to add this documentation?

We use rst. You can find docs at https://github.com/astronomer/astronomer-cosmos/tree/main/docs

@pankajastro
Contributor

Could you please rebase this PR?


codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.21%. Comparing base (fdf7025) to head (7242dcf).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
cosmos/dbt/graph.py 86.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1339      +/-   ##
==========================================
- Coverage   96.24%   96.21%   -0.04%     
==========================================
  Files          67       67              
  Lines        4051     4070      +19     
==========================================
+ Hits         3899     3916      +17     
- Misses        152      154       +2     


@linchun3
Contributor

linchun3 commented Dec 5, 2024

related issue?
#881

Collaborator

@tatiana tatiana left a comment

Hi, @wornjs,

The per-node configuration has been a long-standing request (example: #881 (comment)), and your PR solves this. This is an exciting feature; thanks for contributing to Cosmos!

A question and two requests:

  1. What are the advantages/disadvantages of having these properties in meta as opposed to having them in config:

    meta:
      cosmos:
        pool: abcd

versus

    config:
      cosmos:
        pool: abcd

  2. Please, could you add tests to cover this feature?

  3. We'll also need docs.

If you can address these before 20 December, we can ship Cosmos 1.8. I'm tentatively adding this to that milestone.

@tatiana tatiana added this to the Cosmos 1.8.0 milestone Dec 10, 2024
@wornjs
Contributor Author

wornjs commented Dec 10, 2024

I'm going to add tests by the end of this week.

@dwreeves
Collaborator

dwreeves commented Dec 11, 2024

Hi, responding here instead of in #1325 as it seems the discussion has migrated here.

First, this PR is related to #881. This issue was for supporting arbitrary kwargs, not just a single kwarg.

I was a big fan of where we ended up with the API of that proposal, and there are a few differences.

The first difference, as @tatiana brings up, is that it uses config: instead of meta:. ~~I do not believe meta: is a valid attribute of a model specification. https://github.com/dbt-labs/dbt-core/blob/fc6167a2ee5291cf6c562eadcd975b48e6a34d65/core/dbt/artifacts/resources/v1/model.py#L38-L48 Please let me know if I am wrong, but I do believe it should be config: and not meta: based on the dbt-core source code.~~ I am wrong, it is! There is a chain of inheritance that leads to meta. My mistake. With that out of the way, I am not sure what I prefer between config and meta, to be honest. I would ask the same question as Tatiana again then, i.e. what do you see as the difference between one or the other and why should one (in this case meta) be preferred?

The second difference is, as mentioned, it supports arbitrary kwargs, not just pool. There are a lot of reasons why someone may want to inject many different kwargs into their operators, not just pool.

The third difference is that the namespace was cosmos: operator_args: pool:, not just cosmos: pool:. I think this is really important once you move the API from just pool to any kwarg. The issue is, in the future, we may want to add additional options that are not actually kwargs (or alternatively allow users to inject such things themselves), so we would need to keep the kwargs separate from things that are not kwargs.

So my final proposal would be this:

version: 2
models:
  - name: model_a
    config:
      alias: model_a
      cosmos:
        operator_args:
          pool: my-pool-here

(or alternatively, if using meta:)

version: 2
models:
  - name: model_a
    config:
      alias: model_a
    meta:
      cosmos:
        operator_args:
          pool: my-pool-here

^ So the key difference is there is one extra dict, i.e. operator_args.

But really, I think we should just **unpack the whole cosmos.operator_args into the operator so that users can do things like this:

version: 2
models:
  - name: model_a
    config:
      alias: model_a
      cosmos:
        operator_args:
          pool: my-pool-here
          retries: 4
          conn_id: special_conn_id
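
As a rough sketch of what unpacking `cosmos.operator_args` into the operator could look like, the example below uses `FakeOperator`, a stand-in dataclass defined here (not an Airflow or Cosmos class), and `build_operator`, a hypothetical helper. The dictionary mirrors the YAML above after parsing.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeOperator:
    """Stand-in for an Airflow operator; only the fields used in this example."""

    task_id: str
    pool: str = "default_pool"
    retries: int = 0
    conn_id: Optional[str] = None


def build_operator(task_id: str, node_config: dict[str, Any]) -> FakeOperator:
    """Create the operator for one dbt node, forwarding its operator_args."""
    operator_args = node_config.get("cosmos", {}).get("operator_args", {})
    # ** unpacking forwards every key as a keyword argument, so supporting
    # a new argument needs no dedicated handling here.
    return FakeOperator(task_id=task_id, **operator_args)


node_config = {
    "alias": "model_a",
    "cosmos": {
        "operator_args": {
            "pool": "my-pool-here",
            "retries": 4,
            "conn_id": "special_conn_id",
        }
    },
}
op = build_operator("model_a_run", node_config)
print(op)
```

One consequence of this design is that an unknown key in `operator_args` fails loudly with a `TypeError` when the operator is constructed, rather than being silently ignored.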

@tatiana tatiana mentioned this pull request Dec 11, 2024
2 tasks
@dwreeves
Collaborator

dwreeves commented Dec 11, 2024

I've confirmed both meta and config have the same merge behavior: both merge the top-level keys, but neither recursively merges, which is annoying because

  1. it makes it harder to decide whether cosmos should be in meta: or config:
  2. it also means there is no way for users to do something like +config: cosmos: operator_args: x: ... in dbt_project.yml and config(cosmos={"operator_args": {"y": ...}}) in a sql model and get both x and y 🫤

(The only way to support merging this way would be to use top-level attributes, but I think that's tedious and limiting.)

There is an issue open relating to this in dbt-core where someone suggests recursive merging for meta: dbt-labs/dbt-core#10946
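
To illustrate the difference, here is a small sketch contrasting a top-level-only merge with a recursive one. Both functions are written for this example and only model the behaviour described above; they are not dbt's implementation, and the values of `x` and `y` are placeholders.

```python
from typing import Any


def shallow_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Merge top-level keys only; a nested dict in `override` replaces the one in `base`."""
    return {**base, **override}


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Merge recursively: nested dicts are combined key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Placeholder values standing in for dbt_project.yml and a model-level config(...)
project_level = {"cosmos": {"operator_args": {"x": 1}}}
model_level = {"cosmos": {"operator_args": {"y": 2}}}

shallow = shallow_merge(project_level, model_level)
deep = deep_merge(project_level, model_level)
print(shallow)  # {'cosmos': {'operator_args': {'y': 2}}}
print(deep)     # {'cosmos': {'operator_args': {'x': 1, 'y': 2}}}
```

With the top-level-only merge, the model-level `cosmos` dict replaces the project-level one and `x` is lost; only the recursive merge keeps both `x` and `y`.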

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Dec 16, 2024
@wornjs
Contributor Author

wornjs commented Dec 16, 2024

I thought of config as settings related to dbt (e.g., materialized, tag), and meta as a way to display additional information beyond that, so I used the values in meta. Although config and meta are somewhat similar, I will follow the agreed-upon conventions.

@wornjs
Contributor Author

wornjs commented Dec 16, 2024

@tatiana
Sorry for the late commit, please check this.

Collaborator

@tatiana tatiana left a comment

@wornjs, thank you very much for adding tests and documentation. Your work will be one of the main features of Cosmos 1.8.

Based on what you and @dwreeves confirmed, there is little difference between config and meta in dbt; I'm happy with using meta, as you implemented. Thank you for checking/explaining.

There is only one last thing I'd like you to address before we release this feature (if possible, by 20th December). The request is to have, within the "cosmos" meta config, an "operator_args" key. It is a minor change of interface from what you originally proposed:

version: 2

models:
  - name: name
    description: description
    meta:
      owner: 'jaegwon.seo@toss.im'
      cosmos:
        operator_kwargs:
          pool: abcd

This suggestion, originally from @dwreeves, can be beneficial as we expand the usage of this feature. Let's say, for instance, we'd like users to be able to configure the ExecutionMode by dbt model. This is not an operator kwarg, so having the operator_args key would allow us to differentiate between properties that may affect a model execution but are not necessarily arguments passed to the operator itself.
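
A minimal sketch of that separation, assuming the key is spelled `operator_kwargs` as in the YAML above: `split_cosmos_meta` is a hypothetical helper, and `execution_mode` is an invented key used only to show a setting that is not an operator argument, not a supported Cosmos option.

```python
from typing import Any


def split_cosmos_meta(meta: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (operator kwargs, other Cosmos settings) from a model's meta dict."""
    # Copy so the caller's dict is not mutated by pop().
    cosmos = dict(meta.get("cosmos", {}))
    operator_kwargs = cosmos.pop("operator_kwargs", {})
    # Whatever remains under "cosmos" is not passed to the operator.
    return operator_kwargs, cosmos


meta = {
    "owner": "jaegwon.seo@toss.im",
    "cosmos": {
        "operator_kwargs": {"pool": "abcd"},
        "execution_mode": "kubernetes",  # hypothetical key, for illustration only
    },
}
operator_kwargs, other_settings = split_cosmos_meta(meta)
print(operator_kwargs)  # {'pool': 'abcd'}
print(other_settings)   # {'execution_mode': 'kubernetes'}
```

Keeping operator arguments under their own key means new per-model settings can be added later without colliding with arbitrary operator argument names.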

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 16, 2024
@wornjs
Contributor Author

wornjs commented Dec 16, 2024

@tatiana
I fixed it, please check ^^

Collaborator

@tatiana tatiana left a comment

Thank you very much @wornjs for improving Cosmos and adding this feature.

I can't wait to hear the feedback from the community. Given the release timeline, I'll merge even though the coverage could be improved - we can improve this in the future.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 17, 2024
@tatiana tatiana merged commit cbd8622 into astronomer:main Dec 17, 2024
61 of 63 checks passed
@tatiana tatiana mentioned this pull request Dec 17, 2024
tatiana added a commit that referenced this pull request Dec 20, 2024
**New Features**

* Support customizing Airflow operator arguments per dbt node by @wornjs
in #1339. [More
information](https://astronomer.github.io/astronomer-cosmos/getting_started/custom-airflow-properties.html).
* Support uploading dbt artifacts to remote cloud storages via callback
by @pankajkoti in #1389. [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/callbacks.html).
* Add support to ``TestBehavior.BUILD`` by @tatiana in #1377.
[Documentation](https://astronomer.github.io/astronomer-cosmos/configuration/testing-behavior.html).
* Add support for the "at" operator when using ``LoadMode.DBT_MANIFEST``
or ``CUSTOM`` by @benjy44 in #1372
* Add dbt clone operator by @pankajastro in #1326, as documented
[here](https://astronomer.github.io/astronomer-cosmos/getting_started/operators.html).
* Support rendering tasks with non-ASCII characters by @t0momi219 in
#1278 [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/task-display-name.html)
* Add warning callback on source freshness by @pankajastro in #1400
[Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/source-nodes-rendering.html#on-warning-callback-callback)
* Add Oracle Profile mapping by @slords and @pankajkoti in #1190 and
#1404
* Emit telemetry to Scarf during DAG run by @tatiana in #1397
* Save tasks map as ``DbtToAirflowConverter`` property by
@internetcoffeephone and @hheemskerk in #1362

**Bug Fixes**

* Fix the mock value of port in ``TrinoBaseProfileMapping`` to be an
integer by @dwolfeu in #1322
* Fix access to the ``dbt docs`` menu item outside of Astro cloud by
@tatiana in #1312
* Add missing ``DbtSourceGcpCloudRunJobOperator`` in module
``cosmos.operators.gcp_cloud_run_job`` by @anai-s in #1290
* Support building ``DbtDag`` without setting paths in ``ProjectConfig``
by @tatiana in #1307
* Fix parsing dbt ls outputs that contain JSONs that are not dbt nodes
by @tatiana in #1296
* Fix Snowflake Profile mapping when using AWS default region by
@tatiana in #1406
* Fix dag rendering for taskflow + DbtTaskGroup combo by @pankajastro in
#1360

**Enhancements**

* Improve dbt command execution logs to troubleshoot ``None`` values by
@tatiana in #1392
* Add logging of stdout to dbt graph run_command by @KarolGongola in
#1390
* Save tasks map as DbtToAirflowConverter property by
@internetcoffeephone and @hheemskerk in #1362
* Support rendering build operator task-id with non-ASCII characters by
@pankajastro in #1415

**Docs**

* Remove extra ` char from docs by @pankajastro in #1345
* Add limitation about copying target dir files to remote by @pankajkoti
in #1305
* Generalise example from README by @ReadytoRocc in #1311
* Add security policy by @tatiana, @chaosmaw and @lzdanski in #1385
* Mention in documentation that the callback functionality is supported
in ``ExecutionMode.VIRTUALENV`` by @pankajkoti in #1401

**Others**

* Restore Jaffle Shop so that ``basic_cosmos_dag`` works as documented
by @tatiana in #1374
* Remove Pytest durations from tests scripts by @tatiana in #1383
* Remove typing-extensions as dependency by @pankajastro in #1381
* Pin dbt-databricks version to < 1.9 by @pankajastro in #1376
* Refactor ``dbt-sqlite`` tests to use ``dbt-postgres`` by @pankajastro
in #1366
* Remove 'dbt-core<1.8.9' pin by @tatiana in #1371
* Remove dependency ``eval_type_backport`` by @tatiana in #1370
* Enable Kubernetes tests for dbt>=1.8 by @pankajastro in #1364
* CI Workaround: Pin dbt-core, Disable SQLite Tests, and Correctly
Ignore Clone Test to Pass CI by @pankajastro in #1337
* Enable Azure task in the remote store manifest example DAG by
@pankajkoti in #1333
* Enable GCP remote manifest task by @pankajastro in #1332
* Add exempt label option in GH action stale job by @pankajastro in
#1328
* Add integration test for source node rendering by @pankajastro in
#1327
* Fix vulnerability issue on docs dependency by @tatiana in #1313
* Add postgres pod status check for k8s tests in CI by @pankajkoti in
#1320
* [CI] Reduce the time taken to run tests in the CI from 5h to 11min
by @tatiana in #1297
* Enable secret detection precommit check by @pankajastro in #1302
* Fix security vulnerability, by not pinning Airflow 2.10.0 by @tatiana
in #1298
* Fix Netlify build timeouts by @tatiana in #1294
* Add stalebot to label/close stale PRs and issues by @tatiana in #1288
* Unpin dbt-databricks version by @pankajastro in #1409
* Fix source resource type tests by @pankajastro in #1405
* Increase performance tests models by @tatiana in #1403
* Drop running 1000 models in the CI by @pankajkoti in #1411
* Fix releasing package to PyPI by @tatiana in #1396
* Pre-commit hook updates in #1394, #1373, #1358, #1340, #1331, #1314,
#1301

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <pankaj.singh@astronomer.io>

Closes: #1193

---------

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <98807258+pankajastro@users.noreply.github.com>
Labels
area:config Related to configuration, like YAML files, environment variables, or executer configuration lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
5 participants