Add support to TestBehavior.BUILD #1377

Merged
merged 7 commits into from
Dec 12, 2024

Collaborator

@tatiana tatiana commented Dec 10, 2024

By default, Cosmos uses TestBehavior.AFTER_EACH, creating an Airflow TaskGroup that contains two tasks:

  • one to run the model, seed or snapshot
  • another to run the tests related to that dbt resource

While many users want and expect this behaviour, it adds overhead, especially in dbt projects with more than 500 models. Every dbt command invocation has a fixed cost, even with optimisations such as partial parsing and dbtRunner. Splitting the work for one resource across multiple Airflow workers adds further overhead.

To illustrate, here are numbers shared by an Astronomer customer for the dbt command execution time (measured between the log lines "running dbt with arguments" and "Done."):

  • Running dbt build for a particular model + its tests: 46s
  • Running dbt run + dbt test individually: 2min15s
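
As a quick back-of-the-envelope check of the two durations above, the following sketch computes the time saved and the relative speed-up. It only restates the reported figures, which come from a single customer data point, so the ratio should not be read as a general benchmark; the variable names are illustrative.

```python
# Durations reported in the PR description, converted to seconds.
build_seconds = 46                    # dbt build for one model + its tests
run_then_test_seconds = 2 * 60 + 15   # dbt run + dbt test executed separately

saved_seconds = run_then_test_seconds - build_seconds
speedup = run_then_test_seconds / build_seconds
reduction_pct = 100 * saved_seconds / run_then_test_seconds

print(f"saved {saved_seconds}s, {speedup:.1f}x faster, {reduction_pct:.0f}% less time")
# prints: saved 89s, 2.9x faster, 66% less time
```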

This PR introduces a new behaviour, TestBehavior.BUILD, where Cosmos runs both the model/seed/snapshot and its associated tests using a single command (dbt build). For documentation on dbt build, see https://docs.getdbt.com/reference/commands/build.
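
To make the difference concrete, here is a minimal, self-contained sketch of how the two behaviours translate into dbt invocations for a single model. It is not the Cosmos implementation: `SketchTestBehavior` and `dbt_invocations` are hypothetical names defined only for this illustration, and the argument lists are simplified versions of the one shown in the log output in this PR description.

```python
from enum import Enum


class SketchTestBehavior(Enum):
    """Local stand-in for the two behaviours discussed here; not imported from Cosmos."""

    AFTER_EACH = "after_each"
    BUILD = "build"


def dbt_invocations(model: str, behavior: SketchTestBehavior) -> list[list[str]]:
    """Return the dbt argument lists needed for one model (hypothetical helper)."""
    if behavior is SketchTestBehavior.BUILD:
        # One invocation: dbt build runs the model and then its tests.
        return [["build", "--models", model]]
    # Two invocations: one to run the model, another to run its tests.
    return [["run", "--models", model], ["test", "--models", model]]


for behavior in SketchTestBehavior:
    calls = dbt_invocations("customers", behavior)
    print(behavior.name, len(calls), calls)
```

With AFTER_EACH, the fixed start-up cost of dbt is paid twice per model; with BUILD it is paid once, which is where the saving described above comes from.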

This is how the DAG renders with this test behaviour when running:

 airflow dags test example_cosmos_dbt_build 
[Screenshot: DAG rendered with TestBehavior.BUILD]

And this is an example of the output, showing both the model and its tests being run by the build command:

[2024-12-10 15:19:23,667] {local.py:405} INFO - Trying to run dbtRunner with:
 ['build', '--models', 'customers', '--full-refresh', '--project-dir', '/var/folders/td/522y78v91d1f5wgh67mj3p0m0000gn/T/tmpghz8naek', '--profiles-dir', '/tmp/profile/ac4e9cde9bc05d574c157e795dcbcc6b60246a73ca1d92d4fc669e90a1e494e0', '--profile', 'default', '--target', 'dev']
 in /var/folders/td/522y78v91d1f5wgh67mj3p0m0000gn/T/tmpghz8naek
[2024-12-10T15:19:23.667+0000] {local.py:405} INFO - Trying to run dbtRunner with:
 ['build', '--models', 'customers', '--full-refresh', '--project-dir', '/var/folders/td/522y78v91d1f5wgh67mj3p0m0000gn/T/tmpghz8naek', '--profiles-dir', '/tmp/profile/ac4e9cde9bc05d574c157e795dcbcc6b60246a73ca1d92d4fc669e90a1e494e0', '--profile', 'default', '--target', 'dev']
 in /var/folders/td/522y78v91d1f5wgh67mj3p0m0000gn/T/tmpghz8naek
15:19:23  Running with dbt=1.8.0
15:19:23  Registered adapter: postgres=1.8.0
15:19:23  Found 5 models, 3 seeds, 20 data tests, 528 macros
15:19:23  
15:19:23  Concurrency: 1 threads (target='dev')
15:19:23  
15:19:23  1 of 4 START sql table model public.customers .................................. [RUN]
15:19:23  1 of 4 OK created sql table model public.customers ............................. [SELECT 100 in 0.04s]
15:19:23  2 of 4 START test not_null_customers_customer_id ............................... [RUN]
15:19:23  2 of 4 PASS not_null_customers_customer_id ..................................... [PASS in 0.02s]
15:19:23  3 of 4 START test relationships_orders_customer_id__customer_id__ref_customers_  [RUN]
15:19:23  3 of 4 PASS relationships_orders_customer_id__customer_id__ref_customers_ ...... [PASS in 0.02s]
15:19:23  4 of 4 START test unique_customers_customer_id ................................. [RUN]
15:19:23  4 of 4 PASS unique_customers_customer_id ....................................... [PASS in 0.02s]
15:19:23  
15:19:23  Finished running 1 table model, 3 data tests in 0 hours 0 minutes and 0.19 seconds (0.19s).
15:19:24  
15:19:24  Completed successfully
15:19:24  
15:19:24  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

Closes: #892


cloudflare-workers-and-pages bot commented Dec 10, 2024

Deploying astronomer-cosmos with  Cloudflare Pages  Cloudflare Pages

Latest commit: 988ae35
Status: ✅  Deploy successful!
Preview URL: https://3fa1c3fd.astronomer-cosmos.pages.dev
Branch Preview URL: https://issue-892.astronomer-cosmos.pages.dev


@tatiana tatiana marked this pull request as ready for review December 11, 2024 16:32
@dosubot dosubot bot added the size:L, area:performance, area:rendering and dbt:build labels Dec 11, 2024
Collaborator

@dwreeves dwreeves left a comment

The code makes sense to me, so giving it a LGTM. Thanks for taking this on @tatiana!


A thought as I read this, for another day though:

dbt 1.8+ has a new concept called unit tests, which it differentiates from data tests (formerly just "tests"). dbt's preferred way of running things is: unit tests -> build model -> data tests.

For the TestBehavior.BUILD, we comport with how dbt wants to run things 👍 because the order of operations is resolved automatically by dbt build.

We don't currently handle dbt unit tests for TestBehavior.AFTER_ALL and TestBehavior.AFTER_EACH though. 🤔 A natural way to do this would be to have AFTER_ALL also mean "before all" when it comes to unit tests, and similarly AFTER_EACH means "before each" for unit tests.

Alternatively, these could be decoupled; i.e. "before each, unit test" and "after all, data test," similarly "before all, unit test" and "after each, data test." But for TestBehavior.BUILD, decoupling gets weird. 🤷

No actionable item here other than, potentially, to mention in the docs that TestBehavior.BUILD is currently the only way to run dbt unit tests using Cosmos's automatic graph parsing. Just thinking out loud about the future.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 11, 2024
Contributor

@pankajkoti pankajkoti left a comment

LGTM. Amazing support added and nice optimization. Happy to merge once #1374 is merged & this branch is rebased, and the CI passes after.


codecov bot commented Dec 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.24%. Comparing base (2fa5d01) to head (988ae35).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1377   +/-   ##
=======================================
  Coverage   96.23%   96.24%           
=======================================
  Files          67       67           
  Lines        4042     4051    +9     
=======================================
+ Hits         3890     3899    +9     
  Misses        152      152           


@tatiana
Collaborator Author

tatiana commented Dec 12, 2024

Thanks a lot for the review and feedback, @dwreeves @pankajastro @pankajkoti!

I added a follow-up action to review what @dwreeves mentioned (unit tests x data tests split): #1386. I tentatively added this to Cosmos 2.x, but we may want to handle it sooner.

@tatiana tatiana merged commit 110fb07 into main Dec 12, 2024
64 checks passed
@tatiana tatiana deleted the issue-892 branch December 12, 2024 11:56
@tatiana tatiana added this to the Cosmos 1.8.0 milestone Dec 16, 2024
@tatiana tatiana restored the issue-892 branch December 16, 2024 14:19
@tatiana tatiana deleted the issue-892 branch December 16, 2024 14:20
@tatiana tatiana mentioned this pull request Dec 17, 2024
tatiana added a commit that referenced this pull request Dec 20, 2024
**New Features**

* Support customizing Airflow operator arguments per dbt node by @wornjs
in #1339. [More
information](https://astronomer.github.io/astronomer-cosmos/getting_started/custom-airflow-properties.html).
* Support uploading dbt artifacts to remote cloud storages via callback
by @pankajkoti in #1389. [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/callbacks.html).
* Add support to ``TestBehavior.BUILD`` by @tatiana in #1377.
[Documentation](https://astronomer.github.io/astronomer-cosmos/configuration/testing-behavior.html).
* Add support for the "at" operator when using ``LoadMode.DBT_MANIFEST``
or ``CUSTOM`` by @benjy44 in #1372
* Add dbt clone operator by @pankajastro in #1326, as documented in
[here](https://astronomer.github.io/astronomer-cosmos/getting_started/operators.html).
* Support rendering tasks with non-ASCII characters by @t0momi219 in
#1278 [Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/task-display-name.html)
* Add warning callback on source freshness by @pankajastro in #1400
[Read
more](https://astronomer.github.io/astronomer-cosmos/configuration/source-nodes-rendering.html#on-warning-callback-callback)
* Add Oracle Profile mapping by @slords and @pankajkoti in #1190 and
#1404
* Emit telemetry to Scarf during DAG run by @tatiana in #1397
* Save tasks map as ``DbtToAirflowConverter`` property by
@internetcoffeephone and @hheemskerk in #1362

**Bug Fixes**

* Fix the mock value of port in ``TrinoBaseProfileMapping`` to be an
integer by @dwolfeu #1322
* Fix access to the ``dbt docs`` menu item outside of Astro cloud by
@tatiana in #1312
* Add missing ``DbtSourceGcpCloudRunJobOperator`` in module
``cosmos.operators.gcp_cloud_run_job`` by @anai-s in #1290
* Support building ``DbtDag`` without setting paths in ``ProjectConfig``
by @tatiana in #1307
* Fix parsing dbt ls outputs that contain JSONs that are not dbt nodes
by @tatiana in #1296
* Fix Snowflake Profile mapping when using AWS default region by
@tatiana in #1406
* Fix dag rendering for taskflow + DbtTaskGroup combo by @pankajastro in
#1360

**Enhancements**

* Improve dbt command execution logs to troubleshoot ``None`` values by
@tatiana in #1392
* Add logging of stdout to dbt graph run_command by @KarolGongola in
#1390
* Save tasks map as DbtToAirflowConverter property by
@internetcoffeephone and @hheemskerk in #1362
* Support rendering build operator task-id with non-ASCII characters by
@pankajastro in #1415

**Docs**

* Remove extra ` char from docs by @pankajastro in #1345
* Add limitation about copying target dir files to remote by @pankajkoti
in #1305
* Generalise example from README by @ReadytoRocc in #1311
* Add security policy by @tatiana, @chaosmaw and @lzdanski in #1385
* Mention in documentation that the callback functionality is supported
in ``ExecutionMode.VIRTUALENV`` by @pankajkoti in #1401

**Others**

* Restore Jaffle Shop so that ``basic_cosmos_dag`` works as documented
by @tatiana in #1374
* Remove Pytest durations from tests scripts by @tatiana in #1383
* Remove typing-extensions as dependency by @pankajastro in #1381
* Pin dbt-databricks version to < 1.9 by @pankajastro in #1376
* Refactor ``dbt-sqlite`` tests to use ``dbt-postgres`` by @pankajastro
in #1366
* Remove 'dbt-core<1.8.9' pin by @tatiana in #1371
* Remove dependency ``eval_type_backport`` by @tatiana in #1370
* Enable kubernetes tests for dbt>=1.8 by @pankajastro #1364
* CI Workaround: Pin dbt-core, Disable SQLite Tests, and Correctly
Ignore Clone Test to Pass CI by @pankajastro in #1337
* Enable Azure task in the remote store manifest example DAG by
@pankajkoti in #1333
* Enable GCP remote manifest task by @pankajastro in #1332
* Add exempt label option in GH action stale job by @pankajastro in
#1328
* Add integration test for source node rendering by @pankajastro in
#1327
* Fix vulnerability issue on docs dependency by @tatiana in #1313
* Add postgres pod status check for k8s tests in CI by @pankajkoti in
#1320
* [CI] Reduce the amount taking to run tests in the CI from 5h to 11min
by @tatiana in #1297
* Enable secret detection precommit check by @pankajastro in #1302
* Fix security vulnerability, by not pinning Airflow 2.10.0 by @tatiana
in #1298
* Fix Netlify build timeouts by @tatiana in #1294
* Add stalebot to label/close stale PRs and issues by @tatiana in #1288
* Unpin dbt-databricks version by @pankajastro in #1409
* Fix source resource type tests by @pankajastro in #1405
* Increase performance tests models by @tatiana in #1403
* Drop running 1000 models in the CI by @pankajkoti in #1411
* Fix releasing package to PyPI by @tatiana in #1396
* Pre-commit hook updates in #1394, #1373, #1358, #1340, #1331, #1314,
#1301

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <pankaj.singh@astronomer.io>

Closes: #1193

---------

Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Pankaj Singh <98807258+pankajastro@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[Feature request] Build operator as a test behavior: TestBehavior.BUILD
4 participants