Add support to DBT_RUNNER execution mode #717
Labels
area:config
Related to configuration, like YAML files, environment variables, or executer configuration
area:dependencies
Related to dependencies, like Python packages, library versions, etc
area:execution
Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc
dbt:run
Primarily related to dbt run command or functionality
execution:local
Related to Local execution environment
priority:medium
Medium priority issues are important issues that may have a workaround and medium impact
Comments
🤖 It seems like you've got this under control, if you want help or have specific questions, let me know what I can do for you!
jbandoro added a commit that referenced this issue Feb 27, 2024
…850)

## Description

This PR adds `dbtRunner` programmatic invocation for `ExecutionMode.LOCAL`. I decided not to make a new execution mode for each (e.g. `ExecutionMode.LOCAL_DBT_RUNNER`) and all of the child operators, but instead added an additional config, `ExecutionConfig.invocation_mode`, where `InvocationMode.DBT_RUNNER` can be specified. This way, users who are already using local execution mode can use dbt runner and see performance improvements.

The `dbtRunnerResult` makes it easy to know whether the dbt run was successful. Logs do not need to be parsed, but they are still logged in the operator.

## Performance Testing

After #827 was added, I modified it slightly to use the postgres adapter instead of sqlite, because the latest dbt-core version supported for sqlite is 1.4, while programmatic invocation requires >=1.5.0. I got the following results comparing subprocess to dbt runner for 10 models:

1. `InvocationMode.SUBPROCESS`:

   ```shell
   Ran 10 models in 23.77661895751953 seconds
   NUM_MODELS=10
   TIME=23.77661895751953
   ```

2. `InvocationMode.DBT_RUNNER`:

   ```shell
   Ran 10 models in 8.390100002288818 seconds
   NUM_MODELS=10
   TIME=8.390100002288818
   ```

So using `InvocationMode.DBT_RUNNER` is almost 3x faster, and it can speed up DAG runs that contain many models that execute relatively quickly, since there seems to be a 1-2s speed-up per task.

One thing I found while working on this is that a [manifest](https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects) is stored in the result if you parse a project with the runner, and it can be reused in subsequent commands to avoid reparsing. This could be a useful way of caching the manifest if we use dbt runner for dbt ls parsing, and it could speed up the initial render as well.
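The "almost 3x" and "1-2s per task" figures can be checked directly from the two timings reported in the performance section. The snippet below is only a worked calculation on those two numbers; the variable names are illustrative and not part of Cosmos.

```python
# Timings reported above for 10 models, in seconds.
subprocess_seconds = 23.77661895751953
dbt_runner_seconds = 8.390100002288818
num_models = 10

# Ratio of the two wall-clock times.
speedup = subprocess_seconds / dbt_runner_seconds

# Average time saved per model, assuming the saving is spread evenly.
saved_per_model = (subprocess_seconds - dbt_runner_seconds) / num_models

print(f"speed-up: {speedup:.2f}x")                # speed-up: 2.83x
print(f"saved per model: {saved_per_model:.2f}s")  # saved per model: 1.54s
```

Since this is a single benchmark run with one adapter, the per-model saving is best read as a rough indication rather than a guaranteed figure.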
I thought at first it would be easy to have this also work for virtualenv execution, since I assumed the entire `execute` method was run in the virtualenv. That is not the case: the virtualenv operator creates a virtualenv and then passes the executable path to a subprocess. It may be possible to have this work for virtualenv, but that would be better suited for a follow-up PR.

## Related Issue(s)

closes #717

## Breaking Change?

None

## Checklist

- [x] I have made corresponding changes to the documentation (if required)
- [x] I have added tests that prove my fix is effective or that my feature works - added unit tests and integration tests.
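For readers unfamiliar with dbt's programmatic invocation, the result handling and manifest reuse described in this PR might look roughly like the sketch below. It assumes dbt-core >= 1.5 is installed wherever `invoke_dbt` is actually called. The helper names (`invoke_dbt`, `describe`, `parse_then_run`) are hypothetical and are not Cosmos code. Only `dbtRunner`, its `manifest` argument, `invoke`, and the `success`, `exception`, and `result` attributes come from dbt.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def invoke_dbt(cli_args: List[str], manifest: Optional[Any] = None) -> Any:
    """Run a dbt command in-process and return its dbtRunnerResult."""
    # Imported lazily so this module still loads where dbt-core is absent.
    from dbt.cli.main import dbtRunner

    # Passing a previously parsed manifest lets dbt skip reparsing the project.
    return dbtRunner(manifest=manifest).invoke(cli_args)


def describe(result: Any) -> str:
    """Summarise a result using only its `success` and `exception` attributes."""
    if result.success:
        return "dbt command succeeded"
    if result.exception is not None:
        return f"dbt raised an exception: {result.exception}"
    return "dbt command failed (for example, a model or test failed)"


def parse_then_run(project_dir: str) -> str:
    """Parse once, then reuse the manifest for a subsequent `dbt run`."""
    parsed = invoke_dbt(["parse", "--project-dir", project_dir])
    if not parsed.success:
        return describe(parsed)
    # For the `parse` command, `result` holds the manifest.
    run = invoke_dbt(["run", "--project-dir", project_dir], manifest=parsed.result)
    return describe(run)


# Stand-in result object, so the summary logic can be exercised without dbt.
fake_failure = SimpleNamespace(success=False, exception=None, result=None)
print(describe(fake_failure))
```

No log parsing is needed to decide whether the command succeeded, which is the benefit the description attributes to `dbtRunnerResult`.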
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
Context
There was a great recommendation from @sanromeo in the #airflow-dbt Slack channel: https://apache-airflow.slack.com/archives/C059CC42E9W/p1701098801633179

The suggestion is to use `dbtRunner` (https://docs.getdbt.com/reference/programmatic-invocation) instead of `subprocess` for running (executing) dbt commands.

Historically, we decided not to adopt `dbt-core` as a dependency of Cosmos, to avoid conflicts between Airflow and `dbt-core`: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes-local-conflicts.html#execution-modes-local-conflicts

However, as pointed out by @sanromeo, there are no more conflicts between dbt 1.7.0+ and Airflow 2.7.0+. So we could offer this as an alternative Cosmos `ExecutionMode` to users who are confident their dbt-core and Airflow versions do not conflict. If this approach is successful, we can also look into allowing users to use the same strategy in `LoadMode.DBT_LS`.

Acceptance criteria
`ExecutionConfig(execution_mode=ExecutionMode.DBT_RUNNER)`, which will not rely on Python `subprocess` but will call dbt's `dbtRunner`
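The choice this criterion describes, between spawning a `dbt` subprocess and calling `dbtRunner` in-process, can be sketched as a small dispatch function. This is a minimal illustration, not the Cosmos implementation: the `InvocationMode` enum below is a local stand-in defined for the example, and `run_dbt` and its `dbt_executable` parameter are hypothetical. The in-process branch assumes dbt-core >= 1.5 is installed in the same environment as Airflow.

```python
import subprocess
import sys
from enum import Enum
from typing import List


class InvocationMode(Enum):
    """Local stand-in enum for this example only (not imported from Cosmos)."""

    SUBPROCESS = "subprocess"
    DBT_RUNNER = "dbt_runner"


def run_dbt(args: List[str], mode: InvocationMode, dbt_executable: str = "dbt") -> bool:
    """Return True when the dbt command succeeds."""
    if mode is InvocationMode.DBT_RUNNER:
        # In-process call: no new interpreter is started for each task.
        from dbt.cli.main import dbtRunner

        return bool(dbtRunner().invoke(args).success)

    # Subprocess call: success is inferred from the exit code.
    completed = subprocess.run([dbt_executable, *args], check=False)
    return completed.returncode == 0


# Demo of the subprocess branch, using the Python interpreter as a stand-in
# executable so the example runs without dbt installed.
ok = run_dbt(["-c", "raise SystemExit(0)"], InvocationMode.SUBPROCESS, dbt_executable=sys.executable)
failed = run_dbt(["-c", "raise SystemExit(1)"], InvocationMode.SUBPROCESS, dbt_executable=sys.executable)
print(ok, failed)  # True False
```

Note that the PR referenced in this thread exposes the choice as `ExecutionConfig.invocation_mode` on local execution rather than as a separate `ExecutionMode.DBT_RUNNER`.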