Add support to DBT_RUNNER execution mode #717

Closed
1 task
tatiana opened this issue Nov 28, 2023 · 1 comment · Fixed by #850
Assignees
Labels
  • area:config: Related to configuration, like YAML files, environment variables, or executor configuration
  • area:dependencies: Related to dependencies, like Python packages, library versions, etc.
  • area:execution: Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc.
  • dbt:run: Primarily related to the dbt run command or functionality
  • execution:local: Related to the Local execution environment
  • priority:medium: Medium priority issues are important issues that may have a workaround and medium impact

Comments

tatiana (Collaborator) commented Nov 28, 2023

Context

There was a great recommendation from @sanromeo in the #airflow-dbt Slack channel:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1701098801633179

The suggestion is to use dbtRunner (https://docs.getdbt.com/reference/programmatic-invocation) instead of a subprocess to execute dbt commands.

Historically, we decided not to adopt dbt-core as a dependency of Cosmos, to avoid dependency conflicts between Airflow and dbt-core:
https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes-local-conflicts.html#execution-modes-local-conflicts

However, as pointed out by @sanromeo, there are no more conflicts between dbt 1.7.0+ and Airflow 2.7.0+. So we could offer this as an alternative Cosmos ExecutionMode to users who are confident their dbt-core and Airflow versions do not conflict. If this approach is successful, we can also look into letting users apply the same strategy to LoadMode.DBT_LS.

Acceptance criteria

  • Allow users to use ExecutionConfig(execution_mode=ExecutionMode.DBT_RUNNER), which will not rely on a Python subprocess but will call dbt's dbtRunner directly
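
A minimal sketch of the difference between the two invocation styles, assuming dbt-core 1.5 or later is installed where the task runs. The helper names `run_with_subprocess` and `run_with_dbt_runner` are hypothetical and not part of Cosmos; only `dbtRunner().invoke(...)` and the `success` attribute of its result come from dbt's programmatic invocation API.

```python
import subprocess


def run_with_subprocess(dbt_args, dbt_executable="dbt"):
    """Spawn the dbt CLI in a child process; True means exit code 0."""
    completed = subprocess.run(
        [dbt_executable, *dbt_args], capture_output=True, text=True
    )
    return completed.returncode == 0


def run_with_dbt_runner(dbt_args):
    """Call dbt in the current process instead of spawning a child."""
    # Imported lazily so this module still loads where dbt-core is absent.
    from dbt.cli.main import dbtRunner

    result = dbtRunner().invoke(list(dbt_args))
    return result.success
```

Both helpers take the same argument list, for example `["run", "--select", "my_model"]` (a hypothetical model name), so switching between them does not change how the dbt command itself is built.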

dosubot bot commented Nov 28, 2023

🤖

It seems like you've got this under control. If you want help or have specific questions, let me know what I can do for you!

dosubot bot added the area:config, area:dependencies, area:execution, dbt:run, execution:local, and priority:medium labels Nov 28, 2023
jbandoro self-assigned this Feb 3, 2024
jbandoro added a commit that referenced this issue Feb 27, 2024
…850)

## Description

This PR adds `dbtRunner` programmatic invocation for
`ExecutionMode.LOCAL`. Rather than creating a new execution mode
(e.g. `ExecutionMode.LOCAL_DBT_RUNNER`) and duplicating all of the child
operators, I added a new config option,
`ExecutionConfig.invocation_mode`, which can be set to
`InvocationMode.DBT_RUNNER`. This way, users who already use the local
execution mode can switch to dbt runner and see performance improvements.
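
For users already on local execution, opting in might look like the fragment below. This is a sketch, not taken from the PR: the import paths (`cosmos` for `ExecutionConfig` and `ExecutionMode`, `cosmos.constants` for `InvocationMode`) are assumptions and should be checked against the released version.

```python
from cosmos import ExecutionConfig, ExecutionMode
from cosmos.constants import InvocationMode  # assumed import path

# Keep the local execution mode and only change how dbt is invoked.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.LOCAL,
    invocation_mode=InvocationMode.DBT_RUNNER,
)
```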

The `dbtRunnerResult` makes it easy to know whether the dbt run
succeeded, so the logs do not need to be parsed. They are still logged in
the operator:


![image](https://github.com/astronomer/astronomer-cosmos/assets/79104794/76a4cf82-f0f2-4133-8d68-a0a6a145b1d8)
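
As an illustration of reading the result instead of the logs, the sketch below maps a result object to a short status. The function name `describe_outcome` is hypothetical; it relies only on the `success` and `exception` attributes that dbt documents for `dbtRunnerResult`, and the interpretation of each case follows my reading of the dbt docs rather than the code in this PR.

```python
def describe_outcome(res):
    """Summarise a dbtRunnerResult-like object without parsing log text."""
    if res.exception is not None:
        # dbt could not complete the command (for example, invalid arguments).
        return f"error: {res.exception}"
    if not res.success:
        # The command completed, but at least one node failed.
        return "failed"
    return "success"


# Usage, assuming dbt-core >= 1.5 is installed:
#   from dbt.cli.main import dbtRunner
#   print(describe_outcome(dbtRunner().invoke(["run"])))
```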

## Performance Testing

After #827 was added, I modified it slightly to use the postgres adapter
instead of sqlite, because the latest dbt-core version supported for sqlite
is 1.4, while programmatic invocation requires >=1.5.0. I got the following
results comparing subprocess to dbt runner for 10 models:

1. `InvocationMode.SUBPROCESS`:
```shell
Ran 10 models in 23.77661895751953 seconds
NUM_MODELS=10
TIME=23.77661895751953
```
2. `InvocationMode.DBT_RUNNER`:
```shell
Ran 10 models in 8.390100002288818 seconds
NUM_MODELS=10
TIME=8.390100002288818
```

So using `InvocationMode.DBT_RUNNER` is almost 3x faster in this test
(about 23.8 s vs. 8.4 s), and can speed up DAG runs that contain many
models that each execute relatively quickly, since there seems to be a
1-2s speedup per task.
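
The arithmetic behind those figures can be checked directly from the two timings above. This is only a worked calculation; it assumes the saving is spread evenly across the 10 models, which the benchmark itself does not establish.

```python
subprocess_seconds = 23.77661895751953
dbt_runner_seconds = 8.390100002288818
num_models = 10

# Ratio of total run times, and the average time saved per model.
speedup = subprocess_seconds / dbt_runner_seconds
saved_per_model = (subprocess_seconds - dbt_runner_seconds) / num_models

print(f"speedup: {speedup:.2f}x")                   # speedup: 2.83x
print(f"saved per model: {saved_per_model:.2f} s")  # saved per model: 1.54 s
```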


One thing I found while working on this is that a
[manifest](https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects)
is stored in the result when you parse a project with the runner, and it
can be reused in subsequent commands to avoid reparsing. This could be a
useful way to cache the manifest if we use dbt runner for `dbt ls`
parsing, and could speed up the initial render as well.
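
A sketch of that reuse pattern, following the linked dbt documentation: parse once, then pass the resulting manifest to a new `dbtRunner`. The helper name `parse_once_then_run` is hypothetical, and this is not how Cosmos currently loads projects.

```python
def parse_once_then_run(commands):
    """Parse the project once, then run each command with the cached manifest."""
    # Imported lazily so this module still loads where dbt-core is absent.
    from dbt.cli.main import dbtRunner

    parse_result = dbtRunner().invoke(["parse"])
    if not parse_result.success:
        raise RuntimeError("dbt parse failed") from parse_result.exception

    # For the parse command, the result attribute holds the Manifest.
    manifest = parse_result.result

    # A runner created with a manifest skips parsing on later invocations.
    runner = dbtRunner(manifest=manifest)
    return [runner.invoke(list(args)).success for args in commands]
```

For example, `parse_once_then_run([["ls"], ["run"]])` would parse the project a single time and return one success flag per command.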


At first I thought it would be easy to make this work for
virtualenv execution as well, because I assumed the entire `execute`
method ran inside the virtualenv. That is not the case: the
virtualenv operator creates a virtualenv and then passes the executable
path to a subprocess. It may be possible to make this work for
virtualenv, but that would be better suited to a follow-up PR.
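
To make the limitation concrete, the sketch below shows the general shape of the virtualenv approach: the only link to the virtualenv is the path of its `dbt` executable, which is handed to a subprocess. The function `venv_dbt_command` and the directory layout are hypothetical and are not Cosmos's actual implementation. Because `dbtRunner` is imported into the current process, it would use whatever dbt-core is installed alongside Airflow rather than the one inside the virtualenv.

```python
import os
from pathlib import Path


def venv_dbt_command(venv_dir, dbt_args):
    """Build the argv that runs the dbt executable installed in a virtualenv."""
    # Virtualenvs place executables in "Scripts" on Windows and "bin" elsewhere.
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    dbt_path = Path(venv_dir) / bin_dir / "dbt"
    return [str(dbt_path), *dbt_args]


# The resulting list is what would be passed to subprocess.run(...).
command = venv_dbt_command("/opt/venvs/dbt", ["run"])
```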

## Related Issue(s)

closes #717 

## Breaking Change?

None

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works - added unit tests and integration tests.
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
…stronomer#850)
