[CT-3569] [Feature] Add a --favor-state-selector flag that supports node selection syntax #9410

Closed
3 tasks done
d-cole opened this issue Jan 19, 2024 · 10 comments
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@d-cole
Contributor

d-cole commented Jan 19, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

It is often desirable to defer some but not all upstream parents. The current behaviour of --favor-state does not easily allow for node-specific selection. For example, consider the following scenario:

model_a ----\
             \
              --> model_c 
             /
model_b ----/
  • A user is adding model_c and making modifications to model_b.
  • model_a and model_b both exist in the prod and dev target.
  • The user would like to execute model_c (dev target) using the inputs model_b (dev target) and model_a (prod target).

Alternatives:

  • dbt run -s model_c --defer --favor-state --state prod-run-artifacts does not work as it will select model_b (prod) unless model_b (prod) is dropped.
  • dbt run -s model_c --defer --state prod-run-artifacts cannot achieve this behaviour without first dropping model_a (dev). While the cost of dropping model_a (dev) seems trivial in this example, it becomes increasingly cumbersome as the number of inputs grows. This requires the user to spend time to identify what upstreams exist in their dev target and to drop them.
  • dbt run -s model_b+ --defer --favor-state --state prod-run-artifacts will work but requires model_b and model_a to be executed in the same command. This can cause issues when the upstream dev target models have a long runtime. Additionally this can waste compute if the user has already recently run model_b in their dev target.

Desired Behaviour:

This experience could be improved if defer supported node selection for favoring state. For the example above, it could look like the command below. Ideally, --favor-state-selector would support the full node selection syntax.

dbt run -s model_c --defer --favor-state-selector model_a --state prod-run-artifacts
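
To make the difference between the flags concrete, here is a minimal Python sketch of the resolution rules as they are described in this issue. It is a toy model rather than dbt-core code: `RefContext`, `resolve_ref`, and the `favor_state_selector` field are hypothetical names, and the selector semantics are only what this request proposes.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RefContext:
    selected: Set[str]                                   # nodes picked with -s / --select
    built_in_dev: Set[str] = field(default_factory=set)  # relations that already exist in dev
    defer: bool = False                                  # --defer
    favor_state: bool = False                            # --favor-state
    favor_state_selector: Optional[Set[str]] = None      # proposed flag (hypothetical)


def resolve_ref(name: str, ctx: RefContext) -> str:
    """Return 'dev' or 'prod' for the target that a ref to `name` resolves to."""
    # Without --defer, or for nodes built in this invocation, always use dev.
    if not ctx.defer or name in ctx.selected:
        return "dev"
    if ctx.favor_state_selector is not None:
        # Proposed behaviour: only the listed nodes are forced to the state target.
        if name in ctx.favor_state_selector:
            return "prod"
    elif ctx.favor_state:
        # --favor-state: every unselected upstream comes from the state target.
        return "prod"
    # Plain --defer: fall back to state only when the relation is missing in dev.
    return "dev" if name in ctx.built_in_dev else "prod"


dev_relations = {"model_a", "model_b"}
proposed = RefContext({"model_c"}, dev_relations, defer=True,
                      favor_state_selector={"model_a"})
print(resolve_ref("model_a", proposed), resolve_ref("model_b", proposed))
```

Under these assumptions, only the selector variant yields model_a from prod and model_b from dev without dropping or rebuilding anything.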

Who will this benefit?

Users with large models that take a long time and are expensive to recompute.

Are you interested in contributing this feature?

Yes

@d-cole d-cole added enhancement New feature or request triage labels Jan 19, 2024
@github-actions github-actions bot changed the title [Feature] <title> [CT-3569] [Feature] <title> Jan 19, 2024
@dbeatty10 dbeatty10 changed the title [CT-3569] [Feature] <title> [CT-3569] [Feature] Add a --favor-state-selector flag that supports node selection syntax Jan 19, 2024
@d-cole d-cole changed the title [CT-3569] [Feature] Add a --favor-state-selector flag that supports node selection syntax [CT-3569] [Feature] Improved Defer Experience - Favor State Node Selection Jan 19, 2024
@d-cole d-cole changed the title [CT-3569] [Feature] Improved Defer Experience - Favor State Node Selection [CT-3569] [Feature] Add a --favor-state-selector flag that supports node selection syntax Jan 22, 2024
@graciegoheen
Contributor

Hi! Thanks for opening this issue. I'm thinking through this, and wondering if you can use the state:modified selector to achieve the desired behavior.

If you were to run dbt run --select state:modified --defer --favor-state --state prod-run-artifacts, that would:

  • state:modified -> select the models that you've modified when comparing dev to prod, so this would select just model_c and model_b
  • favor-state -> dbt will favor using the node defined in your --state namespace, even if the node exists in the current target.

Can you give that a go, and let me know what you think?

@d-cole
Contributor Author

d-cole commented Jan 24, 2024

Hi! Thanks for opening this issue. I'm thinking through this, and wondering if you can use the state:modified selector to achieve the desired behavior.

If you were to run dbt run --select state:modified --defer --favor-state --state prod-run-artifacts, that would:

  • state:modified -> select the models that you've modified when comparing dev to prod, so this would select just model_c and model_b
  • favor-state -> dbt will favor using the node defined in your --state namespace, even if the node exists in the current target.

Can you give that a go, and let me know what you think?

Hi Grace, thank you for taking a look! I believe using --select state:modified here will be equivalent to the -s model_b+ option:

  • dbt run -s model_b+ --defer --favor-state --state prod-run-artifacts will work but requires model_b and model_a to be executed in the same command. This can cause issues when the upstream dev target models have a long runtime. Additionally this can waste compute if the user has already recently run model_b in their dev target.

It does achieve the desired defer selection behaviour but compared to a --favor-state-selector has two downsides:

  1. It requires running model_b and model_c in the same invocation. If I have already created model_b in my dev space this will mean rerunning the model and wasting compute and time. If the model runtime is sufficiently long it would be faster to drop model_a (dev) and use defer without the --favor-state.
  2. It is less flexible as it requires exactly the nodes you don't want to defer to be modified in your current branch. This wouldn't work in two scenarios:
    2.1 model_b was edited in a different branch from model_c to make PR reviews easier.
    2.2 All three models were edited, but I still wanted to use model_a (prod) while testing model_c. This could be desirable during reconciliation, to compare a run of model_c (dev) with input model_a (prod) against a run of model_c (dev) with input model_a (dev).

@dbeatty10
Contributor

Jumping in to offer an idea with one more layer -- wanna give it a shot @d-cole ?

Producing multi-step artifacts

Let's say you want to:

  1. Get references for model_b and model_c from the dev target and everything else from the prod target.
  2. Build model_c only

Then you can do the following:

  1. Craft a new state that is a hybrid of multiple targets:
dbt compile --select model_b+ --defer --favor-state --state prod-run-artifacts --target dev
  2. Save it as a new state:
rm -rf hybrid-run-artifacts
cp -r target hybrid-run-artifacts
  3. Favor it:
dbt run --select model_c --defer --favor-state --state hybrid-run-artifacts --target dev

For me, this gives:

prod.model_a ----\
                  \
                   --> dev.model_c 
                  /
 dev.model_b ----/

You can repeat this approach successively with different selectors and targets to create an N-layer burrito with all your desired ingredients.

@dbeatty10
Contributor

Here's a tangible example that uses the --target-path flag to store state.

First, suppose we have a model_d that depends on a, b, and c:

models/model_d.sql

-- model_d: {{ this }}
-- depends on:
--   model_a: {{ ref("model_a") }}
--   model_b: {{ ref("model_b") }}
--   model_b: {{ ref("model_c") }}

select 1 as id

We can layer a series of target states like this to mix-n-match where the references are coming from:

dbt compile --target prod --target-path prod-run-artifacts
dbt compile --select model_b+ --defer --favor-state --state prod-run-artifacts --target dev1 --target-path hybrid1-run-artifacts
dbt compile --select model_c  --defer --favor-state --state hybrid1-run-artifacts --target dev2 --target-path hybrid2-run-artifacts
dbt compile --select model_d  --defer --favor-state --state hybrid2-run-artifacts --target dev2

The final command gives the following output:

12:55:10  Compiled node 'model_d' is:
-- model_d: "db"."feature_2"."model_d"
-- depends on:
--   model_a: "db"."prod"."model_a"
--   model_b: "db"."feature_1"."model_b"
--   model_b: "db"."feature_2"."model_c"

select 1 as id
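
The layering above can be simulated with a small Python sketch. It assumes the pre-v1.8 behaviour in which an unselected node that exists in the state manifest is replaced by the state entry. The `compile_with_favor_state` helper is hypothetical, the mapping of targets dev1 and dev2 to schemas feature_1 and feature_2 is inferred from the output shown, and model_b+ is assumed to expand to model_b and model_d (the end result is the same if model_c is also downstream of model_b).

```python
from typing import Dict, Set

Manifest = Dict[str, str]  # node name -> schema its relation points at

NODES = {"model_a", "model_b", "model_c", "model_d"}


def compile_with_favor_state(state: Manifest, selected: Set[str],
                             target_schema: str) -> Manifest:
    """Toy model of `dbt compile --defer --favor-state --state <state>`.

    Unselected nodes present in `state` keep the location recorded there;
    everything else resolves to the current target schema.
    """
    result: Manifest = {}
    for name in NODES:
        if name not in selected and name in state:
            result[name] = state[name]
        else:
            result[name] = target_schema
    return result


# dbt compile --target prod --target-path prod-run-artifacts
prod = {name: "prod" for name in NODES}
# dbt compile --select model_b+ ... --target dev1 --target-path hybrid1-run-artifacts
hybrid1 = compile_with_favor_state(prod, {"model_b", "model_d"}, "feature_1")
# dbt compile --select model_c ... --target dev2 --target-path hybrid2-run-artifacts
hybrid2 = compile_with_favor_state(hybrid1, {"model_c"}, "feature_2")
# dbt compile --select model_d ... --target dev2
final = compile_with_favor_state(hybrid2, {"model_d"}, "feature_2")

for name in sorted(final):
    print(name, "->", final[name])
```

Each pass only changes the nodes it selects, which is why earlier choices survive into later layers in this model.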

@d-cole
Contributor Author

d-cole commented Jan 30, 2024

Hey @dbeatty10, thank you for the detailed response. I didn't know you could create an n-layer burrito like that, it is really cool! This approach does allow for the described node selection without any of the downsides of the alternatives mentioned above.

However, a --favor-state-selector does seem like a simpler way to achieve the same behaviour. The favor-state-selector is much closer to what many users are used to (e.g. swapping table refs in a SQL query or input paths in a python job).

dbt compile --target prod --target-path prod-run-artifacts
dbt compile --select model_b+ --defer --favor-state --state prod-run-artifacts --target dev1 --target-path hybrid1-run-artifacts
dbt run --select model_c --defer --favor-state --state hybrid1-run-artifacts --target dev

vs.

dbt compile --target prod --target-path prod-run-artifacts
dbt run -s model_c --defer --favor-state-selector model_a --state prod-run-artifacts

I understand if this isn't prioritized as it is not new functionality. No worries about that, thanks again for teaching me about the n-layer burrito!

@dbeatty10
Contributor

@d-cole Thanks for describing a tricky scenario so well and exploring each of the edge cases to consider 🧠

Indeed, the proposed --favor-state-selector would be a little less verbose than layering via --target-path.

But since we can already support the end goal with current functionality, I'm going to close this as "not planned".

@jtcohen6
Contributor

jtcohen6 commented May 8, 2024

@dbeatty10 I'm sorry to say that I've removed the possibility of the n-layer burrito, with some recent changes we made to rationalize the behavior of deferral (and resolve a thorny bug with unit tests):

In older versions of dbt-core, when a node was deferred, its manifest entry was completely overwritten with the node from the state manifest. In dbt Core v1.8+, rather than completely overwriting the node, we're simply going to add an attribute (defer_relation), and determine at runtime whether to resolve ref to the referenced node's database.schema.identifier or the one specified in its defer_relation.
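
The two behaviours described here can be contrasted in a short Python sketch. This is an illustration only, not the dbt-core implementation: `Relation`, `Node`, `defer_pre_1_8`, `defer_1_8`, and `resolve` are hypothetical names, and only the `defer_relation` attribute comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    database: str
    schema: str
    identifier: str

    def render(self) -> str:
        return f"{self.database}.{self.schema}.{self.identifier}"


@dataclass
class Node:
    name: str
    relation: Relation
    defer_relation: Optional[Relation] = None


def defer_pre_1_8(node: Node, state_node: Node) -> Node:
    """Older behaviour: the manifest entry is replaced by the state node."""
    return state_node


def defer_1_8(node: Node, state_node: Node) -> Node:
    """v1.8+ behaviour: keep the node and attach the state location to it."""
    return Node(node.name, node.relation, defer_relation=state_node.relation)


def resolve(node: Node, use_deferred: bool) -> str:
    """Pick the relation at runtime, as v1.8+ is described as doing."""
    if use_deferred and node.defer_relation is not None:
        return node.defer_relation.render()
    return node.relation.render()


dev_a = Node("model_a", Relation("db", "dev", "model_a"))
prod_a = Node("model_a", Relation("db", "prod", "model_a"))

old = defer_pre_1_8(dev_a, prod_a)
new = defer_1_8(dev_a, prod_a)
print(old.relation.render(), new.relation.render(), resolve(new, True))
```

In this model the v1.8+ node keeps its dev location as its primary relation, so a manifest saved from such a run would no longer carry the prod location forward as the node itself. That appears to be why successive layering stops working.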

(cc @MichelleArk - this is what we were talking about a few weeks ago)

Going back to the original example:

  • The problem is that model_a and model_b both exist in the dev schema, and are potentially expensive to recompute.
  • You want to use model_b, while ignoring the existence of model_a. (Why exactly? Is it lingering from previous unrelated work on a different branch?)

Rather than dropping specific tables (difficult to manage and potentially expensive to recompute), I wonder if a more straightforward approach might be to switch target schemas, and use dbt clone to copy the specific resources you want from prod / your previous dev schema (and as an alternative to dropping specific models). That's still a form of "full control" state management, but it feels easier to reason about (IMO) than juggling specific relations in --favor-state.

@d-cole
Contributor Author

d-cole commented May 8, 2024

@jtcohen6

Going back to the original example:
You want to use model_b, while ignoring the existence of model_a. (Why exactly? Is it lingering from previous unrelated work on a different branch?)

There are two scenarios that come to mind.

  1. When multiple users are developing in the same environment, dropping A isn't always feasible as you don't know the impact it could have on others. This may be due to the specifics of how the users we support are using dbt, and we are moving towards doing more work in isolated environments where dropping A is acceptable. However, this still requires dropping/cloning tables to achieve a node selection behaviour, which is arguably more work than replacing ref() with a fully qualified table name. Modifying the database in order for refs to resolve as expected feels like a hack to work around defer's behaviour.

  2. This scenario is less common, but is definitely something I've done in the past. Say you are developing C and B and find C_dev <- (A_prod, B_dev) is not as you would expect. It could be an issue with the logic in C and B, but it's also possible there was something you previously failed to account for in model A. In that scenario, to debug this, you may modify A and compare the result of C <- (A_prod, B_dev) to C <- (A_dev, B_dev). In this case, keeping A_dev around is useful to compare to A_prod or to rerun C.

I wonder if a more straightforward approach might be to switch target schemas, and use dbt clone to copy the specific resources you want

The clone approach does work and I've seen a fair bit of its use. The downsides of the clone approach are:

  • Suffers from the same downsides as described in (1) if multiple users are working in the same environment.
  • Requires an extra command to run.

With referencing upstreams from a different target, the experience I've found to be the most intuitive is one that is closest to swapping out table references in SQL. I suspect this is because that is a very common thing to do when iterating on SQL outside of dbt. The closer the defer experience can be to the flexibility that provides, the more use cases it will address.

Due to the complexity of defer's interface, I've come across a few solutions that just override ref or generate_schema/database_name in order to provide a --favor-state-selector-like experience through variables (e.g. --vars <prod_target>: [model_A]). Overriding ref is not ideal, and providing that flexibility through a --favor-state-selector would allow for the same selection behaviour without requiring dropping or cloning tables.

@jtcohen6
Contributor

@d-cole Really helpful, thank you! I've been reflecting more on this, especially now that the heavy-lifting conditional logic for --defer and --favor-state lives much closer to where refs are actually resolved at runtime.

Regarding the two workflows you've outlined:

  • (1) I'm skeptical of multiple users sharing the same development environment. These should be cheap & plentiful, aided by mechanisms like clone to quickly "set up" a virtual environment. Unless - is the idea that each feature branch gets its own environment, and multiple users want to be collaborating in the same dev environment for the same feature branch?
  • (2) This one I think I get! I've made some changes to model_a, but they don't seem quite right as reflected in model_c, and so as a debugging exercise I want to try rerunning model_c [dev] against model_a [prod] without having to first drop model_a [dev]. Meanwhile, model_b [dev] is not a problem, and so model_c [dev] should continue to use it, but without also needing to select and rerun it each time.

This is the part that resonated most with me:

With referencing upstreams from a different target, the experience I've found to be the most intuitive is one that is closest to swapping out table references in SQL. I suspect this is because that is a very common thing to do when iterating on SQL outside of dbt. The closer the defer experience can be to the flexibility that provides, the more use cases it will address.

What if this looked exactly like swapping out the reference within your SQL? If it's part of my debugging of model_c, I would like to force deferral / state favorisation for one upstream ref but not for the other. Is that better?

This is pretty ugly, though not as bad as I expected:

-- macros/ref_from_state.sql
{% macro ref_from_state(model_name) %}
  -- at parse time, just return a simple ref to capture the dependency
  {{ return(ref(model_name)) if not execute }}
  {% for node in graph.nodes.values() %}
    {% if node.name == model_name %}
      -- favor defer_relation if available, unless the upstream model is also selected in current run
      {% set rel = node.defer_relation
          if (node.defer_relation and node.unique_id not in selected_resources)
          else node %}
      {{ return(api.Relation.create(rel.database, rel.schema, rel.alias)) }}
    {% endif %}
  {% endfor %}
{% endmacro %}
-- models/model_c.sql

with model_a as (
    -- select * from {{ ref('model_a') }}
    -- I am manually swapping this to prefer the defer_relation defined in --state manifest
    select * from {{ ref_from_state('model_a') }}
),

model_b as (
    select * from {{ ref('model_b') }}
),

...

I think my primary hesitation stems from the number of flags we already have for very similar (and complex) functionality:

  • --defer (bool) - whether to use provided state for resolving references to unbuilt upstream models
  • --state (Path) - artifact to use for state comparison, and deferral by default
  • --defer-state (Path) - artifact to use for deferral only, so as to be separate from --state
  • --favor-state (bool) - whether to use provided state for resolving references to unselected upstream models, taking precedence over --defer - it might be better called something like --force-defer
  • (proposed)--favor-state-selector (List[SelectionSpec]) - always use provided state for resolving these specific references to unselected upstream models

Both --defer and --favor-state are on/off switches that bake in their own logic about whether to use dev or prod. Maybe they should actually be "dimmer switches," where the light gradients are definable using selection logic. But I really hesitate to add even more complexity here, which is ultimately shifting the burden of the problem to our documentation and the end user.

@d-cole
Contributor Author

d-cole commented May 15, 2024

@jtcohen6 Thank you for looking into this!

What if this looked exactly like swapping out the reference within your SQL? If it's part of my debugging of model_c, I would like to force deferral / state favorisation for one upstream ref but not for the other. Is that better?

This can achieve the desired behaviour but requires altering the SQL which seems like the toil defer is aiming to remove. I could just swap {{ ref('model_a') }} to the production fqn and use defer with roughly the same effort as swapping to this macro.

I think my primary hesitation stems from the number of flags we already have for very similar (and complex) functionality:

Good point, the cli flags can be overwhelming. What do you think of this?

  • For the CLI itself, instead of adding --favor-state-selector, --favor-state is modified to optionally accept a List[SelectionSpec]. Using it would look like:
dbt run -s C --defer --state prod-run-artifacts --favor-state # Same behaviour as now
dbt run -s C --defer --state prod-run-artifacts --favor-state A # New option to favor selected nodes only
  • In dbt Cloud, this CLI complexity can be hidden from users. When the "Defer to production" toggle is selected, an "Upstreams to Defer" selection would appear. Left unspecified, it defaults to the current behaviour, but it gives users the option to specify nodes. Just an idea, I'm sure there are other ways the Cloud UI could hide this complexity.
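
A flag that works both as a bare switch and with an optional list of selectors can be sketched with Python's standard argparse module. This only illustrates the proposed CLI shape: dbt's real CLI is not built this way, and `build_parser`, `favor_mode`, and the `dbt-sketch` program name are hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dbt-sketch")
    parser.add_argument("-s", "--select", nargs="+", default=[])
    parser.add_argument("--defer", action="store_true")
    parser.add_argument("--state")
    # nargs="*" lets the flag appear alone or with one or more selectors;
    # default=None distinguishes "flag absent" from "flag given with no values".
    parser.add_argument("--favor-state", nargs="*", default=None,
                        metavar="SELECTOR")
    return parser


def favor_mode(favor_state: Optional[List[str]]) -> str:
    """Translate the parsed value into one of three modes."""
    if favor_state is None:
        return "off"                        # flag not passed
    if not favor_state:
        return "all"                        # bare --favor-state: current behaviour
    return "only:" + ",".join(favor_state)  # favor the listed nodes only


parser = build_parser()
base = ["-s", "C", "--defer", "--state", "prod-run-artifacts"]
for extra in ([], ["--favor-state"], ["--favor-state", "A"]):
    args = parser.parse_args(base + extra)
    print(extra, "->", favor_mode(args.favor_state))
```

One trade-off of this shape is that a list-valued flag consumes every following bare argument, so it has to be followed by another flag or placed last on the command line.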


4 participants