Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor deferral: always merge_from_artifact & support all commands #9040

Merged
merged 9 commits into from
Jan 17, 2024

Conversation

jtcohen6
Copy link
Contributor

@jtcohen6 jtcohen6 commented Nov 8, 2023

resolves #7965
resolves #8715
throwback: #2740

Problem

  • Weird logical divergence between clone and other tasks in terms of how deferral works. (I borrowed some of @MichelleArk's spike work from spike: inferred contracts #8339.)
  • Some commands don't support --defer & --state, but they shouldn't raise an error. That's the case currently (e.g. if you submit run-operation within the IDE while "Defer to production" is enabled)

Solution

  • Consolidate how deferral works for all runnable tasks
  • Support --defer & friends as "global" flags

@runleonarun adding copilot summary

This pull request primarily focuses on consolidating deferral methods and flags within the dbt project. The changes involve modifications to various methods and classes across multiple files, including core/dbt/cli/flags.py and core/dbt/cli/main.py. The most significant changes include the addition of a new params_assigned_from_user set and changes to the _assign_params method, reordering and removal of decorators in core/dbt/cli/main.py, and the introduction of defer_relation in core/dbt/contracts/graph/manifest.py.

  1. Changes to _assign_params method in core/dbt/cli/flags.py:

    • A new set params_assigned_from_user is introduced and used to track user-provided and default values across both 'parent' and 'child' levels. This is to support detection of mutually exclusive flags later on. [1] [2]
    • The _assign_params method is modified to include params_assigned_from_user as a parameter. [1] [2]
  2. Reordering and removal of decorators in core/dbt/cli/main.py:

    • Several decorators are added to the global_flags function and removed from other functions. This change is likely aimed at consolidating the use of these decorators. [1] [2] [3] and others)
  3. Introduction of defer_relation in core/dbt/contracts/graph/manifest.py:

    • For non-ephemeral refable nodes, a new defer_relation attribute is added. This attribute is used when the model is deferred and the adapter doesn't support zero-copy cloning. [1] [2]
  4. Changes in core/dbt/task/clone.py:

    • The CloneTask class now always requires a state manifest, regardless of whether the --defer flag has been set.
    • The defer_to_manifest method is removed. Unlike other commands, 'clone' always requires a state manifest.
  5. Consolidation of deferral methods and flags:

    • An 'Under the Hood' change is added to consolidate deferral methods and flags.

These changes seem to be part of a larger refactoring effort aimed at simplifying the codebase and improving the handling of deferral methods and flags.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

@cla-bot cla-bot bot added the cla:yes label Nov 8, 2023
Copy link
Contributor

github-actions bot commented Nov 8, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link

codecov bot commented Nov 8, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (a1f78a8) 86.65% compared to head (1156ca2) 86.60%.

Files Patch % Lines
core/dbt/task/runnable.py 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9040      +/-   ##
==========================================
- Coverage   86.65%   86.60%   -0.06%     
==========================================
  Files         230      230              
  Lines       27060    26987      -73     
==========================================
- Hits        23449    23372      -77     
- Misses       3611     3615       +4     
Flag Coverage Δ
integration 83.48% <97.50%> (-0.13%) ⬇️
unit 65.22% <75.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jtcohen6 jtcohen6 force-pushed the jerco/7965-refactor-deferral branch 2 times, most recently from 67543e0 to edf37c0 Compare November 8, 2023 21:46
@jtcohen6 jtcohen6 changed the title Refactor deferral: always merge_from_artifact & support all commands Refactor deferral: always merge_from_artifact & support all commands Nov 8, 2023
@jtcohen6 jtcohen6 force-pushed the jerco/7965-refactor-deferral branch from 51642a3 to b9a61d0 Compare November 8, 2023 23:08
@jtcohen6 jtcohen6 force-pushed the jerco/7965-refactor-deferral branch from 86415da to ed0c432 Compare December 3, 2023 18:45
@jtcohen6 jtcohen6 requested a review from MichelleArk December 3, 2023 18:45
@jtcohen6 jtcohen6 marked this pull request as ready for review December 3, 2023 18:45
@jtcohen6 jtcohen6 requested a review from a team as a code owner December 3, 2023 18:45
favor_state=bool(self.args.favor_state),
)
# TODO: is it wrong to write the manifest here? I think it's right...
write_manifest(self.manifest, self.config.project_target_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to gate on self.config.args.write_json here, similar to what's done when the manifest is initially written here: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/cli/requires.py#L277

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

longer term.. I wonder if it'd actually make more sense to do this deferral + merging business in the requires.manifest wrapper to centralize where manifest loading + writing happens. It seems like at this point there is some command-specific functionality which justifies this functionality living at the task-level for now though

Copy link
Contributor Author

@jtcohen6 jtcohen6 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest reason to have this at the task level right now is that we're running database queries (= populating + check the adapter cache) while calculating deferral, to know which selected models exist in the expected target schema(s). Everything in requires.manifest ("parsing") takes place before we've made any database connections.

We could try more substantially changing the way that deferral works: During manifest loading, we add defer_relation as an attribute to every node. Then, during each node's compilation, the RuntimeRefResolver checks to see if that node exists in the expected schema (cache lookup) — or skips the check if --favor-state — and resolves to defer_relation instead if not found. We wouldn't be recording in the manifest that a given node was deferred — but that doesn't feel like such a big loss, and it would still be reflected in the compiled_code of other nodes downstream.

That feels significantly tidier. What's your (& my) appetite to try doing that here? Or in a follow-up PR? :)


In fact, come to think of it, this bit of the clone task now doesn't make nearly as much sense, because defer_to_manifest needs to populate the adapter cache as part of resolving which nodes are "deferred" and which just get a defer_relation, which isn't even a meaningful distinction for the purpose of clone:

# unlike in other tasks, we want to add information from the --state manifest *before* caching!
self.defer_to_manifest(adapter, selected_uids)


Risk of breaking changes? Currently, an unselected node will be actually overwritten with the node entry from the stateful manifest. Every single attribute. If you write graph-parsing Jinja logic, and traverse to that node's entry, those are the attributes you'd see. So custom integrations which write those node attributes back to the data warehouse (for example) might be a little bit funky. I think that's quite niche as backwards-incompatible behavior changes go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thought got the better of me — here's a first attempt:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue working on the follow-up refactor once this is merged to de-risk the scope of these changes, especially with all the other under-the-hood refactoring going in this release :)

@dbeatty10
Copy link
Contributor

This covers 3 out of the 4 UserConfig missing from global configs that showed up in a separate and independent audit.

What about indirect_selection? It is the only 1 of the 4 that showed up there but is not covered here.

@@ -62,10 +62,6 @@ def print_result_line(self, result):


class SeedTask(RunTask):
def defer_to_manifest(self, adapter, selected_uids):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @MichelleArk: Should we actually still disable deferral for tasks like seed/freshness/list, since those tasks never (in theory) resolve runtime references?

It's technically possible to have a seed reference another node, via a pre/post hook. Or write custom code in a project (on-run-*) hook that wants to use the defer_relation property of nodes in the graph. I don't think that's good, but it is possible.

At this point it feels like a choice between consistency, or a (small) performance optimization. Either way, I want all commands to support the flags as CLI options. We could keep the overrides in for now, until the follow-up refactor (#9199) that removes deferral logic from tasks entirely, and makes it part of manifest loading. That does feel like the right direction, and this is a step on the way there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants