Subselectors for state:modified #2704
Comments
There's been some discussion recently (slack thread) about the plausibility of a subselector that would enable us to differentiate between:
I think representing each of those with a separate

An alternative approach, within the CI job definition, is to run all modified models, plus any models that are the immediate ancestor of a modified test:

    dbt run -m state:modified 1+state:modified,1+test_type:schema 1+state:modified,1+test_type:data --defer --state ...
    dbt test -m state:modified --state ...

The syntax is verbose, and it probably includes a few resources it doesn't need to, but overall this approach is compelling. It also manages to handle cases where one leg of a

Previously, we were thinking that we'd resolve this wrinkle via [updated 10/15 to use |
Well, I've made this pitch before so I guess I'll do it again for old times' sake: I think the best way to do this is to make it so that That would mean we could get rid of the goofy concept of selector expansion, which only exists to preserve this behavior, where dependent tests are added in after selection runs for |
@beckjake Ahh. Thanks for bringing that back. We do have a larger rework of tests on the horizon, so the time could be ripe for making this change once and for all. Do I have it right that the biggest trade-off here would be losing the ability to do things like:
Today, this means "run all tests that ref any resource that depends on For the purposes of this issue, I'll take that as an argument against trying to code up |
I think |
I do think it'd be reasonable to create some extra syntax for "all the tests that test the matched selector" to go along with this change. I'm not sure dbt is really ready for meta-selectors or whatever that would be called, but maybe it's more reasonable than I'm imagining. I do get that it's enticing to be able to say "take the dbt run command I just ran and run all of the tests for it". I just think that currently, it's very easy to do that and very painful to avoid selecting particular tests, due to the nature of selector expansion. |
Ok, this clicked for me. Far less breaking than I was thinking above. I just went back to reread this, too, while we're here. Was there a time that we previously decided against doing away with test selector expansion? Or did we not reach a verdict, and kept the status quo?
Right, I think this makes sense. I mean... maybe
I strongly agree. I think this discussion could become a v1.0 issue that we hit along with the other v1.0 issues for tests later this year. In a world where this behavior exists, the mutually exclusive groups I mentioned above could be specified as:
|
Yeah, long before #2203 even! I'm not sure where it was, it may unfortunately even have been private messages on slack or something. I recall it was considered just too much of a breaking change and it made it difficult for professional services (and therefore, presumably everyone else who uses dbt, but I think that was the source of the feedback?).
At first I thought this was a great idea but then I remembered that |
Weighing in here with a minor comment from Slack (thread) that relates to #2701: I'd like to keep the scenario described by @jtcohen6 on that issue alive, since it's quite a common one and doesn't seem to have a clear solution at the moment:
An example scenario would be the one I describe in the Slack thread, where a data test references both a new model and an existing unmodified model (which might not be anywhere in the graph neighbourhood of the new model), and current deferral functionality will not defer either model (while I would expect it to defer the existing unmodified model). |
Adding a little bit more context that emerged: we arrived at a solution for this particular case that dragged in a few extra models (by using the

A possible solution for this would be adding support for order of operations. On the one hand, maybe this is overengineering (and it makes maintaining compatibility between the CLI and the YAML selectors a bit of a pain), but on the other hand lack of control over the order here does rule out some common scenarios like the one outlined in my previous comment. This could look something like:

    selectors:
      - name: parents_of_data_tests
        definition:
          # run this intersection first
          intersection:
            - 'state:modified'
            - 'test_type:data'
          # then get parents of the resulting models
          parents: true
          parents_depth: 1

... and the equivalent in the CLI:

    dbt ls -s 1+(state:modified,test_type:data) --state ./target/

|
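To make the order-of-operations point concrete, here is a minimal Python sketch on a toy graph. Everything in it is hypothetical: the node names, the `PARENTS` mapping, and the `with_parents` helper are illustrations, not dbt internals. It assumes that a comma means intersection and that `1+` adds one level of parents to whatever it is attached to.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy graph: each node maps to its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "new_model": set(),
    "existing_model": set(),
    "changed_model": set(),
    "data_test": {"new_model", "existing_model"},  # new data test
    "other_test": {"changed_model"},               # unmodified data test
}
MODIFIED: Set[str] = {"new_model", "changed_model", "data_test"}
DATA_TESTS: Set[str] = {"data_test", "other_test"}


def with_parents(nodes: Iterable[str], depth: int = 1) -> Set[str]:
    """Return the nodes plus their ancestors up to `depth` levels."""
    result = set(nodes)
    frontier = set(nodes)
    for _ in range(depth):
        frontier = {p for n in frontier for p in PARENTS.get(n, set())}
        result |= frontier
    return result


# Proposed grouping, 1+(state:modified,test_type:data):
# intersect first, then add the parents of the result.
intersect_then_parents = with_parents(MODIFIED & DATA_TESTS)

# Ungrouped form, 1+state:modified,1+test_type:data:
# add parents to each operand, then intersect.
parents_then_intersect = with_parents(MODIFIED) & with_parents(DATA_TESTS)

print(sorted(intersect_then_parents))
print(sorted(parents_then_intersect))
print(sorted(parents_then_intersect - intersect_then_parents))
```

In this toy graph the ungrouped form also picks up `changed_model`, a modified model that happens to be the parent of an unrelated data test, which is the kind of extra model mentioned above.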
@jtcohen6 just here to give a big 👍 to the idea of modification subselectors. Specifically, |
This definitely looks promising! Up to now, current The |
This is a follow-on from the initial proposals in #2465, #2641. Most of the required work is around exposing + sugaring the foundational work in #2695.
The way state comparison works is by identifying discrepancies between two manifests. When comparing between a past prod manifest and the current development manifest, discrepancies can be the result of two things:
- Changes made in development
- Env-aware logic that depends on target, env vars, etc.

We're going to do our best to capture only changes that are the result of development. If someone's project has tons of intricate env-aware logic, they'll run more models than they want (i.e. more false positives). So we're giving them the option to turn off some knobs, in the form of more-specific subselectors.
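As a rough illustration of comparing two manifests, the sketch below treats a manifest as a plain dictionary keyed by node id. This is a simplification: the `Manifest` type, the `modified_nodes` helper, and the sample field names are assumptions made for the example, not dbt's actual manifest schema or comparison code.

```python
from typing import Any, Dict, Set

# Simplified stand-in for a manifest: unique_id -> node attributes.
Manifest = Dict[str, Dict[str, Any]]


def modified_nodes(previous: Manifest, current: Manifest) -> Set[str]:
    """Return ids of nodes that are new or differ from the previous manifest."""
    changed: Set[str] = set()
    for unique_id, node in current.items():
        old = previous.get(unique_id)
        if old is None or old != node:
            changed.add(unique_id)
    return changed


prod: Manifest = {
    "model.orders": {"raw_sql": "select 1", "schema": "analytics"},
    "model.customers": {"raw_sql": "select 2", "schema": "analytics"},
}
dev: Manifest = {
    # Only an env-aware value differs: a false positive.
    "model.orders": {"raw_sql": "select 1", "schema": "dbt_alice"},
    # A real edit made in development.
    "model.customers": {"raw_sql": "select 2 as id", "schema": "analytics"},
    # A brand new model.
    "model.payments": {"raw_sql": "select 3", "schema": "analytics"},
}

print(sorted(modified_nodes(prod, dev)))
```

Here `model.orders` is flagged even though nobody edited it, because a naive whole-node comparison cannot tell an environment difference from a development change.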
Subselectors
There is potential for overlap: a single change can trigger multiple modification categories.
- state:modified.contents: This alone would get a lot of people what they want! It's basically "just hash the files," excluding YAML configuration.
- state:modified.configs: Changes to configs set in dbt_project.yml or {{config()}} blocks. If the changes are made in a {{config()}} block, they will also be picked up as content changes.
- state:modified.descriptions: If persist_docs is turned on for a node, description changes count as modifications. (If just columns, just column descriptions; just relations, top-level descriptions; if both, then both.)
- state:modified.database_representations: Changes to the configured database, schema, identifier. This value represents the manual input only, and it's different from the resolved database representation, which depends on the target and generate_x_name macros.
  - If someone sets schema = target.schema, or schema = target.schema ~ '_suffix' instead of using the generate_schema_name macro, that will register as a change between environments and they'll want to turn this off.
  - Depending on generate_x_name logic and the current environment, a change to the configured value may not actually change the database representation. We'll still register it as a modification.
  - database, schema, or identifier has changed. If someone has env-aware definitions, they'll want to turn this off.

Default behavior
I think state:modified should include all changes from all the categories above. The question mark is whether database_representations should be included in the default, since this is the area where people do the most custom things, and it's the knob that will likely be switched off most often. For the sake of clarity, I think it's best to have the state:modified selector be a superset of all modified subselectors.

Future art
- state:modified.macros: state:modified.macros+ would include all downstream models, tests, etc. that call (directly or indirectly) a macro that has changed
  - generate_schema_name
- state:modified.vars: state:modified.vars+ would include all downstream models, tests, etc. that call (directly or indirectly) a var that has changed

We will update state:modified to include both of these as well.
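To illustrate how the subselector categories can overlap, and what it means for state:modified to be their superset, here is a minimal Python sketch. The `CATEGORY_FIELDS` mapping, the node field names, and both helper functions are hypothetical and chosen only for the example; they do not describe how dbt stores or compares nodes.

```python
from typing import Any, Dict, Iterable, Set

Node = Dict[str, Any]

# Hypothetical mapping: subselector suffix -> node fields compared for it.
CATEGORY_FIELDS: Dict[str, Iterable[str]] = {
    "contents": ("raw_sql",),
    "configs": ("config",),
    "descriptions": ("description",),
    "database_representations": ("database", "schema", "identifier"),
}


def modification_categories(old: Node, new: Node) -> Set[str]:
    """Return every category in which the two versions of a node differ."""
    return {
        category
        for category, fields in CATEGORY_FIELDS.items()
        if any(old.get(f) != new.get(f) for f in fields)
    }


def is_modified(old: Node, new: Node, ignore: Iterable[str] = ()) -> bool:
    """Union of all categories, minus any that have been switched off."""
    return bool(modification_categories(old, new) - set(ignore))


old = {
    "raw_sql": "{{ config(materialized='view') }} select 1",
    "config": {"materialized": "view"},
    "schema": "analytics",
}
# Editing the config() block changes both the file contents and the config.
edited = {
    "raw_sql": "{{ config(materialized='table') }} select 1",
    "config": {"materialized": "table"},
    "schema": "analytics",
}
# Only the env-aware schema differs between environments.
env_only = dict(old, schema="dbt_alice")

print(sorted(modification_categories(old, edited)))
print(is_modified(old, env_only))
print(is_modified(old, env_only, ignore=["database_representations"]))
```

The first result shows one edit landing in two categories; the last two show a node that counts as modified by default but not once the database_representations knob is turned off.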