Use common columns for incremental schema changes #4170
Conversation
Force-pushed 5769d61 to 9c66146
Force-pushed db9a132 to 1892607
@Kayrnt I really like the approach you're taking, here + in dbt-labs/dbt-snowflake#38 + dbt-labs/dbt-bigquery#51. When would you feel ready for a review?
I've had little free time so far this week (lots of work), but hopefully I'll have some this weekend. I think it should be working, but I haven't had time to test so far. TBH I'm quite struggling with how to build the dbt-bigquery / dbt-snowflake packages locally and integrate them with a local dbt build as well. If you have a contributing guide such as https://github.com/dbt-labs/dbt-core/blob/main/CONTRIBUTING.md for those packages, that would be pretty helpful! (I couldn't get setuptools to work as intended) 🙏 So far what I think I have to do:
@github-christophe-oudar Makes sense! You're right, we should update the contributing guide to talk about coordinating changes to multiple (non-Postgres) adapters. What I do, roughly, is clone each package (dbt-core, dbt-bigquery, dbt-snowflake) locally, and then:
Since the only changes to dbt-core are to macros (…)
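Since the exact commands from that comment were lost in this capture, here is a rough sketch of one way to set this up (an assumption based on standard editable installs, not necessarily the original commands):

# Clone dbt-core and the adapter repos side by side
git clone https://github.com/dbt-labs/dbt-core.git
git clone https://github.com/dbt-labs/dbt-snowflake.git
git clone https://github.com/dbt-labs/dbt-bigquery.git

# One virtualenv for everything
python3 -m venv env
source env/bin/activate

# Editable installs: local macro changes in dbt-core are picked up
# immediately by the adapter plugins (the core package lives in core/)
pip install -e dbt-core/core -e dbt-snowflake -e dbt-bigquery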
Force-pushed 1892607 to f9f69aa
Thanks for the commands @jtcohen6! However, I'm not sure about the Snowflake part :/
Force-pushed f9f69aa to 5624445
Force-pushed 5624445 to cbd378a
Ok! I took this for a spin locally, and I've got a couple of quick thoughts:
- We can implement this for Postgres + Redshift as well! I don't know why this only just occurred to me. The added bonus is being able to test out these changes in the dbt-core repo first, since we've got Postgres integration tests here.
- Let's aim to keep the materialization code as simple as possible. If there needs to be conditional logic specific to different `on_schema_change` options, let's handle it within the macros called by `process_schema_changes`. As it is, I think both append + sync modes want to use `source_columns` as `dest_columns`, so there's no need for conditional branching. The logic in materializations could just be:
{#-- Process schema changes. Returns dict of changes if successful. Use source columns for upserting/merging --#}
{% set schema_changes_dict = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
{% set dest_columns = schema_changes_dict.get('source_columns') %}
{% if not dest_columns %}
{#-- Otherwise, get columns in destination for upserting/merging --#}
{% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
{% endif %}
Here's a commit that implements this for core/postgres + adds an integration test: b26d5a7. Feel free to use/adapt. Once we get this right for core/postgres:
- it should "just work" on Redshift (we could add the integration test there to confirm)
- the PRs in Snowflake + BigQuery should be very straightforward
@@ -57,7 +65,8 @@
  {%- set target_columns = adapter.get_columns_in_relation(target_relation) -%}
  {%- set source_not_in_target = diff_columns(source_columns, target_columns) -%}
  {%- set target_not_in_source = diff_columns(target_columns, source_columns) -%}
+ {%- set in_target_and_source = intersect_columns(target_columns, source_columns) -%}
We need to add in `source_columns` here, as that's what we'll want to use for `dest_columns` after `process_schema_changes` has processed the changes.
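For context, here's a rough sketch of how the macro's returned dict might be assembled once `in_target_and_source` and `source_columns` are added (illustrative only, based on the variables visible in the diff above; `schema_changed` is assumed to be computed earlier in the same macro):

{#-- Sketch: bundle everything the caller might need into one dict --#}
{% set changes_dict = {
    'schema_changed': schema_changed,
    'source_not_in_target': source_not_in_target,
    'target_not_in_source': target_not_in_source,
    'in_target_and_source': in_target_and_source,
    'source_columns': source_columns,
    'target_columns': target_columns
} %}
{{ return(changes_dict) }}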
CHANGELOG.md
@@ -11,10 +11,12 @@
### Under the hood
- Bump artifact schema versions for 1.0.0: manifest v4, run results v4, sources v3. Notable changes: schema test + data test nodes are renamed to generic test + singular test nodes; freshness threshold default values ([#4191](https://github.com/dbt-labs/dbt-core/pull/4191))
- Speed up node selection by skipping `incorporate_indirect_nodes` if not needed ([#4213](https://github.com/dbt-labs/dbt-core/issues/4213), [#4214](https://github.com/dbt-labs/dbt-core/issues/4214))
- [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))
Suggested change:
- - [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))
+ - When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144), [#4170](https://github.com/dbt-labs/dbt-core/pull/4170))
Hello @jtcohen6, thank you for your suggestions! I'll have a look at your other review on the BQ adapter. I cherry-picked your commit and added the review change to the changelog.
That's a fair thought! The alternative is, rather than having the materialization pull `source_columns` out of the returned dict, to have `process_schema_changes` return just the columns, so the materialization logic becomes:

{#-- Process schema changes. Returns dict of changes if successful. Use source columns for upserting/merging --#}
{% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
{% if not dest_columns %}
{% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
{% endif %}

If a user overrides the `process_schema_changes` macro, they can still return whatever set of columns they need. What do you think?
(closing and re-opening to trigger Postgres integration tests, now that I've added …)
I'm ok with both approaches. If we want to keep it lean, returning as little as necessary might indeed be better (especially since this runs on every execution for all users of incremental models). Let me know if you think I should make the change.
@github-christophe-oudar Let's go lean! I suppose it is possible to imagine an incremental strategy that would want other information besides the set of common columns (post-schema evolution). All strategies that exist today want to operate on a single set of shared columns, named `dest_columns`.
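To make the lean shape concrete, `process_schema_changes` could end up looking roughly like this (a sketch based on the discussion above; `check_for_schema_changes` and `sync_column_schemas` are assumed helper names, and the exact merged code may differ):

{% macro process_schema_changes(on_schema_change, source_relation, target_relation) %}
  {% if on_schema_change == 'ignore' %}
    {#-- Return nothing: the materialization falls back to get_columns_in_relation --#}
    {{ return({}) }}
  {% else %}
    {% set schema_changes_dict = check_for_schema_changes(source_relation, target_relation) %}
    {% if schema_changes_dict['schema_changed'] %}
      {#-- fail, append, or sync, depending on on_schema_change --#}
      {% do sync_column_schemas(on_schema_change, target_relation, schema_changes_dict) %}
    {% endif %}
    {#-- Lean return: just the post-evolution source columns, used as dest_columns --#}
    {{ return(schema_changes_dict['source_columns']) }}
  {% endif %}
{% endmacro %}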
Force-pushed 2b8de32 to 33c8b90
Force-pushed 33c8b90 to 267a2a6
Ok, I updated the 3 PRs with the related code 👍
@Kayrnt This looks great to me! Do you feel ready to have this merged?
As soon as the code is merged into `dbt-core`, the plugin PRs will be able to use this latest code in their CI. I see you've updated the dbt-snowflake + dbt-bigquery PRs to use the equivalent materialization logic. Could I trouble you to also add the new integration test case (`run_incremental_append_new_columns_remove_one`) as part of those PRs? You may also need to pull/rebase to get the latest test fixes in `main`.
If you feel up to it, you could add just the integration test case to `dbt-redshift` as well. No substantive changes needed :)
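For anyone following along, the scenario that test case covers can be sketched with a model like this (a hypothetical fixture, not the actual test file; `model_a` is a placeholder source with columns id, field1, field2, field3). The first run builds the table with id, field1, field2; the incremental run selects id, field1, field3, i.e. one column added and one removed, and on_schema_change='append_new_columns' should append field3 while leaving the now-unselected field2 in place:

{{ config(materialized='incremental', unique_key='id', on_schema_change='append_new_columns') }}

with source_data as (
    select * from {{ ref('model_a') }}
)

{% if is_incremental() %}
select id, field1, field3 from source_data where id not in (select id from {{ this }})
{% else %}
select id, field1, field2 from source_data where id <= 3
{% endif %}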
Sure, let's go ahead with this one! 🚀 I'm used to fetching & rebasing whenever I push, so no worries, we should be good 👍
@Kayrnt Rock on!
Thank you so much for contributing :)
Great that we're able to move quite fast on this topic! Thanks for spending part of your Sunday pushing all this forward! 🙏
resolves #4144
Description
In this PR, `process_schema_changes` is modified to return the dict with schema-change information so that the column changes in the incremental materialization can be managed by plugins (dbt-bigquery, dbt-snowflake).

Checklist
- I have updated the `CHANGELOG.md` and added information about my change
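Since the description mentions plugins consuming this, here's a sketch of the consuming side in an adapter's incremental materialization (simplified and illustrative; the `get_merge_sql` call shown is an assumption about the surrounding code):

{% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
{% if not dest_columns %}
  {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
{% endif %}
{#-- dest_columns (the common, post-evolution columns) feed the merge/upsert SQL --#}
{% set build_sql = get_merge_sql(target_relation, tmp_relation, unique_key, dest_columns) %}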