Use common columns for incremental schema changes #4170

Merged
4 commits merged into dbt-labs:main from on_schema_change_columns_fix on Nov 7, 2021

Conversation

@Kayrnt (Contributor) commented Oct 30, 2021

resolves #4144

Description

In this PR, process_schema_changes is modified to return the dict with schema changes information so that the column changes in incremental materialization can be managed by plugins (dbt-bigquery, dbt-snowflake).

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR (tests are on the adapter PRs)
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Oct 30, 2021
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch 4 times, most recently from 5769d61 to 9c66146 Compare October 31, 2021 00:35
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch 2 times, most recently from db9a132 to 1892607 Compare November 1, 2021 20:39
@jtcohen6 (Contributor) commented Nov 5, 2021

@Kayrnt I really like the approach you're taking, here + in dbt-labs/dbt-snowflake#38 + dbt-labs/dbt-bigquery#51.

When would you feel ready for a review?

@github-christophe-oudar

I've had little free time so far this week (lots of work), but hopefully I'll have some time this weekend. I think it should be working, but I haven't had time to test yet. TBH I'm quite struggling with how to build the dbt-bigquery / dbt-snowflake packages locally and integrate them with a local dbt-core build as well; if you have a contributing guide such as https://github.com/dbt-labs/dbt-core/blob/main/CONTRIBUTING.md for those packages, that would be pretty helpful! (I couldn't get setuptools to work as intended) 🙏

So far what I think I have to do:

  • Test locally that the PR fixes the issue on BigQuery (I don't have a Snowflake env)
  • Add/modify an integration test on Snowflake

@jtcohen6 (Contributor) commented Nov 5, 2021

@github-christophe-oudar Makes sense! You're right, we should update the contributing guide to talk about coordinating changes to multiple (non-Postgres) adapters.

What I do, roughly, is clone each package (dbt-core, dbt-bigquery, dbt-snowflake) locally, and then:

cd scratch-project
python3 -m venv env
source env/bin/activate
pip install -e <path/to/core/repo>/core -e <path/to/snowflake/repo> -e <path/to/bigquery/repo>

Since the only changes to dbt-core are to macros (process_schema_changes etc.), you can also test the changes in your local project by copy-pasting the edited versions of those macros into your local macros/ folder. That might be easier; you just need to remember to copy the edited macros back into the dbt-core PR when you're done :)

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 1892607 to f9f69aa Compare November 6, 2021 01:05
@github-christophe-oudar

Thanks for the commands @jtcohen6!
I used pip install -e ../../tools/dbt-bigquery -e ../../tools/dbt -e ../../tools/dbt-snowflake and it worked... but the code from the dbt-core part wasn't used at runtime. It's weird, because changes to dbt-bigquery are properly reflected with that pip install. However, I used the macros/ directory trick in my model project and verified it worked as intended... and there was actually a bug I fixed (it's the source and not the target to use when adding columns, since target = the existing table and source = the tmp table).

However, I'm not sure about the Snowflake part :/
Could you help me with that?

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from f9f69aa to 5624445 Compare November 6, 2021 01:22
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 5624445 to cbd378a Compare November 6, 2021 01:24
@Kayrnt Kayrnt marked this pull request as ready for review November 6, 2021 01:30
@jtcohen6 (Contributor) left a comment

Ok! I took this for a spin locally, and I've got a couple of quick thoughts:

  • We can implement this for Postgres + Redshift as well! I don't know why this only just occurred to me. The added bonus is being able to test out these changes in the dbt-core repo first, since we've got Postgres integration tests here.
  • Let's aim to keep the materialization code as simple as possible. If there needs to be conditional logic specific to different on_schema_change options, let's handle it within the macros called by process_schema_changes. As it is, I think both append + sync modes want to use source_columns as dest_columns, so there's no need for conditional branching. The logic in materializations could just be:
    {#-- Process schema changes. Returns dict of changes if successful. Use source columns for upserting/merging --#}
    {% set schema_changes_dict = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
    {% set dest_columns = schema_changes_dict.get('source_columns') %}
    {% if not dest_columns %}
      {#-- Otherwise, get columns in destination for upserting/merging --#}
      {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
    {% endif %}

Here's a commit that implements this for core/postgres + adds an integration test: b26d5a7. Feel free to use/adapt. Once we get this right for core/postgres:

  • it should "just work" on Redshift (we could add the integration test there to confirm)
  • the PRs in Snowflake + BigQuery should be very straightforward

@@ -57,7 +65,8 @@
{%- set target_columns = adapter.get_columns_in_relation(target_relation) -%}
{%- set source_not_in_target = diff_columns(source_columns, target_columns) -%}
{%- set target_not_in_source = diff_columns(target_columns, source_columns) -%}

{%- set in_target_and_source = intersect_columns(target_columns, source_columns) -%}

We need to add source_columns here, as that's what we'll want to use for dest_columns after process_schema_changes has run.
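
For illustration, the returned dict could then look roughly like this (a sketch only; the key names are assumed from the variables visible in this diff, not necessarily the final ones):

    {#-- Sketch: the dict check_for_schema_changes could return once source_columns is included --#}
    {% set changes_dict = {
        'schema_changed': schema_changed,
        'source_not_in_target': source_not_in_target,
        'target_not_in_source': target_not_in_source,
        'common_columns': in_target_and_source,
        'source_columns': source_columns,
        'target_columns': target_columns
    } %}
    {{ return(changes_dict) }}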

CHANGELOG.md Outdated
@@ -11,10 +11,12 @@
### Under the hood
- Bump artifact schema versions for 1.0.0: manifest v4, run results v4, sources v3. Notable changes: schema test + data test nodes are renamed to generic test + singular test nodes; freshness threshold default values ([#4191](https://github.com/dbt-labs/dbt-core/pull/4191))
- Speed up node selection by skipping `incorporate_indirect_nodes` if not needed ([#4213](https://github.com/dbt-labs/dbt-core/issues/4213), [#4214](https://github.com/dbt-labs/dbt-core/issues/4214))
- [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))

Suggested change
- [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))
- When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144), [#4170](https://github.com/dbt-labs/dbt-core/pull/4170))

@github-christophe-oudar

Hello @jtcohen6, thank you for your suggestions!
Indeed, I didn't look into Postgres/Redshift, but if those databases also support incremental models in dbt, it would be a shame not to support them too.

I'll have a look at your other review on BQ adapter.
I also thought the branching was a bit overkill; it was a "safety measure" in case a custom strategy didn't update the target table in the database before the merge code runs, which would result in a failing query.
I'll remove it, and I guess we can expect anyone writing a custom strategy to make the appropriate changes if needed.

I cherry-picked your commit and added the review change to the changelog.
It looks like tests are passing.

@jtcohen6 (Contributor) commented Nov 7, 2021

I also thought the branching was a bit overkill; it was a "safety measure" in case a custom strategy didn't update the target table in the database before the merge code runs, which would result in a failing query.
I'll remove it, and I guess we can expect anyone writing a custom strategy to make the appropriate changes if needed.

That's a fair thought! The alternative is that, rather than returning the full schema_changes_dict, process_schema_changes could return just the list of common columns (i.e. the dest_columns to use when merging/upserting). Then the materialization code just pulls in the returned value:

    {#-- Process schema changes. Returns the source columns to use for upserting/merging, if any --#}
    {% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
    {% if not dest_columns %}
      {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
    {% endif %}

If a user overrides the on_schema_change macro family with a custom mode, they'd just need process_schema_changes to return the right set of columns as a last step, and they wouldn't need to reimplement the incremental materialization as well.
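
As a rough sketch (simplified; the exact shape of check_for_schema_changes, sync_column_schemas, and the dict keys are assumed here, not taken verbatim from this PR), that last step could look like:

    {% macro process_schema_changes(on_schema_change, source_relation, target_relation) %}
        {% if on_schema_change == 'ignore' %}
            {{ return(none) }}
        {% endif %}
        {% set schema_changes_dict = check_for_schema_changes(source_relation, target_relation) %}
        {% if schema_changes_dict['schema_changed'] %}
            {% do sync_column_schemas(on_schema_change, target_relation, schema_changes_dict) %}
        {% endif %}
        {#-- Last step: hand back the columns the materialization should use for merging/upserting --#}
        {{ return(schema_changes_dict['source_columns']) }}
    {% endmacro %}

Returning none in the ignore case keeps the existing fallback to adapter.get_columns_in_relation working unchanged.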

What do you think?

@jtcohen6 (Contributor) commented Nov 7, 2021

(closing and re-opening to trigger Postgres integration tests, now that I've added the ok to test label)

@jtcohen6 jtcohen6 closed this Nov 7, 2021
@jtcohen6 jtcohen6 reopened this Nov 7, 2021
@github-christophe-oudar

I'm ok with both approaches:

  • Either we have a fairly exhaustive set of generic column-related information (source, target, intersection, disjunction, ...), as done in this PR so far, and the incremental strategy picks what it needs.
  • Or we go with the one you suggested, having check_for_schema_changes return only what's needed; so far, only source_columns appears relevant for that case.

If we want to keep it lean, returning as little as necessary might indeed be better (especially since this runs on every execution of every user's incremental models).

Let me know if you think I should make the change.

@jtcohen6 (Contributor) commented Nov 7, 2021

@github-christophe-oudar Let's go lean!

I suppose it's possible to imagine an incremental strategy that would want other information besides the set of common columns (post-schema evolution), but all strategies that exist today operate on a single set of shared columns, named dest_columns. It also isn't currently possible for users to define and register totally new incremental strategies without reimplementing several other macros and/or the incremental materialization (though that's worth thinking more about: #2366).
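
As a purely illustrative example (not verbatim from any adapter's materialization), this is the kind of place dest_columns ends up, whichever way process_schema_changes hands it over:

    {#-- Illustrative only: every current strategy threads a single dest_columns list into its merge/upsert macro --#}
    {% set build_sql = get_merge_sql(target_relation, tmp_relation, unique_key, dest_columns) %}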

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 2b8de32 to 33c8b90 Compare November 7, 2021 15:43
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 33c8b90 to 267a2a6 Compare November 7, 2021 15:44
@github-christophe-oudar

OK, I updated the 3 PRs with the related code 👍
You're right; so far, I don't really see a common incremental strategy that would require more than those common columns.

@jtcohen6 (Contributor) left a comment

@Kayrnt This looks great to me! Do you feel ready to have this merged?

As soon as the code is merged into dbt-core, the plugin PRs will be able to use this latest code in their CI. I see you've updated the dbt-snowflake + dbt-bigquery PRs to use the equivalent materialization logic. Could I trouble you to also add the new integration test case (run_incremental_append_new_columns_remove_one) as part of those PRs? You may also need to pull/rebase to get the latest test fixes in main.

If you feel up to it, you could add just the integration test case to dbt-redshift as well. No substantive changes needed :)
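
For reference, a rough sketch of the kind of model such a test case would exercise (column names and the ref'd seed are placeholders for illustration, not the actual fixture):

    {{ config(materialized='incremental', unique_key='id', on_schema_change='append_new_columns') }}

    {% if is_incremental() %}
        {#-- Incremental run: field2 is dropped from the query and field3 is added; once the new column
             is appended to the target, the merge should use the source columns rather than all target columns --#}
        select id, field1, field3 from {{ ref('seed') }}
    {% else %}
        select id, field1, field2 from {{ ref('seed') }}
    {% endif %}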

@github-christophe-oudar

Sure, let's go ahead with this one! 🚀
Regarding dbt-snowflake & dbt-bigquery, sure, I can handle adding a new integration test case. I have to go now, but I should be able to get those added this evening. I'll see if I can get one more in for dbt-redshift as well.

I usually fetch & rebase whenever I push, so no worries, we should be good 👍

@jtcohen6 (Contributor) left a comment

@Kayrnt Rock on!

Thank you very much for contributing :)

@jtcohen6 jtcohen6 merged commit b591e1a into dbt-labs:main Nov 7, 2021
@github-christophe-oudar

Great that we're able to move quite fast on this topic!
I think it was one of my last remaining pain points with dbt! I'm looking forward to using these changes in production 🙌

Thanks for spending part of your Sunday moving all this forward! 🙏

Successfully merging this pull request may close these issues.

When on_schema_change is set, pass common columns as dest_columns in incremental merge macros
3 participants