Use common columns for incremental schema changes #4170

Merged
4 commits merged into dbt-labs:main from on_schema_change_columns_fix on Nov 7, 2021

Conversation

@Kayrnt (Contributor) commented Oct 30, 2021

resolves #4144

Description

In this PR, process_schema_changes is modified to return the dict with schema changes information so that the column changes in incremental materialization can be managed by plugins (dbt-bigquery, dbt-snowflake).

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR (tests are on the adapter PRs)
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Oct 30, 2021
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch 4 times, most recently from 5769d61 to 9c66146 Compare October 31, 2021 00:35
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch 2 times, most recently from db9a132 to 1892607 Compare November 1, 2021 20:39
@jtcohen6 (Contributor) commented Nov 5, 2021

@Kayrnt I really like the approach you're taking, here + in dbt-labs/dbt-snowflake#38 + dbt-labs/dbt-bigquery#51.

When would you feel ready for a review?

@github-christophe-oudar

I've had little free time so far this week (lots of work), but hopefully I'll have some time this weekend. I think it should be working, but I haven't had time to test yet. TBH I'm quite struggling with how to build the dbt-bigquery / dbt-snowflake packages locally and integrate them with a local dbt-core build as well; if you have a contributing guide such as https://github.com/dbt-labs/dbt-core/blob/main/CONTRIBUTING.md for those packages, that would be pretty helpful! (I couldn't get setuptools to work as intended) 🙏

So far what I think I have to do:

  • Test locally that the PR fixes the issue on BigQuery (I don't have a Snowflake env)
  • Add/modify an integration test on Snowflake

@jtcohen6 (Contributor) commented Nov 5, 2021

@github-christophe-oudar Makes sense! You're right, we should update the contributing guide to talk about coordinating changes to multiple (non-Postgres) adapters.

What I do, roughly, is clone each package (dbt-core, dbt-bigquery, dbt-snowflake) locally, and then:

cd scratch-project
python3 -m venv env
source env/bin/activate
pip install -e <path/to/core/repo>/core -e <path/to/snowflake/repo> -e <path/to/bigquery/repo>

Since the only changes to dbt-core are to macros (process_schema_changes etc.), you can also test the changes in your local project by copy-pasting the edited versions of those macros into your local macros/ folder. That might be easier; you just need to remember to copy the edited macros back into the dbt-core PR when you're done :)

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 1892607 to f9f69aa Compare November 6, 2021 01:05
@github-christophe-oudar

Thanks for the commands @jtcohen6!
I used pip install -e ../../tools/dbt-bigquery -e ../../tools/dbt -e ../../tools/dbt-snowflake and it worked... but the code from the dbt-core part wasn't used at runtime. It's weird, because changes to dbt-bigquery are properly reflected with that pip install. However, I used the macros/ directory trick in my model project and verified it worked as intended... and there was actually a bug I fixed (it's the source and not the target to use when adding columns, since target = the existing table and source = the tmp table).

However, I'm not sure about the Snowflake part :/
Could you help me with that?

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from f9f69aa to 5624445 Compare November 6, 2021 01:22
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 5624445 to cbd378a Compare November 6, 2021 01:24
@Kayrnt Kayrnt marked this pull request as ready for review November 6, 2021 01:30
@jtcohen6 (Contributor) left a comment

Ok! I took this for a spin locally, and I've got a couple of quick thoughts:

  • We can implement this for Postgres + Redshift as well! I don't know why this only just occurred to me. The added bonus is being able to test out these changes in the dbt-core repo first, since we've got Postgres integration tests here.
  • Let's aim to keep the materialization code as simple as possible. If there needs to be conditional logic specific to different on_schema_change options, let's handle it within the macros called by process_schema_changes. As it is, I think both append + sync modes want to use source_columns as dest_columns, so there's no need for conditional branching. The logic in materializations could just be:
    {#-- Process schema changes. Returns dict of changes if successful. Use source columns for upserting/merging --#}
    {% set schema_changes_dict = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
    {% set dest_columns = schema_changes_dict.get('source_columns') %}
    {% if not dest_columns %}
      {#-- Otherwise, get columns in destination for upserting/merging --#}
      {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
    {% endif %}

Here's a commit that implements this for core/postgres + adds an integration test: b26d5a7. Feel free to use/adapt. Once we get this right for core/postgres:

  • it should "just work" on Redshift (we could add the integration test there to confirm)
  • the PRs in Snowflake + BigQuery should be very straightforward

@@ -57,7 +65,8 @@
{%- set target_columns = adapter.get_columns_in_relation(target_relation) -%}
{%- set source_not_in_target = diff_columns(source_columns, target_columns) -%}
{%- set target_not_in_source = diff_columns(target_columns, source_columns) -%}

{%- set in_target_and_source = intersect_columns(target_columns, source_columns) -%}

We need to add source_columns here, as that's what we'll want to use for dest_columns after process_schema_changes has run.
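
For illustration, the returned dict could then look roughly like this (a sketch only; the key names are assumed from the variables visible in this diff, not necessarily the final ones):

    {#-- Sketch: the dict check_for_schema_changes could return once source_columns is included --#}
    {% set changes_dict = {
        'schema_changed': schema_changed,
        'source_not_in_target': source_not_in_target,
        'target_not_in_source': target_not_in_source,
        'common_columns': in_target_and_source,
        'source_columns': source_columns,
        'target_columns': target_columns
    } %}
    {{ return(changes_dict) }}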

CHANGELOG.md Outdated
@@ -11,10 +11,12 @@
### Under the hood
- Bump artifact schema versions for 1.0.0: manifest v4, run results v4, sources v3. Notable changes: schema test + data test nodes are renamed to generic test + singular test nodes; freshness threshold default values ([#4191](https://github.com/dbt-labs/dbt-core/pull/4191))
- Speed up node selection by skipping `incorporate_indirect_nodes` if not needed ([#4213](https://github.com/dbt-labs/dbt-core/issues/4213), [#4214](https://github.com/dbt-labs/dbt-core/issues/4214))
- [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))

Suggested change
- [SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144))
- When on_schema_change is set, pass common columns as dest_columns in incremental merge macros ([#4144](https://github.com/dbt-labs/dbt-core/issues/4144), [#4170](https://github.com/dbt-labs/dbt-core/pull/4170))

@github-christophe-oudar

Hello @jtcohen6, thank you for your suggestions!
Indeed, I didn't look into Postgres/Redshift, but if those databases also support incremental models in dbt, it would be a shame not to support them too.

I'll have a look at your other review on BQ adapter.
I also thought the branching was a bit overkill; it was a "safety measure" in case a custom strategy didn't update the target table in the database before the merge code runs, which would result in a failing query.
I'll remove it, and I guess we can expect anyone writing a custom strategy to make the appropriate changes if needed.

I cherry-picked your commit and added the review change to the changelog.
It looks like tests are passing.

@jtcohen6 (Contributor) commented Nov 7, 2021

I also thought the branching was a bit overkill; it was a "safety measure" in case a custom strategy didn't update the target table in the database before the merge code runs, which would result in a failing query.
I'll remove it, and I guess we can expect anyone writing a custom strategy to make the appropriate changes if needed.

That's a fair thought! The alternative is that, rather than returning the full schema_changes_dict, process_schema_changes could return just the list of common columns (i.e. the dest_columns to use when merging/upserting). Then the materialization code just pulls in the returned value:

    {#-- Process schema changes. Returns the source columns to use for upserting/merging, if any --#}
    {% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
    {% if not dest_columns %}
      {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
    {% endif %}

If a user overrides the on_schema_change macro family with a custom mode, they'd just need process_schema_changes to return the right set of columns as a last step, and they wouldn't need to reimplement the incremental materialization as well.
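
As a rough sketch (simplified; the exact shape of check_for_schema_changes, sync_column_schemas, and the dict keys are assumed here, not taken verbatim from this PR), that last step could look like:

    {% macro process_schema_changes(on_schema_change, source_relation, target_relation) %}
        {% if on_schema_change == 'ignore' %}
            {{ return(none) }}
        {% endif %}
        {% set schema_changes_dict = check_for_schema_changes(source_relation, target_relation) %}
        {% if schema_changes_dict['schema_changed'] %}
            {% do sync_column_schemas(on_schema_change, target_relation, schema_changes_dict) %}
        {% endif %}
        {#-- Last step: hand back the columns the materialization should use for merging/upserting --#}
        {{ return(schema_changes_dict['source_columns']) }}
    {% endmacro %}

Returning none in the ignore case keeps the existing fallback to adapter.get_columns_in_relation working unchanged.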

What do you think?

@jtcohen6 (Contributor) commented Nov 7, 2021

(closing and re-opening to trigger Postgres integration tests, now that I've added the ok to test label)

@jtcohen6 jtcohen6 closed this Nov 7, 2021
@jtcohen6 jtcohen6 reopened this Nov 7, 2021
@github-christophe-oudar

I'm ok with both approaches:

  • Either we have a fairly exhaustive set of generic column-related information (source, target, intersection, disjunction, ...), as done in this PR so far, and the incremental strategy picks what it needs.
  • Or we go with the one you suggested, having check_for_schema_changes return only what's needed; so far, only source_columns appears relevant for that case.

If we want to keep it lean, returning as little as necessary might indeed be better (especially since this runs on every execution of every user's incremental models).

Let me know if you think I should make the change.

@jtcohen6 (Contributor) commented Nov 7, 2021

@github-christophe-oudar Let's go lean!

I suppose it's possible to imagine an incremental strategy that would want other information besides the set of common columns (post-schema evolution), but all strategies that exist today operate on a single set of shared columns, named dest_columns. It also isn't currently possible for users to define and register totally new incremental strategies without reimplementing several other macros and/or the incremental materialization (though that's worth thinking more about: #2366).
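
As a purely illustrative example (not verbatim from any adapter's materialization), this is the kind of place dest_columns ends up, whichever way process_schema_changes hands it over:

    {#-- Illustrative only: every current strategy threads a single dest_columns list into its merge/upsert macro --#}
    {% set build_sql = get_merge_sql(target_relation, tmp_relation, unique_key, dest_columns) %}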

@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 2b8de32 to 33c8b90 Compare November 7, 2021 15:43
@Kayrnt Kayrnt force-pushed the on_schema_change_columns_fix branch from 33c8b90 to 267a2a6 Compare November 7, 2021 15:44
@github-christophe-oudar

OK, I updated the 3 PRs with the related code 👍
You're right; so far, I don't really see a common incremental strategy that would require more than those common columns.

@jtcohen6 (Contributor) left a comment

@Kayrnt This looks great to me! Do you feel ready to have this merged?

As soon as the code is merged into dbt-core, the plugin PRs will be able to use this latest code in their CI. I see you've updated the dbt-snowflake + dbt-bigquery PRs to use the equivalent materialization logic. Could I trouble you to also add the new integration test case (run_incremental_append_new_columns_remove_one) as part of those PRs? You may also need to pull/rebase to get the latest test fixes in main.

If you feel up to it, you could add just the integration test case to dbt-redshift as well. No substantive changes needed :)
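
For reference, a rough sketch of the kind of model such a test case would exercise (column names and the ref'd seed are placeholders for illustration, not the actual fixture):

    {{ config(materialized='incremental', unique_key='id', on_schema_change='append_new_columns') }}

    {% if is_incremental() %}
        {#-- Incremental run: field2 is dropped from the query and field3 is added; once the new column
             is appended to the target, the merge should use the source columns rather than all target columns --#}
        select id, field1, field3 from {{ ref('seed') }}
    {% else %}
        select id, field1, field2 from {{ ref('seed') }}
    {% endif %}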

@github-christophe-oudar

Sure, let's go ahead with this one! 🚀
Regarding dbt-snowflake & dbt-bigquery, sure, I can handle adding a new integration test case. I have to go now, but I should be able to get those added this evening. I'll see if I can get one more in for dbt-redshift as well.

I usually fetch & rebase whenever I push, so no worries, we should be good 👍

@jtcohen6 (Contributor) left a comment

@Kayrnt Rock on!

Thank you very much for contributing :)

@jtcohen6 jtcohen6 merged commit b591e1a into dbt-labs:main Nov 7, 2021
@github-christophe-oudar

Great that we're able to move quite fast on this topic!
I think it was one of my last remaining pain points with dbt! I'm looking forward to using these changes in production 🙌

Thanks for spending part of your Sunday moving all this forward! 🙏

Successfully merging this pull request may close these issues.

When on_schema_change is set, pass common columns as dest_columns in incremental merge macros
3 participants