
Fixed the behavior of the incremental schema change ignore option to properly handle the scenario when columns are dropped #580

Conversation

Contributor

@case-k-git case-k-git commented Feb 7, 2024

Resolves #581

Description

When processing incrementally, newly added columns are ignored under the ignore setting. However, when a SQL model is modified to remove columns, the run fails despite the ignore setting, because it attempts to query a column that does not exist in the created temp table. According to the dbt documentation, the job should not fail when schema changes are ignored, so this has been corrected.

For example, in this use case, even if we remove column_2 from the SQL model, the query still attempts to include column_2 because it exists in the current table schema. However, since column_2 does not exist in the temporary table, the query fails.

The generated SQL insert statement (which fails) looks like this:

insert into table `catalog_demo`.`test17060620400077328677_Incremental_strategies`.`append_delta`
      select  column_1, column_2 from `append_delta__dbt_tmp`
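The fix can be illustrated with a small plain-Python sketch (a hypothetical helper, not the actual Jinja macro): keep only the destination columns that still exist in the source temp table, so a column dropped from the model is skipped instead of breaking the insert.

```python
# Hypothetical sketch of the fix in plain Python (not the actual Jinja macro):
# select only the destination columns that are also present in the source
# temp table, so a dropped column is silently skipped under
# on_schema_change: ignore.

def columns_to_insert(dest_columns, source_columns):
    """Return destination columns also present in the source, in destination order."""
    source_set = set(source_columns)
    return [col for col in dest_columns if col in source_set]

dest = ["column_1", "column_2"]         # existing target table schema
source = ["column_1"]                   # temp table after column_2 was removed
print(columns_to_insert(dest, source))  # ['column_1']
```

With that list, the generated statement would select only `column_1` and the run no longer fails.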

Dbt documentation

Similarly, if you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table.

So this should not happen:
https://docs.getdbt.com/docs/build/incremental-models#default-behavior

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@case-k-git case-k-git changed the title fix incremental schema change ignore → Fixed the behavior of the incremental schema change ignore option to properly handle the scenario when columns are dropped Feb 7, 2024
Collaborator

benc-db commented Feb 7, 2024

Thanks for submitting this. Will review and run tests today.

{%- for dest_col in dest_columns -%}
  {%- for source_col in source_columns -%}
    {%- if dest_col.name == source_col.name -%}
      {%- if common_columns.append(dest_col) -%}{%- endif -%}
    {%- endif -%}
  {%- endfor -%}
{%- endfor -%}
Collaborator

This seems like it will not work in the case where columns are added in the new code. Also, I'm not sure this is any more efficient, but stylistically it might be better to do a single loop and check for presence in the second list:

{%- set source_columns = adapter.get_columns_in_relation(source_relation) | map(attribute="name") -%}
{%- for dest_col in dest_columns -%}
   {% if dest_col.name in source_columns %}
...

Contributor Author

@case-k-git case-k-git Feb 7, 2024


Thank you for your review, yeah, I will refactor it! A single loop is definitely more efficient and cleaner.

This seems like it will not work in the case where columns are added in the new code

I think the default option is ignore, and the existing code also uses the destination columns (it does not use columns that do not exist in the destination table). So I thought adding new columns would not be a problem. What do you think?

Collaborator

I may not be familiar enough with the surrounding code, let me look a little deeper.

Collaborator

Ok, I understand now how this works, I think this is fine.

Collaborator

Let me try explaining it to verify:

This statement is saying that in the target relation (which is the existing table), we are going to insert into it all the new incremental data that matches from the tmp source table; insert into will provide default values for any columns not named. The issue you are solving is that previously we attempted to select columns from source that were only guaranteed to be in target.

What I'm confused by is why no one has raised this issue on dbt-spark :)
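The insert-into semantics described above can be demonstrated with a minimal sqlite3 sketch (a stand-in for Databricks SQL; the table and column names are made up): columns left out of the explicit column list fall back to their default values.

```python
# Minimal sketch of the insert-into behavior described above, using sqlite3
# as a stand-in for Databricks SQL (table/column names are hypothetical):
# columns not named in the column list receive their default value (here NULL).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (column_1 TEXT, column_2 TEXT)")
# Insert only column_1, as the fixed macro would after column_2 was
# dropped from the model but not from the target table.
conn.execute("INSERT INTO target (column_1) SELECT 'a'")
row = conn.execute("SELECT column_1, column_2 FROM target").fetchone()
print(row)  # ('a', None) -- column_2 falls back to its default
```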

Contributor Author

@benc-db Thank you for testing and investigation.
I see. Yeah, I agree, and I wonder why no one has raised this issue before now.

I refactored it as you suggested. Could you please check?
b33764c


Hi @case-k-git, this single-loop iteration is breaking functionality and requires an immediate fix.
It only works when the source and destination have columns in the same order; as a result we are forced to perform a --full-refresh. For instance:

DBT's internal temp view (source columns: [A, B, C, ... N]) and the destination table (target columns arranged in a different order by some downstream process: [A, B, C, D, F, E, ... N]) actually have 14 columns in common, but due to this single-iteration change the comparison is sequential and returns only 4; as a result, 10 blank columns are inserted at the time of final execution. Looking forward to an immediate resolution, and test cases as well. Thank you.
CC: @benc-db
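The ordering pitfall described above can be sketched in plain Python (hypothetical column names): a positional comparison stops matching once the order diverges, while a membership check finds every common column regardless of order.

```python
# Sketch of the reported failure mode (hypothetical column lists):
# a positional comparison only matches columns while the order agrees,
# whereas a membership check finds every common column regardless of order.

source = ["A", "B", "C", "D", "E", "F"]
target = ["A", "B", "C", "D", "F", "E"]  # E and F swapped downstream

positional = [s for s, t in zip(source, target) if s == t]
membership = [t for t in target if t in set(source)]

print(positional)  # ['A', 'B', 'C', 'D'] -- silently drops E and F
print(membership)  # ['A', 'B', 'C', 'D', 'F', 'E'] -- all six common columns
```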

Contributor Author

@nimeshpatni Thank you! I see, let me check!

Contributor Author

I confirmed the error by changing the column order. Thank you.
#594 (comment)

Collaborator

benc-db commented Feb 7, 2024

Running the integration tests now...the interaction here is complicated enough that I want to see if our tests yell, because there's a lot of back and forth between dbt-core and dbt-databricks to come up with the actual merge statement :P

case-k-git and others added 3 commits February 8, 2024 11:38
Collaborator

benc-db commented Feb 8, 2024

Rerunning tests now. Will merge after they pass. I updated the changelog to reflect some weirdness that happened in deploys this week.

@benc-db benc-db merged commit 02e1bb7 into databricks:main Feb 9, 2024
21 checks passed
@case-k-git
Contributor Author

@benc-db Thank you!

@case-k-git
Contributor Author

I requested the same change to dbt-spark.
