
Fixed the behavior of the incremental schema change ignore option to properly handle the scenario when columns are dropped #580

Conversation

Contributor

@case-k-git case-k-git commented Feb 7, 2024

Resolves #581

Description

When processing incrementally, newly added columns are ignored under the ignore setting. However, when a SQL model is modified to remove columns, the run fails despite the ignore setting, because it attempts to query a column that does not exist in the created temp table. According to the dbt documentation, the job should not fail when schema changes are ignored, so this has been corrected.

For example, in this use case, even if we remove column_2 from the SQL model, the query still attempts to include column_2 because it exists in the current table schema. However, since column_2 does not exist in the temporary table, the query fails.

The generated SQL insert statement (which fails) looks like this:

insert into table `catalog_demo`.`test17060620400077328677_Incremental_strategies`.`append_delta`
      select  column_1, column_2 from `append_delta__dbt_tmp`
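The fix can be illustrated with a small plain-Python sketch (a hypothetical helper, not the actual Jinja macro): keep only the destination columns that still exist in the source temp table, so a column dropped from the model is skipped instead of breaking the insert.

```python
# Hypothetical sketch of the fix in plain Python (not the actual Jinja macro):
# select only the destination columns that are also present in the source
# temp table, so a dropped column is silently skipped under
# on_schema_change: ignore.

def columns_to_insert(dest_columns, source_columns):
    """Return destination columns also present in the source, in destination order."""
    source_set = set(source_columns)
    return [col for col in dest_columns if col in source_set]

dest = ["column_1", "column_2"]         # existing target table schema
source = ["column_1"]                   # temp table after column_2 was removed
print(columns_to_insert(dest, source))  # ['column_1']
```

With that list, the generated statement would select only `column_1` and the run no longer fails.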

Dbt documentation

Similarly, if you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table.

So this should not happen:
https://docs.getdbt.com/docs/build/incremental-models#default-behavior

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@case-k-git case-k-git changed the title fix incremental schema change ignore → Fixed the behavior of the incremental schema change ignore option to properly handle the scenario when columns are dropped Feb 7, 2024
Collaborator

benc-db commented Feb 7, 2024

Thanks for submitting this. Will review and run tests today.

{%- for dest_col in dest_columns -%}
  {%- for source_col in source_columns -%}
    {%- if dest_col.name == source_col.name -%}
      {%- if common_columns.append(dest_col) -%}{%- endif -%}
    {%- endif -%}
  {%- endfor -%}
{%- endfor -%}
Collaborator

This seems like it will not work in the case where columns are added in the new code. Also, I'm not sure this is any more efficient, but stylistically it might be better to do a single loop and check for presence in the second list:

{%- set source_columns = adapter.get_columns_in_relation(source_relation) | map(attribute="name") -%}
{%- for dest_col in dest_columns -%}
   {% if dest_col.name in source_columns %}
...

Contributor Author

@case-k-git case-k-git Feb 7, 2024


Thank you for your review, yeah, I will refactor it! A single loop is definitely more efficient and cleaner.

This seems like it will not work in the case where columns are added in the new code

I think the default option is ignore, and the existing code also uses the destination columns (it does not use columns that do not exist in the destination table). So I thought adding new columns would not be a problem. What do you think?

Collaborator

I may not be familiar enough with the surrounding code, let me look a little deeper.

Collaborator

Ok, I understand now how this works, I think this is fine.

Collaborator

Let me try explaining it to verify:

This statement is saying that in the target relation (which is the existing table), we are going to insert into it all the new incremental data that matches from the tmp source table; insert into will provide default values for any columns not named. The issue you are solving is that previously we attempted to select columns from source that were only guaranteed to be in target.

What I'm confused by is why no one has raised this issue on dbt-spark :)
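The insert-into semantics described above can be demonstrated with a minimal sqlite3 sketch (a stand-in for Databricks SQL; the table and column names are made up): columns left out of the explicit column list fall back to their default values.

```python
# Minimal sketch of the insert-into behavior described above, using sqlite3
# as a stand-in for Databricks SQL (table/column names are hypothetical):
# columns not named in the column list receive their default value (here NULL).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (column_1 TEXT, column_2 TEXT)")
# Insert only column_1, as the fixed macro would after column_2 was
# dropped from the model but not from the target table.
conn.execute("INSERT INTO target (column_1) SELECT 'a'")
row = conn.execute("SELECT column_1, column_2 FROM target").fetchone()
print(row)  # ('a', None) -- column_2 falls back to its default
```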

Contributor Author

@benc-db Thank you for testing and investigation.
I see. Yeah, I agree, and I wonder why no one has raised this issue before now.

I refactored it as you suggested. Could you please check?
b33764c


Hi @case-k-git, this single-loop iteration is breaking functionality and requires an immediate fix.
It only works when the source and destination have columns in the same order; as a result we are forced to perform a --full-refresh. For instance:

DBT's internal temp view (source columns: [A, B, C, ... N]) and the destination table (target columns arranged in a different order by some downstream process: [A, B, C, D, F, E, ... N]) actually have 14 columns in common, but due to this single-iteration change the comparison is sequential and returns only 4; as a result, 10 blank columns are inserted at the time of final execution. Looking forward to an immediate resolution, and test cases as well. Thank you.
CC: @benc-db
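The ordering pitfall described above can be sketched in plain Python (hypothetical column names): a positional comparison stops matching once the order diverges, while a membership check finds every common column regardless of order.

```python
# Sketch of the reported failure mode (hypothetical column lists):
# a positional comparison only matches columns while the order agrees,
# whereas a membership check finds every common column regardless of order.

source = ["A", "B", "C", "D", "E", "F"]
target = ["A", "B", "C", "D", "F", "E"]  # E and F swapped downstream

positional = [s for s, t in zip(source, target) if s == t]
membership = [t for t in target if t in set(source)]

print(positional)  # ['A', 'B', 'C', 'D'] -- silently drops E and F
print(membership)  # ['A', 'B', 'C', 'D', 'F', 'E'] -- all six common columns
```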

Contributor Author

@nimeshpatni Thank you! I see, let me check!

Contributor Author

I confirmed the error by changing the column order. Thank you.
#594 (comment)

Collaborator

benc-db commented Feb 7, 2024

Running the integration tests now...the interaction here is complicated enough that I want to see if our tests yell, because there's a lot of back and forth between dbt-core and dbt-databricks to come up with the actual merge statement :P

case-k-git and others added 3 commits February 8, 2024 11:38
Collaborator

benc-db commented Feb 8, 2024

Rerunning tests now. Will merge after they pass. I updated the changelog to reflect some weirdness that happened in deploys this week.

@benc-db benc-db merged commit 02e1bb7 into databricks:main Feb 9, 2024
21 checks passed
@case-k-git
Contributor Author

@benc-db Thank you!

@case-k-git
Contributor Author

I requested the same change to dbt-spark.
