Fixed the behavior of the incremental schema change ignore option to properly handle the scenario when columns are dropped #580
Conversation
Thanks for submitting this. Will review and run tests today.
{%- for dest_col in dest_columns -%}
  {%- for source_col in source_columns -%}
    {%- if dest_col.name == source_col.name -%}
      {%- if common_columns.append(dest_col) -%}{%- endif -%}
This seems like it will not work in the case where columns are added in the new code. Also, I'm not sure this is any more efficient, but stylistically it might be better to do a single loop and check for presence in the second list:
{%- set source_columns = adapter.get_columns_in_relation(source_relation) | map(attribute="name") -%}
{%- for dest_col in dest_columns -%}
{% if dest_col.name in source_columns %}
...
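For reference, one way the completed single-loop version could look (illustrative only, not necessarily the code that was merged; it reuses the dest_columns, source_relation, and common_columns names from the snippets above):

    {#- Build the list of columns present in both relations, matching by name. -#}
    {%- set source_columns = adapter.get_columns_in_relation(source_relation)
            | map(attribute="name") | list -%}
    {%- set common_columns = [] -%}
    {%- for dest_col in dest_columns -%}
      {%- if dest_col.name in source_columns -%}
        {%- do common_columns.append(dest_col) -%}
      {%- endif -%}
    {%- endfor -%}

Converting the mapped names to a list matters here: Jinja's map filter returns a lazy generator, and an "in" check would exhaust it after the first loop iteration.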
Thank you for your review, yeah, I will refactor it! A single loop is definitely more efficient and cleaner.
"This seems like it will not work in the case where columns are added in the new code"
I think the default option is ignore, and the existing code also uses the destination columns (it does not use columns that do not exist in the destination table). So I thought adding new columns would not be a problem. What do you think?
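For context, the setting being discussed is the model-level on_schema_change config, which dbt documents as defaulting to ignore. A minimal sketch of an incremental model config (the model name and upstream ref are hypothetical):

    {# Hypothetical incremental model; on_schema_change defaults to "ignore"
       even when it is not set explicitly. #}
    {{ config(
        materialized="incremental",
        on_schema_change="ignore"
    ) }}

    select * from {{ ref("some_upstream_model") }}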
I may not be familiar enough with the surrounding code, let me look a little deeper.
Ok, I understand now how this works, I think this is fine.
Let me try explaining it to verify:
This statement is saying that in the target relation (which is the existing table), we are going to insert all the new incremental data that matches from the tmp source table; insert into will provide default values for any columns not named. The issue you are solving is that previously we attempted to select columns from source that were only guaranteed to be in target.
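A small SQL sketch of that point (table and column names are made up; the real statement is generated by the incremental materialization): any destination column omitted from the insert column list simply receives its default, while naming a column the temp view no longer produces makes the statement fail.

    -- Destination table has columns (a, b, c); the model no longer produces c.
    -- Pre-fix shape: c is named because it exists in the destination, but the
    -- temp view has no such column, so the statement errors out.
    insert into target_table (a, b, c)
    select a, b, c from target_table__dbt_tmp;

    -- Restricting the column list to the common columns lets it run;
    -- c in the destination falls back to its default value (NULL).
    insert into target_table (a, b)
    select a, b from target_table__dbt_tmp;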
What I'm confused by is why no one has raised this issue on dbt-spark :)
Hi @case-k-git, this single-loop iteration is breaking functionality and requires an immediate fix.
It only works when the source and destination have their columns in the same order; as a result we are forced to perform a --full-refresh. For instance:
dbt's internal temp view (source columns: [A, B, C, ... N]) and the destination table (target columns arranged in a different order by some process: [A, B, C, D, F, E, ... N]) actually have 14 columns in common, but with this single-iteration change the comparison behaves positionally and finds only 4, so 10 blank columns are inserted at final execution. Looking forward to a prompt resolution, and test cases as well. Thank you.
CC: @benc-db
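A hypothetical sketch of the ordering problem being reported (relation and column names invented for illustration):

    -- Destination columns: (a, b, c, d, f, e)   -- e and f swapped by some process
    -- Temp view columns:   (a, b, c, d, e, f)

    -- If only (a, b, c, d) are detected as common, the generated statement names
    -- just those four and the remaining destination columns end up blank/NULL:
    insert into target_table (a, b, c, d)
    select a, b, c, d from target_table__dbt_tmp;

    -- Matching columns by name instead yields all six, regardless of order:
    insert into target_table (a, b, c, d, f, e)
    select a, b, c, d, f, e from target_table__dbt_tmp;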
@nimeshpatni Thank you! I see, let me check!
I confirmed the error by changing the column order. Thank you.
#594 (comment)
Running the integration tests now... the interaction here is complicated enough that I want to see if our tests yell, because there's a lot of back and forth between dbt-core and dbt-databricks to come up with the actual merge statement :P
….com:case-k-git/dbt-databricks into feature/fix-incremental-schema-change-ignore
Rerunning tests now. Will merge after they pass. I updated the changelog to reflect some weirdness that happened in deploys this week.
@benc-db Thank you!
I requested the same change to dbt-spark.
Resolves #581
Description
When processing incrementally, newly added columns are ignored by the ignore setting. However, when a SQL model is modified to remove columns, the run fails despite the ignore setting, because the generated query references a column that does not exist in the created temp table. According to the dbt documentation, the job should not fail when set to ignore, so this has been corrected.
For example, in this use case, even if we remove column_2 from the SQL model, the query still attempts to include column_2 because it exists in the current table schema. However, since column_2 does not exist in the temporary table, the query fails. The intended SQL insert statement should reference only the columns common to both relations; a sketch follows.
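A minimal sketch, assuming a destination table named target_table and a dbt temporary relation named target_table__dbt_tmp (both placeholder names):

    -- Only column_1 exists in both the destination table and the temp view,
    -- so only it is named; column_2 in the destination falls back to its default.
    insert into target_table (column_1)
    select column_1 from target_table__dbt_tmp;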
Per the dbt documentation, this should not happen:
https://docs.getdbt.com/docs/build/incremental-models#default-behavior
Checklist
I have updated CHANGELOG.md and added information about my change to the "dbt-databricks next" section.