CT 2196, CT2121 constraints column order #7161
Conversation
Change the check of column definitions to be order agnostic.
Looks good to me! Just a question of whether you want to update the wording of the error to mention whether the model has contracts enabled.
core/dbt/exceptions.py:

```python
msg = (
    "Please ensure the name, data_type, and number of columns in your `yml` file "
    "match the columns in your SQL file.\n"
    f"Schema File Columns: {self.yaml_columns}\n"
    f"SQL File Columns: {self.sql_columns}"
)
```
Were you going to update this as related to #7147?
```jinja
{#-- Column with name not found in yaml --#}
{%- do exceptions.raise_contract_error(yaml_columns, sql_columns) -%}
{%- endif -%}
{%- if sql_col.dtype != yaml_col[0].dtype -%}
```
Previously, column types were compared based on their formatted representation, which could be data-platform-specific as implemented by the `<adapter>__format_column` macro.

For example, BigQuery's `format_column` implementation compares `data_type` values by `column.data_type` rather than `column.dtype`, in order to make comparisons on nested data types.

Strictly comparing SQL and yml `data_type` values by `dtype` would allow the following contract to be accepted in BigQuery:

```sql
SELECT
  STRUCT("test" AS name, [1.0, 2.0] AS laps) as some_struct
```

```yaml
models:
  - name: test_schema
    config:
      contract: true
    columns:
      - name: some_struct
        data_type: STRUCT<name FLOAT64, laps STRING> # wrong! but accepted because dtype == STRUCT for both SQL and schema.yml
```

One workaround would be to do this comparison using `format_column` instead, i.e. `adapter.dispatch('format_column', 'dbt')(sql_col) != adapter.dispatch('format_column', 'dbt')(yaml_col[0])`. This would also ensure the comparison and error messaging use consistent logic. An alternative would be to implement a `default__compare_data_types` macro to enable adapter-specific implementations.
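To make the pitfall concrete, here is a minimal Python sketch (the column shape and type strings are hypothetical, not dbt's actual `Column` class) showing how comparing only the top-level `dtype` accepts a mismatched nested type, while comparing the full formatted `data_type` catches it:

```python
# Hypothetical column representation: `dtype` is the top-level type,
# `data_type` is the fully formatted (possibly nested) type string.
from collections import namedtuple

Column = namedtuple("Column", ["name", "dtype", "data_type"])

# Column produced from the SQL query (BigQuery-style nested type)
sql_col = Column("some_struct", "STRUCT", "STRUCT<name STRING, laps ARRAY<FLOAT64>>")
# Column declared in schema.yml -- the inner types are wrong
yaml_col = Column("some_struct", "STRUCT", "STRUCT<name FLOAT64, laps STRING>")

# dtype-only comparison: both are "STRUCT", so the wrong contract would pass
assert sql_col.dtype == yaml_col.dtype

# full formatted comparison: the mismatch is caught
assert sql_col.data_type != yaml_col.data_type
```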
I looked at using `format_column`, but the default implementation of that just uses the `dtype`. Is that different in the other adapters?
I don't actually see any implementations of `format_column` in the adapters.
Oh, I was on the wrong BigQuery branch. BigQuery is the only adapter that implements it.
Could we include the formatted column in the Column structures returned by `get_column_schema_from_query`?
Or concatenate the other parts of the column structure for comparison?
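As a rough sketch of that idea (the field names and dict shape here are made up for illustration), concatenating the structural parts of a column into a single comparison key might look like:

```python
# Hypothetical: build one comparison key from all parts of a column
# structure, so nested type information participates in the comparison.
def column_key(col: dict) -> str:
    # `fields` holds nested sub-columns for STRUCT-like types, if any
    parts = [col["name"].lower(), col["dtype"].upper()]
    for sub in col.get("fields", []):
        parts.append(column_key(sub))
    return "|".join(parts)

sql_col = {
    "name": "some_struct", "dtype": "STRUCT",
    "fields": [{"name": "name", "dtype": "STRING"},
               {"name": "laps", "dtype": "FLOAT64"}],
}
yaml_col = {
    "name": "some_struct", "dtype": "STRUCT",
    "fields": [{"name": "name", "dtype": "FLOAT64"},
               {"name": "laps", "dtype": "STRING"}],
}

# Top-level dtype alone matches, but the concatenated keys differ
print(column_key(sql_col))   # some_struct|STRUCT|name|STRING|laps|FLOAT64
assert column_key(sql_col) != column_key(yaml_col)
```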
This looks like a lot of logic to put in a jinja template. Could we put this in a python function and then wrap that python function in this macro? I'm cutting corners, but this is what's in my head:

columns_spec_ddl.sql:

```jinja
{% macro assert_columns_equivalent(sql) %}
    {% set sql_schema = get_column_schema_from_query(sql) %}
    {% set model_contract = model_contract(model['columns']) %}
    {% do assert_schema_meets_contract(sql_schema, model_contract) %}
{% endmacro %}
```

dbt/adapters/base/impl.py (for lack of a better spot):

```python
# these are defined elsewhere, but look something like this
ModelContract = List[ColumnContract]
Schema = List[Column]

def model_contract(model) -> ModelContract:
    # I assume we have a way of creating a model contract from a `schema.yml` file
    return ModelContract(model)

def assert_schema_meets_contract(schema: Schema, model_contract: ModelContract) -> None:
    if len(schema) != len(model_contract):
        raise ContractError(msg)
    for schema_column, contract_column in zip(sorted(schema), sorted(model_contract)):
        try:
            assert schema_column.name == contract_column.name
            assert schema_column.dtype == contract_column.dtype
        except AssertionError:
            raise ContractError(msg)
```

I think the python version would be much easier to unit test.
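For illustration, a self-contained, runnable version of that sketch might look like the following; names such as `ColumnContract` and `ContractError` are placeholders, not dbt's real classes:

```python
# Hypothetical, order-agnostic contract check; the real dbt classes differ.
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True, order=True)
class Column:
    name: str
    dtype: str

# In this sketch a contract column has the same shape as a schema column
ColumnContract = Column
Schema = List[Column]
ModelContract = List[ColumnContract]

class ContractError(Exception):
    pass

def assert_schema_meets_contract(schema: Schema, contract: ModelContract) -> None:
    # Order-agnostic: sort both sides by column name before comparing.
    if len(schema) != len(contract):
        raise ContractError("column count mismatch")
    for schema_col, contract_col in zip(sorted(schema), sorted(contract)):
        if schema_col.name != contract_col.name or schema_col.dtype != contract_col.dtype:
            raise ContractError(f"{schema_col} != {contract_col}")

# Columns declared in different orders still pass:
assert_schema_meets_contract(
    [Column("b", "int"), Column("a", "text")],
    [Column("a", "text"), Column("b", "int")],
)
```

A plain function like this can be unit tested directly, without rendering any jinja.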
```python
"Contracts are enabled for this model. "
"Please ensure the name, data_type, and number of columns in your `yml` file "
"match the columns in your SQL file.\n"
f"Schema File Columns: {self.yaml_columns}\n"
```
It might be useful to sort these both alphabetically (message only) so that it's easier for the user to spot the difference.
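As a sketch of that suggestion (the column lists below are made up), sorting only the displayed lists keeps the comparison logic untouched while making the two lists easier to line up visually:

```python
# Hypothetical error-message formatting: sort both column lists by name
# for display only, so a user can spot the difference at a glance.
yaml_columns = [("id", "integer"), ("name", "text"), ("created_at", "timestamp")]
sql_columns = [("created_at", "timestamp"), ("id", "integer"), ("full_name", "text")]

def format_columns(cols):
    # Sorting happens here, in the message only -- not in the actual check
    return ", ".join(f"{name} {dtype}" for name, dtype in sorted(cols))

msg = (
    "Please ensure the name, data_type, and number of columns in your `yml` file "
    "match the columns in your SQL file.\n"
    f"Schema File Columns: {format_columns(yaml_columns)}\n"
    f"SQL File Columns: {format_columns(sql_columns)}"
)
print(msg)
```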
I don't see how a `ModelContract` would be different from `model.columns`. Also, this code has the same issue as the earlier jinja code: it only checks `dtype`, not the formatted column. Would we also have to implement a `format_column` method in the python adapters?
Currently, these two are really the same thing, but I could see a definition of `ModelContract` diverging from `model.columns` in Mike's proposal.

I agree with this - we'd need to add an overridable `format_column` method.
This all might be a good idea, but it's totally out of scope for this ticket as it was defined. If you want us to close this one and wait until the other implementation is defined, let me know.
100% agree! The scope for this issue was to update the behaviour of the contract checks + constraint ddl generation to be order-agnostic, not a broader rewrite of the contract check. I think that applies if we decide to go down the route of introducing a more generalized `compare_data_types` macro.
I'm thinking through what this might look like across all potential adapters, and whether we need to add contract-related items in the future. With that context, I have the following questions:

- Do you think this could benefit from becoming a "dispatch" method? The goal of this would be to create a hard divide between contracted models and non-contracted models.
- Do we need to include the column select statement in `default__get_select_subquery` if we already validated with `get_assert_columns_equivalent`? We validate that the number of columns is the same and the names are the same, so I think the subquery already limits to just the columns that we want.
- It looks like `get_assert_columns_equivalent` might have been renamed `assert_columns_equivalent`.
With my assumptions (not necessarily true of course):

```jinja
{% macro default__create_table_as(temporary, relation, sql) -%}
    {% if config.get('contract', False) %}
        {{ default__create_table_as_with_contract(temporary, relation, sql) }}
    {% else %}
        {{ default__create_table_as_without_contract(temporary, relation, sql) }}
    {% endif %}
{% endmacro %}

{% macro default__create_table_as_with_contract(temporary, relation, sql) %}
    {{ get_assert_columns_equivalent(sql) }}
    create {% if temporary -%}temporary{%- endif %} table
        {{ relation.include(database=(not temporary), schema=(not temporary)) }}
        {{ get_columns_spec_ddl() }}
    as ({{ sql }})
{% endmacro %}

{% macro default__create_table_as_without_contract(temporary, relation, sql) %}
    create {% if temporary -%}temporary{%- endif %} table
        {{ relation.include(database=(not temporary), schema=(not temporary)) }}
    as ({{ sql }})
{% endmacro %}
```

This would maintain backwards compatibility because we're keeping the `create_table_as` macro, which is what would have been overridden. And if we need to update only one of these in the future, it isolates the updates instead of impacting all "create table" workflows. I'm open to feedback; I'm just trying to communicate a thought here. Please let me know what you think of these recommendations.
We need `get_select_subquery` in order to put the columns in the right order. We removed validating the order because we added the subquery. As far as splitting the macro into two, probably @jtcohen6 and @MichelleArk should weigh in on that.
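To illustrate why the subquery matters (a pure sketch: dbt's actual `get_select_subquery` is a jinja macro, not this Python function), reordering is done by selecting columns in the contract's order around the model SQL:

```python
# Hypothetical sketch of what a column-reordering subquery does: wrap the
# model SQL and select columns in the order the contract declares them.
def get_select_subquery(sql: str, contract_column_names: list) -> str:
    column_list = ", ".join(contract_column_names)
    return f"select {column_list} from ({sql}) as model_subq"

query = get_select_subquery(
    "select 2 as b, 1 as a",   # the model SQL emits columns b, a
    ["a", "b"],                # the contract declares a, b
)
print(query)
# The created table's columns then follow the contract order, so the
# equivalence check no longer needs to care about order.
```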
resolves #6975, #7064
Description
Modify the `create_table_as` ddl to use the order specified in the model config "columns". Modify `assert_columns_equivalent` to not check column order.

Checklist

- I have run `changie new` to create a changelog entry