CT 2196, CT2121 constraints column order #7161
Conversation
Change the check of column definitions to be order agnostic.
Looks good to me! Just a question of whether you want to update the wording of the error to mention whether the model has contracts enabled.
core/dbt/exceptions.py:

```python
msg = (
    "Please ensure the name, data_type, and number of columns in your `yml` file "
    "match the columns in your SQL file.\n"
    f"Schema File Columns: {self.yaml_columns}\n"
    f"SQL File Columns: {self.sql_columns}"
)
```
Were you going to update this as related to #7147?
```jinja
{#-- Column with name not found in yaml --#}
{%- do exceptions.raise_contract_error(yaml_columns, sql_columns) -%}
{%- endif -%}
{%- if sql_col.dtype != yaml_col[0].dtype -%}
```
Previously, column types were compared based on their formatted representation, which could be data-platform-specific as implemented by the `<adapter>__format_column` macro.

For example, BigQuery's `format_column` implementation compares `data_type` values by `column.data_type` rather than `column.dtype`, in order to make comparisons on nested data types.

Strictly comparing SQL and yml `data_type` values by `dtype` would allow the following contract to be accepted in BigQuery:

```sql
SELECT
  STRUCT("test" AS name, [1.0, 2.0] AS laps) as some_struct
```

```yaml
models:
  - name: test_schema
    config:
      contract: true
    columns:
      - name: some_struct
        data_type: STRUCT<name FLOAT64, laps STRING> # wrong! but accepted because dtype == STRUCT for both SQL and schema.yml
```

One workaround would be to do this comparison using `format_column` instead, i.e. `adapter.dispatch('format_column', 'dbt')(sql_col) != adapter.dispatch('format_column', 'dbt')(yaml_col[0])`. This would also ensure the comparison and error messaging use consistent logic. An alternative would be to implement a `default__compare_data_types` macro to enable adapter-specific implementations.
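To make the pitfall concrete, here is a minimal Python sketch (the column shape and type strings are hypothetical, not dbt's actual `Column` class) showing how comparing only the top-level `dtype` accepts a mismatched nested type, while comparing the full formatted `data_type` catches it:

```python
# Hypothetical column representation: `dtype` is the top-level type,
# `data_type` is the fully formatted (possibly nested) type string.
from collections import namedtuple

Column = namedtuple("Column", ["name", "dtype", "data_type"])

# Column produced from the SQL query (BigQuery-style nested type)
sql_col = Column("some_struct", "STRUCT", "STRUCT<name STRING, laps ARRAY<FLOAT64>>")
# Column declared in schema.yml -- the inner types are wrong
yaml_col = Column("some_struct", "STRUCT", "STRUCT<name FLOAT64, laps STRING>")

# dtype-only comparison: both are "STRUCT", so the wrong contract would pass
assert sql_col.dtype == yaml_col.dtype

# full formatted comparison: the mismatch is caught
assert sql_col.data_type != yaml_col.data_type
```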
I looked at using `format_column`, but the default implementation of that just uses the `dtype`. Is that different in the other adapters?
I don't actually see any implementations of `format_column` in the adapters.
Oh, I was on the wrong BigQuery branch. BigQuery is the only adapter that implements it.
Could we include the formatted column in the Column structures returned by `get_column_schema_from_query`?
Or concatenate the other parts of the column structure for comparison?
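As a rough sketch of that idea (the field names and dict shape here are made up for illustration), concatenating the structural parts of a column into a single comparison key might look like:

```python
# Hypothetical: build one comparison key from all parts of a column
# structure, so nested type information participates in the comparison.
def column_key(col: dict) -> str:
    # `fields` holds nested sub-columns for STRUCT-like types, if any
    parts = [col["name"].lower(), col["dtype"].upper()]
    for sub in col.get("fields", []):
        parts.append(column_key(sub))
    return "|".join(parts)

sql_col = {
    "name": "some_struct", "dtype": "STRUCT",
    "fields": [{"name": "name", "dtype": "STRING"},
               {"name": "laps", "dtype": "FLOAT64"}],
}
yaml_col = {
    "name": "some_struct", "dtype": "STRUCT",
    "fields": [{"name": "name", "dtype": "FLOAT64"},
               {"name": "laps", "dtype": "STRING"}],
}

# Top-level dtype alone matches, but the concatenated keys differ
print(column_key(sql_col))   # some_struct|STRUCT|name|STRING|laps|FLOAT64
assert column_key(sql_col) != column_key(yaml_col)
```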
This looks like a lot of logic to put in a jinja template. Could we put this in a python function and then wrap that python function in this macro? I'm cutting corners, but this is what's in my head:

columns_spec_ddl.sql:

```jinja
{% macro assert_columns_equivalent(sql) %}
    {% set sql_schema = get_column_schema_from_query(sql) %}
    {% set model_contract = model_contract(model['columns']) %}
    {% do assert_schema_meets_contract(sql_schema, model_contract) %}
{% endmacro %}
```

dbt/adapters/base/impl.py (for lack of a better spot):

```python
# these are defined elsewhere, but look something like this
ModelContract = List[ColumnContract]
Schema = List[Column]

def model_contract(model) -> ModelContract:
    # I assume we have a way of creating a model contract from a `schema.yml` file
    return ModelContract(model)

def assert_schema_meets_contract(schema: Schema, model_contract: ModelContract) -> None:
    if len(schema) != len(model_contract):
        raise ContractError(msg)
    for schema_column, contract_column in zip(sorted(schema), sorted(model_contract)):
        try:
            assert schema_column.name == contract_column.name
            assert schema_column.dtype == contract_column.dtype
        except AssertionError:
            raise ContractError(msg)
```

I think the python version would be much easier to unit test.
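For illustration, a self-contained, runnable version of that sketch might look like the following; names such as `ColumnContract` and `ContractError` are placeholders, not dbt's real classes:

```python
# Hypothetical, order-agnostic contract check; the real dbt classes differ.
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True, order=True)
class Column:
    name: str
    dtype: str

# In this sketch a contract column has the same shape as a schema column
ColumnContract = Column
Schema = List[Column]
ModelContract = List[ColumnContract]

class ContractError(Exception):
    pass

def assert_schema_meets_contract(schema: Schema, contract: ModelContract) -> None:
    # Order-agnostic: sort both sides by column name before comparing.
    if len(schema) != len(contract):
        raise ContractError("column count mismatch")
    for schema_col, contract_col in zip(sorted(schema), sorted(contract)):
        if schema_col.name != contract_col.name or schema_col.dtype != contract_col.dtype:
            raise ContractError(f"{schema_col} != {contract_col}")

# Columns declared in different orders still pass:
assert_schema_meets_contract(
    [Column("b", "int"), Column("a", "text")],
    [Column("a", "text"), Column("b", "int")],
)
```

A plain function like this can be unit tested directly, without rendering any jinja.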
```python
"Contracts are enabled for this model. "
"Please ensure the name, data_type, and number of columns in your `yml` file "
"match the columns in your SQL file.\n"
f"Schema File Columns: {self.yaml_columns}\n"
```
It might be useful to sort these both alphabetically (message only) so that it's easier for the user to spot the difference.
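As a sketch of that suggestion (the column lists below are made up), sorting only the displayed lists keeps the comparison logic untouched while making the two lists easier to line up visually:

```python
# Hypothetical error-message formatting: sort both column lists by name
# for display only, so a user can spot the difference at a glance.
yaml_columns = [("id", "integer"), ("name", "text"), ("created_at", "timestamp")]
sql_columns = [("created_at", "timestamp"), ("id", "integer"), ("full_name", "text")]

def format_columns(cols):
    # Sorting happens here, in the message only -- not in the actual check
    return ", ".join(f"{name} {dtype}" for name, dtype in sorted(cols))

msg = (
    "Please ensure the name, data_type, and number of columns in your `yml` file "
    "match the columns in your SQL file.\n"
    f"Schema File Columns: {format_columns(yaml_columns)}\n"
    f"SQL File Columns: {format_columns(sql_columns)}"
)
print(msg)
```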
I don't see how a `ModelContract` would be different from `model.columns`. Also, this code has the same issue as the earlier jinja code: it only checks `dtype`, not the formatted column. Would we also have to implement a `format_column` method in the python adapters?
Currently, these two are really the same thing, but I could see a definition of `ModelContract` diverging from `model.columns` in Mike's proposal.

I agree with this - we'd need to add an overridable `format_column` method.
This all might be a good idea, but it's totally out of scope for this ticket as it was defined. If you want us to close this one and wait until the other implementation is defined, let me know.
100% agree! The scope for this issue was to update the behaviour of the contract checks + constraint ddl generation to be order-agnostic, not a broader rewrite of the contract check. I think that applies if we decide to go down the route of introducing a more generalized `compare_data_types` macro.
I'm thinking through what this might look like across all potential adapters, and whether we need to add contract-related items in the future. With that context, I have the following questions:

- Do you think this could benefit from becoming a "dispatch" method? The goal of this would be to create a hard divide between contracted models and non-contracted models.
- Do we need to include the column select statement in `default__get_select_subquery` if we already validated with `get_assert_columns_equivalent`? We validate that the number of columns is the same and the names are the same, so I think the subquery already limits to just the columns that we want.
- It looks like `get_assert_columns_equivalent` might have been renamed `assert_columns_equivalent`.
With my assumptions (not necessarily true of course):

```jinja
{% macro default__create_table_as(temporary, relation, sql) -%}
    {% if config.get('contract', False) %}
        {{ default__create_table_as_with_contract(temporary, relation, sql) }}
    {% else %}
        {{ default__create_table_as_without_contract(temporary, relation, sql) }}
    {% endif %}
{% endmacro %}

{% macro default__create_table_as_with_contract(temporary, relation, sql) %}
    {{ get_assert_columns_equivalent(sql) }}
    create {% if temporary -%}temporary{%- endif %} table
        {{ relation.include(database=(not temporary), schema=(not temporary)) }}
        {{ get_columns_spec_ddl() }}
    as ({{ sql }})
{% endmacro %}

{% macro default__create_table_as_without_contract(temporary, relation, sql) %}
    create {% if temporary -%}temporary{%- endif %} table
        {{ relation.include(database=(not temporary), schema=(not temporary)) }}
    as ({{ sql }})
{% endmacro %}
```

This would maintain backwards compatibility because we're keeping the `create_table_as` macro, which is what would have been overridden. And if we need to update only one of these in the future, it isolates the updates instead of impacting all "create table" workflows. I'm open to feedback; I'm just trying to communicate a thought here. Please let me know what you think of these recommendations.
We need `get_select_subquery` in order to put the columns in the right order. We removed validating the order because we added the subquery. As far as splitting the macro into two, probably @jtcohen6 and @MichelleArk should weigh in on that.
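To illustrate why the subquery matters (a pure sketch: dbt's actual `get_select_subquery` is a jinja macro, not this Python function), reordering is done by selecting columns in the contract's order around the model SQL:

```python
# Hypothetical sketch of what a column-reordering subquery does: wrap the
# model SQL and select columns in the order the contract declares them.
def get_select_subquery(sql: str, contract_column_names: list) -> str:
    column_list = ", ".join(contract_column_names)
    return f"select {column_list} from ({sql}) as model_subq"

query = get_select_subquery(
    "select 2 as b, 1 as a",   # the model SQL emits columns b, a
    ["a", "b"],                # the contract declares a, b
)
print(query)
# The created table's columns then follow the contract order, so the
# equivalence check no longer needs to care about order.
```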
resolves #6975, #7064
Description
Modify the `create_table_as` ddl to use the order specified in the model config "columns". Modify `assert_columns_equivalent` to not check column order.

Checklist

- I have run `changie new` to create a changelog entry