Specify table schemas before creating models for extra validation #1438
Hey @jdreaver - cool idea! We had an issue to generate table definitions from schema.yml specs back in the day -- check it out over here: #1016 There's also a discussion over here about database-level constraints that I think might be relevant. Ultimately, this is something that I think would solve some real problems for people, but it would be super involved to build and support. Here's the way that I could see a workflow like this being supported in dbt:
So, you might have a schema.yml file like:
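For illustration, a hypothetical spec might look something like this (the model name, columns, and the `type:` key are made up — dbt does not support a `type:` field on columns today; it's what this workflow would add):

```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        type: integer        # hypothetical: declared type for the column
      - name: order_total
        type: float
```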
Then in a materialization, you could loop over the list of provided columns and types to build the table's DDL, roughly like the sketch below.
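A rough sketch of that loop (this is not dbt's actual materialization API — the macro name and the assumption that each column entry exposes `name` and `type` attributes are illustrative):

```sql
{% macro build_create_table_sql(relation, columns) %}
    -- Render a CREATE TABLE statement from the columns declared in schema.yml
    create table {{ relation }} (
        {% for col in columns %}
        {{ col.name }} {{ col.type }}{{ "," if not loop.last else "" }}
        {% endfor %}
    )
{% endmacro %}
```

A custom materialization could run this first, then insert the model's `select` into the freshly created table.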
So, there is a tractable approach here, but there's still a question in my mind around whether or not we should even support workflows like this in dbt. It's pretty antithetical to the "everything is a select" design decision at the core of dbt, and I dislike how it fragments logic across model SQL and schema.yml specs.

I can tell you up front, I'm not super interested in supporting column constraints from dbt. If they come naturally as a part of some grander feature, then I'm ok with them, but I remain unconvinced of their utility in analytical settings. In particular, I dislike that they 1) are typically unenforced (or unsupported) on many analytical databases and 2) move constraint checking from test-time to build-time. With dbt's schema tests, we're adding functionality to set a "severity" level (either WARN or ERROR) for test failures. In the future, we might also add a minimum threshold before which tests will warn, and not fail. This is useful in cases where a small number of failing records is acceptable. So, column constraints are a sort of blunt instrument made even blunter by the fact that databases like Redshift and Snowflake don't generally enforce them. Moreover, Redshift in particular can return incorrect results if a column constraint is applied but invalid!

I do however find merit in enumerating and validating all of the columns in a model. I think that enumerating column names and types would be most useful when there's some external constraint, like when in-app reporting is happening on top of a given table, for instance. This is distinct from internal analytics use cases, wherein the cost of having an extra column documented for your model is pretty low (and doesn't justify enumerating every column and type in every model). The last thing to say here is that this problem may be better suited for something like a linter than a fundamental redesign in how dbt builds models, IMO.

Ok, finally, dbt supports custom schema tests, and you can very well assert type information for tables if you desire! You can check out the docs on custom schema tests for more information about making one of these yourself (there's a rough sketch of one at the end of this comment).

So, this is a sort of brain dump of all the things in my head regarding this topic. I think it's something we may tackle in the future, but I think the utility is still a little nebulous, at least to me. If you have opinions on any of these things, I'd love to hear them! Thanks for writing in with a thoughtful and thought-provoking feature request :)
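To make that suggestion concrete, a hypothetical custom schema test asserting a column's type might look roughly like this (the macro name, the `information_schema` query, and the exact test interface are assumptions — they vary by dbt version and warehouse, so check the docs before copying):

```sql
-- macros/test_column_type.sql
-- Sketch of a custom schema test: the query should yield 0 when the test passes.
{% macro test_column_type(model, column_name, type) %}

    select case when count(*) = 1 then 0 else 1 end as failures
    from information_schema.columns
    where table_schema = '{{ model.schema }}'
      and table_name   = '{{ model.name }}'
      and column_name  = '{{ column_name }}'
      and lower(data_type) = lower('{{ type }}')
      -- Note: identifier casing differs across warehouses (e.g. Snowflake
      -- upper-cases unquoted names), so this comparison may need adjusting.

{% endmacro %}
```

It would then be attached to a column in schema.yml under `tests:`, e.g. `- column_type: {type: integer}`.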
Hey @drewbanin! Thanks for the thorough response! Everything you've said makes total sense. I'm glad you have a guiding principle in the design of dbt, and I agree with your points about simply validating the list of columns in schema.yml. I'm going to take a look at custom schema tests to see if I can get those to do what I want. If I come up with something valuable I'll post back to this issue so others can see my solution. Thanks again for writing all of this up! I'm glad you broke this down by use case instead of by a single potential solution like I did. It gives me a lot to work with.
Cool! I'm going to close this issue in favor of something more actionable if/when we decide to prioritize some version of this. Definitely don't hesitate to follow up here if you have any questions or want to share further thoughts on this topic :) Thanks again!
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Feature
Feature description
In `dbt`, tests and `schema.yml` files are a powerful way to validate assumptions about models. However, tests can be limiting:

- There is no easy way to check column types. For example, a column may be intended to be an `integer`, but a floating point operation coerces that column to `float`.
- A column name in `schema.yml` that doesn't match the model doesn't seem to throw an error when building a model or its docs. For example, if my model produces a column `foo` and I have a column with `- name: foo123` in `schema.yml`, then the model and documentation generate fine without noticing the error.

I propose allowing `dbt` to use `schema.yml` schemas to run `CREATE TABLE foo(<columns>)` plus `INSERT INTO foo SELECT` instead of `CREATE TABLE AS SELECT ...`. (I'm experienced with postgres, Redshift, and Snowflake. Apologies if other backends work differently.)
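As a concrete illustration of the difference (table and column names here are made up, and the exact SQL dbt generates varies by adapter):

```sql
-- How a table model is built today, roughly: a CTAS, so column names,
-- types, and nullability are whatever the SELECT happens to produce.
create table analytics.orders as
select order_id, order_total
from raw.orders;

-- The proposed alternative: declare the schema from schema.yml first,
-- then load it. A type mismatch or violated constraint fails the build.
create table analytics.orders (
    order_id    integer not null,
    order_total float
);

insert into analytics.orders
select order_id, order_total
from raw.orders;
```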
This provides the following benefits:

- We can declare constraints such as `NOT NULL`. The database will cheaply validate them for us, and fail to create a model if validation fails. In particular, `postgres` can also check `UNIQUE` easily (it creates an index for you under the hood, but you just have to say something is `UNIQUE`).

A user could decide to use
`CREATE TABLE`/`INSERT` instead of `CREATE TABLE AS` via a boolean flag somewhere for a model. Maybe this flag could be set in `schema.yml`, since a schema is required for this to work? I'm still a `dbt` novice, so I do not have much of an informed opinion here 😄
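Purely as a hypothetical sketch of where such a flag could live — this is not an existing dbt config, and the key names are invented:

```yaml
version: 2

models:
  - name: orders
    create_from_schema: true   # invented flag: use CREATE TABLE + INSERT
    columns:
      - name: order_id
        type: integer          # invented: declared type, checked at build time
      - name: order_total
        type: float
```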
Who will this benefit?
Anyone who uses `dbt` and wants extra validation in their table schemas would benefit. This feature gives more immediate validation of model schemas and leverages the features in our databases to do this cheaply.