
Specify table schemas before creating models for extra validation #1438

Closed
jdreaver opened this issue May 2, 2019 · 4 comments
Labels: stale (Issues that have gone stale)

Comments

jdreaver commented May 2, 2019

Feature

Feature description

In dbt, tests and schema.yml files are a powerful way to validate assumptions about models. However, tests can be limiting:

  • You can only run tests after a model has been built (see "Run tests after model creation, persisting updates only if the tests pass", #1054).
  • Tests do not contain information about the expected types of columns. For example, I could expect integer, but a floating point operation coerces that column to float.
  • Specifying an extra column in schema.yml doesn't seem to throw an error when building a model or its docs. For example, if my model produces a column foo and I have a column with - name: foo123 in schema.yml, then the model and documentation generate fine without noticing the error.

I propose allowing dbt to use schema.yml schemas to run CREATE TABLE foo(<columns>) plus INSERT INTO foo SELECT instead of CREATE TABLE AS SELECT ... (I'm experienced with postgres, Redshift, and Snowflake. Apologies if other backends work differently.)
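Roughly, the difference looks like this (a minimal Postgres-flavored sketch; the table and column names here are made up for illustration):

-- What a table materialization effectively runs today:
create table analytics.orders as
select order_id, customer_id, amount
from analytics.stg_orders;

-- Proposed alternative: declare the schema (from schema.yml) up front,
-- then populate it. The load fails if types or constraints don't match.
create table analytics.orders (
    order_id    bigint  not null,
    customer_id bigint  not null,
    amount      numeric not null
);

insert into analytics.orders
select order_id, customer_id, amount
from analytics.stg_orders;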

This provides the following benefits:

  • Users can provide column constraints that are cheap to check, like NOT NULL. The database validates them for us and fails to create the model if validation fails. Postgres can also enforce UNIQUE cheaply (it creates an index under the hood; you only have to declare the column UNIQUE).
  • Users can specify column types, which again will get checked as the model is populated.
  • Users can assert that the list of columns in the table schema matches what the model is producing: no extra or missing columns.
  • Some backends like Redshift allow specifying constraints that aren't enforced, like foreign keys. These constraints do help Redshift optimize queries, though.

A user could decide to use CREATE TABLE/INSERT instead of CREATE TABLE AS via a boolean flag somewhere for a model. Maybe this flag could be set in schema.yml, since a schema is required for this to work? I'm still a dbt novice, so I do not have much of an informed opinion here 😄

Who will this benefit?

Anyone who uses dbt and wants extra validation in their table schemas would benefit. This feature gives more immediate validation of model schemas and leverages the features in our databases to do this cheaply.

@drewbanin
Contributor

Hey @jdreaver - cool idea! We had an issue to generate table definitions from schema.yml specs back in the day -- check it out over here: #1016

There's also a discussion over here about database-level constraints that I think might be relevant.

Ultimately, this is something that I think would solve some real problems for people, but it would be super involved to build and support. Here's the way that I could see a workflow like this being supported in dbt:

  1. Augment schema.yml to accept arbitrary configs, then provide these configs in the materialization context
  2. Create new materializations that build empty tables, then insert into them

So, you might have a schema.yml file like:

version: 2

models:
  - name: my_model
    columns:
      - name: id
        config:
          type: int
          encode: zstd # Redshift only
          primary_key: true

Then in a materialization, you could loop over the list of provided columns and build a table from DDL, then insert into it from your model SQL. If dbt were to support a workflow like this, I imagine that this is how we'd do it!
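As a very rough sketch of what that loop might render (not working code; it assumes the column configs above are exposed to the materialization as a list called column_specs, which is a hypothetical name, and it leans on the existing this and sql context variables):

{# Sketch only: column_specs is a hypothetical list built from the schema.yml #}
{# column configs; `this` is the target relation, `sql` is the compiled model #}
{# SELECT statement.                                                          #}

create table {{ this }} (
  {% for col in column_specs %}
    {{ col.name }} {{ col.config.type }}{{ "," if not loop.last }}
  {% endfor %}
);

insert into {{ this }}
{{ sql }};

The materialization would run the rendered DDL first and then the insert, instead of a single create table as.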

So, there is a tractable approach here, but there's still a question in my mind around whether or not we should even support workflows like this in dbt. It's pretty antithetical to the "everything is a select" design decision at the core of dbt, and I dislike how it fragments logic across model SQL and schema.yml specs.

I can tell you up front, I'm not super interested in supporting column constraints from dbt. If they come naturally as a part of some grander feature, then I'm ok with them, but I remain unconvinced of their utility in analytical settings. In particular, I dislike that they 1) are typically unenforced (or unsupported) on many analytical databases and 2) move constraint checking from test-time to build-time.

With dbt's schema tests, we're adding functionality to set a "severity" level (either WARN or ERROR) for test failures. In the future, we might also add a minimum failure threshold below which tests warn rather than fail. This is useful in relationships tests where one source table might be loaded before another by some ETL job. If 10 rows out of 10 million are missing a corresponding parent record for 10 minutes, I don't think that should (usually) prevent your models from building!

So, column constraints are a sort of blunt instrument made even blunter by the fact that databases like redshift and snowflake don't generally enforce them. Moreover, Redshift in particular can return incorrect results if a column constraint is applied but invalid!

I do however find merit in enumerating and validating all of the columns in a schema.yml file. I think that in 90% of models, this process is probably unnecessary: the benefit of keeping all of your model logic in a single .sql file outweighs the downside of synchronizing the schema and logic in two different files. If you're really keen to specify a column type in a model, you can always do:

select
  id::bigint as id
from {{ ref('my_model') }}

I think that enumerating column names and types would be most useful when there's some external constraint, like when in-app reporting is happening on top of a given table, for instance. This is distinct from internal analytics use cases, wherein the cost of having an extra column documented for your model is pretty low (and doesn't justify enumerating every column and type in every model). The last thing to say here is that this problem may be better suited for something like a linter than a fundamental redesign in how dbt models IMO.

Ok, finally, dbt supports custom schema tests, and you can very well assert type information for tables if you desire! You can check out the docs on custom schema tests for more information about making one of these yourself.
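As one possible shape for such a test (a sketch only: the macro name and the type argument are invented, information_schema access varies across warehouses, and the attribute access on model is an assumption):

{% macro test_column_type(model, column_name, type) %}

    -- A schema test passes when this query returns 0.
    -- Counts a failure when the column's catalog type differs from the
    -- declared type. `model.schema` / `model.identifier` are assumed here.
    select count(*)
    from information_schema.columns
    where table_schema = '{{ model.schema }}'
      and table_name   = '{{ model.identifier }}'
      and column_name  = '{{ column_name }}'
      and lower(data_type) != lower('{{ type }}')

{% endmacro %}

A test along these lines could then be attached to a column in schema.yml like any other schema test.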

So, this is a sort of brain dump of all the things in my head regarding this topic. I think it's something we may tackle in the future, but I think the utility is still a little nebulous, at least to me. If you have opinions on any of these things, I'd love to hear them! Thanks for writing in with a thoughtful and thought-provoking feature request :)


jdreaver commented May 6, 2019

Hey @drewbanin! Thanks for the thorough response!

Everything you've said makes total sense. I'm glad you have a guiding principle in the design of dbt (everything is a SELECT), and I wouldn't want to break that for a small use case like this.

I agree that simply validating the list of columns in schema.yml against what is actually produced is probably good enough. My goal here is to avoid documentation drifting from models. That is, if we say a model has columns x, y, and z, but we add a column w or remove column y, I'd love it if something yelled loudly at me when I ran dbt.

I'm going to take a look at custom schema tests to see if I can get those to do what I want. If I come up with something valuable I'll post back to this issue so others can see my solution.

Thanks again for writing all of this up! I'm glad you broke this down by use case instead of by a single potential solution like I did. It gives me a lot to work with.

@drewbanin
Contributor

Cool! I'm going to close this issue in favor of something more actionable if/when we decide to prioritize some version of this. Definitely don't hesitate to follow up here if you have any questions or want to share further thoughts on this topic :)

Thanks again!

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label Dec 25, 2021
github-actions bot closed this as completed Jan 1, 2022