docs+tests directly in a model #2084

dubravcik · 2020-01-31T11:07:27Z

Describe the feature

When I started to learn dbt, I thought that data documentation is part of a model. Which is true, once it is published on the docs server. But when talking about source code, documentation is separated to schema.yml which is not intuitive for me. I would expect to have an option to write the docs close to the code. I can imagine a part of the config for this :

Who will this benefit?

When one is making changes in a model, he can see the docs and moreover edit it immediately + he sees names of columns produced in the model.

Thanks :)

The text was updated successfully, but these errors were encountered:

drewbanin · 2020-01-31T15:49:42Z

Thanks for opening this issue @dubravcik! I do like the idea of colocating the docs for a model along with its definition, but I have reservations about how well this scheme will work in practice.

What if you wanted to add column-level tests in here too - would you want to encode all of that information in the model config? And what if there are dozens and dozens of columns?

I think that cramming all of that information into a single config block might be pretty noisy in practice! Curious to hear what you think.

dubravcik · 2020-01-31T20:14:20Z

Theoretically, I'd like to move the yaml inside a model, but it is not possible afaik. We have to write in jinja expression in python, i.e. dictionary. Or is there any other option?

If it has to be a dictionary, can we have multiple config expressions? Then it just about preference between the syntax of python or yaml. If we put the docs and tests closer to the output I think it is not so noisy.

{{
  config(
    materialized = "table"
  )
}}

WITH cte AS (
  ....
)
SELECT a.[SK_Account_Master]
      ,[Account Name] = [NAME_AccountName]
      ,[Account Level] = [NAME_AccountLevel]
      ,[Account Type] = [NAME_AccountType]
      ,[Is Testing Account] = [FLAG_IsTestingAccount]
      ,[Partner Revenue] = [AMT_PartnerRevenue_cur]
      ,[Partner Revenue LFY] = [AMT_PartnerRevenue_LFY_cur]
      ,[Partner Revenue TFY] = [AMT_PartnerRevenue_TFY_cur]
      ,[Partner Revenue  Y] = [AMT_PartnerRevenue_12M_cur]
      ,a.[SK_User_Master_Inserted]
      ,a.[SK_User_Master_Modified]
      ,a.[SK_Account_Master_Parent]
      ,a.[SK_Account_Master_SuperParent]
      ,a.[SK_Account_Master_Partner]
      ,a.[SK_Partners_Partner_Master]
      ,a.[SK_Geography_Master]
      ,a.[SK_Currency_Master]
      ,a.[SK_User_Master_TSM]
      ,a.[SK_User_Master_AccountManager]
      ,a.[SK_User_Master_AccountExecutive]
      ,a.[SK_User_Master_CSM]
      ,a.[SK_Campaign_Master_SourceCampaign]
      ,[DTIME Inserted] = a.[DTIME_Inserted]
      ,[DATE Inserted] = a.[SK_Date_Inserted]
      ,[DTIME Modified] = a.[DTIME_Modified]
      ,[DATE Modified] = a.[SK_Date_Modified]
 FROM {{ source('account') }} a
  LEFT JOIN cte ...

{{
  config(
    description = "account is company"
    columns = [
      {
        "name": "SK_Account_Master",
        "description": "primary key of Account",
        "tests": ["unique", "not_null"]
      },
      { 
        "name": "Account name", 
        "description": "business name" },
      { 
        "name": "Account Level",
        "description": "basic advanced pro" 
      },
      { 
        "name": "Account Type", 
        "description": "partner or client" 
      },
      {
        "name": "SK_User_Master_Inserted",
        "description": "User who created the Account",
        "tests": { "relationships": { "to": "ref('user')", "field": "id" } }
      }
    ]
    tests = [
      { "accepted_values": {"column_name": "[account type]", "values": ['partner', 'client']}},
      { "unique": {"column_name": "concat([account name], [account type])"}}
    ]
  )
}}

I don't say this is exactly how I want it and I would use stright away, just suggestion for discussion :)

jrandrews · 2020-01-31T21:50:13Z

Honestly I like having it separate. git-based version control systems are more "friendly" with multiple smaller files than they are with fewer larger ones. More, smaller files makes it easier to eyeball quickly where changes have taken place and also makes merge conflicts less likely.

dubravcik · 2020-07-01T16:57:13Z

I am coming with another idea. I don't know whether it is a problem for others, but for me the problem is that the documentation

is in another file
have to create the file and also structure in the file manually (sure, there are some helpers Feature: Command to auto-generate schema.yml files #1082 )
have to keep it in sync when the model is changed (probably the biggest pain)
doesn't really document the code itself

The idea is to have the documentation inside the model using comments. Other programming languages use comments for documentation, so it could work as well.

For example a comment --docs < column > < description > could be parsed and used for documentation. The position of such comment would not be important as it holds the column name, so it would work also in long models with many CTEs. It could be another option to document model and could be overridden by schema.yml
What do you think @drewbanin @jrandrews ?

SELECT
       --docs full_name full name of customer
       full_name
       --docs country country where customer is located based on IP location
       ,country
       --docs browser_language language set in customer browser
       ,browser_language
       --docs date_created_at date customer created
       --this is an unrelated commented to docs
       ,date_created_at
       --docs date_last_login date customer logged in        
       ,date_last_login
  FROM analytics.customer

I haven't thought about column level tests yet.

jtcohen6 · 2020-12-04T18:47:35Z

I'm thinking about this in the same way I'm thinking about #2401, which is on our 1.0 to-do list, and could be essentially summarized as: "Reconcile node configs and resource properties, where possible and it makes sense."

Today, it's not totally clear what can be defined in one vs. the other—or both (tags). This is a big point of confusion for new dbt users. Of course, we don't want to create more confusion in the process: We'll need to come up with ways to reconcile config-properties that are set hierarchically in .yml files with the ones set in config() blocks. I think in-file config() should always win, but I could also see wanting to raise a warning if there's conflicting information set at the exact same level (e.g. the same column described twice).

Personally, I find myself agreeing with @jrandrews's point above. I think that, by and large, we'll want to maintain the functional distinction that exists today, recast as a convention:

Configure the behavior of a model's execution with config() or dbt_project.yml
Describe a model's properties in models/*.yml

tomsej · 2021-02-01T15:31:08Z

We had a similar problem and wanted to post an issue. The worst thing for us was that we often forgot to edit the schema file when we added / removed columns. But I think having a description in sql can lead to a lot of commits without changing business logic, let alone merge conflicts. We have created a tool for us that will allow us to check the differences between schema and sql comments + much more. Maybe it can help you - https://github.com/offbi/pre-commit-dbt

dubravcik added enhancement New feature or request triage labels Jan 31, 2020

drewbanin added discussion and removed triage labels Jan 31, 2020

dbt-labs locked and limited conversation to collaborators Apr 19, 2022

jtcohen6 converted this issue into discussion #5093 Apr 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

docs+tests directly in a model #2084

docs+tests directly in a model #2084

dubravcik commented Jan 31, 2020

drewbanin commented Jan 31, 2020

dubravcik commented Jan 31, 2020 •

edited

Loading

jrandrews commented Jan 31, 2020

dubravcik commented Jul 1, 2020 •

edited

Loading

jtcohen6 commented Dec 4, 2020

tomsej commented Feb 1, 2021

This issue was moved to a discussion.

This issue was moved to a discussion.

docs+tests directly in a model #2084

docs+tests directly in a model #2084

Comments

dubravcik commented Jan 31, 2020

Describe the feature

Who will this benefit?

drewbanin commented Jan 31, 2020

dubravcik commented Jan 31, 2020 • edited Loading

jrandrews commented Jan 31, 2020

dubravcik commented Jul 1, 2020 • edited Loading

jtcohen6 commented Dec 4, 2020

tomsej commented Feb 1, 2021

This issue was moved to a discussion.

dubravcik commented Jan 31, 2020 •

edited

Loading

dubravcik commented Jul 1, 2020 •

edited

Loading