Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs+tests directly in a model #2084

Closed
dubravcik opened this issue Jan 31, 2020 · 6 comments
Closed

docs+tests directly in a model #2084

dubravcik opened this issue Jan 31, 2020 · 6 comments
Labels
discussion enhancement New feature or request

Comments

@dubravcik
Copy link

Describe the feature

When I started to learn dbt, I thought that data documentation is part of a model. Which is true, once it is published on the docs server. But when talking about source code, documentation is separated to schema.yml which is not intuitive for me. I would expect to have an option to write the docs close to the code. I can imagine a part of the config for this :
image

Who will this benefit?

When one is making changes in a model, he can see the docs and moreover edit it immediately + he sees names of columns produced in the model.

Thanks :)

@dubravcik dubravcik added enhancement New feature or request triage labels Jan 31, 2020
@drewbanin
Copy link
Contributor

Thanks for opening this issue @dubravcik! I do like the idea of colocating the docs for a model along with its definition, but I have reservations about how well this scheme will work in practice.

What if you wanted to add column-level tests in here too - would you want to encode all of that information in the model config? And what if there are dozens and dozens of columns?

I think that cramming all of that information into a single config block might be pretty noisy in practice! Curious to hear what you think.

@dubravcik
Copy link
Author

dubravcik commented Jan 31, 2020

Theoretically, I'd like to move the yaml inside a model, but it is not possible afaik. We have to write in jinja expression in python, i.e. dictionary. Or is there any other option?

If it has to be a dictionary, can we have multiple config expressions? Then it just about preference between the syntax of python or yaml. If we put the docs and tests closer to the output I think it is not so noisy.

{{
  config(
    materialized = "table"
  )
}}

WITH cte AS (
  ....
)
SELECT a.[SK_Account_Master]
      ,[Account Name] = [NAME_AccountName]
      ,[Account Level] = [NAME_AccountLevel]
      ,[Account Type] = [NAME_AccountType]
      ,[Is Testing Account] = [FLAG_IsTestingAccount]
      ,[Partner Revenue] = [AMT_PartnerRevenue_cur]
      ,[Partner Revenue LFY] = [AMT_PartnerRevenue_LFY_cur]
      ,[Partner Revenue TFY] = [AMT_PartnerRevenue_TFY_cur]
      ,[Partner Revenue  Y] = [AMT_PartnerRevenue_12M_cur]
      ,a.[SK_User_Master_Inserted]
      ,a.[SK_User_Master_Modified]
      ,a.[SK_Account_Master_Parent]
      ,a.[SK_Account_Master_SuperParent]
      ,a.[SK_Account_Master_Partner]
      ,a.[SK_Partners_Partner_Master]
      ,a.[SK_Geography_Master]
      ,a.[SK_Currency_Master]
      ,a.[SK_User_Master_TSM]
      ,a.[SK_User_Master_AccountManager]
      ,a.[SK_User_Master_AccountExecutive]
      ,a.[SK_User_Master_CSM]
      ,a.[SK_Campaign_Master_SourceCampaign]
      ,[DTIME Inserted] = a.[DTIME_Inserted]
      ,[DATE Inserted] = a.[SK_Date_Inserted]
      ,[DTIME Modified] = a.[DTIME_Modified]
      ,[DATE Modified] = a.[SK_Date_Modified]
 FROM {{ source('account') }} a
  LEFT JOIN cte ...

{{
  config(
    description = "account is company"
    columns = [
      {
        "name": "SK_Account_Master",
        "description": "primary key of Account",
        "tests": ["unique", "not_null"]
      },
      { 
        "name": "Account name", 
        "description": "business name" },
      { 
        "name": "Account Level",
        "description": "basic advanced pro" 
      },
      { 
        "name": "Account Type", 
        "description": "partner or client" 
      },
      {
        "name": "SK_User_Master_Inserted",
        "description": "User who created the Account",
        "tests": { "relationships": { "to": "ref('user')", "field": "id" } }
      }
    ]
    tests = [
      { "accepted_values": {"column_name": "[account type]", "values": ['partner', 'client']}},
      { "unique": {"column_name": "concat([account name], [account type])"}}
    ]
  )
}}

I don't say this is exactly how I want it and I would use stright away, just suggestion for discussion :)

@jrandrews
Copy link

Honestly I like having it separate. git-based version control systems are more "friendly" with multiple smaller files than they are with fewer larger ones. More, smaller files makes it easier to eyeball quickly where changes have taken place and also makes merge conflicts less likely.

@dubravcik
Copy link
Author

dubravcik commented Jul 1, 2020

I am coming with another idea. I don't know whether it is a problem for others, but for me the problem is that the documentation

  • is in another file
  • have to create the file and also structure in the file manually (sure, there are some helpers Feature: Command to auto-generate schema.yml files #1082 )
  • have to keep it in sync when the model is changed (probably the biggest pain)
  • doesn't really document the code itself

The idea is to have the documentation inside the model using comments. Other programming languages use comments for documentation, so it could work as well.

For example a comment --docs < column > < description > could be parsed and used for documentation. The position of such comment would not be important as it holds the column name, so it would work also in long models with many CTEs. It could be another option to document model and could be overridden by schema.yml
What do you think @drewbanin @jrandrews ?

SELECT
       --docs full_name full name of customer
       full_name
       --docs country country where customer is located based on IP location
       ,country
       --docs browser_language language set in customer browser
       ,browser_language
       --docs date_created_at date customer created
       --this is an unrelated commented to docs
       ,date_created_at
       --docs date_last_login date customer logged in        
       ,date_last_login
  FROM analytics.customer

I haven't thought about column level tests yet.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 4, 2020

I'm thinking about this in the same way I'm thinking about #2401, which is on our 1.0 to-do list, and could be essentially summarized as: "Reconcile node configs and resource properties, where possible and it makes sense."

Today, it's not totally clear what can be defined in one vs. the other—or both (tags). This is a big point of confusion for new dbt users. Of course, we don't want to create more confusion in the process: We'll need to come up with ways to reconcile config-properties that are set hierarchically in .yml files with the ones set in config() blocks. I think in-file config() should always win, but I could also see wanting to raise a warning if there's conflicting information set at the exact same level (e.g. the same column described twice).

Personally, I find myself agreeing with @jrandrews's point above. I think that, by and large, we'll want to maintain the functional distinction that exists today, recast as a convention:

  • Configure the behavior of a model's execution with config() or dbt_project.yml
  • Describe a model's properties in models/*.yml

@tomsej
Copy link

tomsej commented Feb 1, 2021

We had a similar problem and wanted to post an issue. The worst thing for us was that we often forgot to edit the schema file when we added / removed columns. But I think having a description in sql can lead to a lot of commits without changing business logic, let alone merge conflicts. We have created a tool for us that will allow us to check the differences between schema and sql comments + much more. Maybe it can help you - https://github.com/offbi/pre-commit-dbt

@dbt-labs dbt-labs locked and limited conversation to collaborators Apr 19, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #5093 Apr 19, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants