
[Feature] dbt should know about Attributes (foundation for Metrics/Dimensions) #4090

Closed
1 task done
aaronsteers opened this issue Oct 18, 2021 · 3 comments
Labels
discussion enhancement New feature or request

Comments

@aaronsteers
Contributor

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Following on from #4071, I called out in my comment that I believe metrics would be better established on top of "attributes" rather than built directly on dbt's "columns" and "models" abstractions.

As a foundation for deeper metadata understanding within dbt, and to unify the documentation effort for existing dbt projects, we should first establish some type of ontological definition of what columns "mean" in an analytical framing.

Proposal

  1. We add top-level configurability for "attributes" within dbt core.
    1. An attribute maps to one or more columns in one or more tables (models), and it defines the nature of the column in an ontological manner.
    2. An attribute may have an "attribute type", such as dimension_key, dimension_property, fact, etc.
    3. An attribute definition can define how a table interacts with the time dimension, which is needed for certain metric-related applications. (For instance, if a table has an attribute as_of_date marked as the primary temporal attribute for the table, this will inform how metrics calculations can be performed - and at what grain they are possible.)
    4. An attribute may have a text description, which will appear in the docs on any tables (models) where it is referenced.
    5. Attributes may take part in an attribute hierarchy (day->month->year, city->state->country, etc.)
    6. Users may use dbt docs (or other tools) to find references to tables. (I want to see all the tables that contain the "sales_revenue" attribute.)
  2. We provide easy ways to auto-map attributes to columns. For instance, for an attribute called sales_revenue, we could set an auto_map_by_name: true property to find and map all references to the column sales_revenue, or we could explicitly map to models and column references (see the extended sample under Sample Code below).

Benefits

  1. Over time, best practices will emerge around column naming that reduce the work needed for attribute mapping, which in turn improves the readability of the project overall. (For instance, this discourages letting "sales_region" hold an integer ID on one table while "sales_region" on another table holds the region's name as a string.)
  2. We can build metrics ([Feature] dbt should know about metrics #4071) more easily on top of attributes. For example, "customers per region" can be more easily defined once "customer_id", "number of customers", "region_id", and "number of regions" are already known and clearly defined attributes.

Sample Code

Adapted from my comment on the related metrics topic.

attributes:
- name: customer_id
  label: Customer ID
  attribute_type: dimension_key
  column_names:
  - customer_id
  models:
  # as much or as little constraint to get ontologically the right column mappings
  - marts.*
  - "!marts.the_weird_one"  # exclusion; quoted so "!" is not parsed as a YAML tag
- name: customer_count
  label: Customer Count
  description: A pre-aggregated count of customers.
  attribute_type: entity_count
  column_names:
  - customer_id
  models:
  # everywhere we see this column in the `marts` folder, map it to this attribute
  - marts.*
- name: country
  label: Country Name
  attribute_type: dimension_key
  models:
  # everywhere we see this column in the `marts` folder, map it to this attribute
  - marts.*
  column_names:
  - country
  - country_name
- name: sales_revenue
  label: Sales Revenue
  attribute_type: amount
  column_names:
  - sales_revenue
  models:
  - marts.*
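
The auto-mapping and hierarchy ideas from the proposal above could be expressed along the following lines. This is purely illustrative; property names like auto_map_by_name, hierarchy, and parent are placeholders for discussion, not settled syntax.

- name: sales_region
  label: Sales Region
  attribute_type: dimension_key
  # hypothetical: map every column named `sales_region`, in any model, to this attribute
  auto_map_by_name: true
- name: city
  label: City
  attribute_type: dimension_key
  auto_map_by_name: true
  # hypothetical: declare where this attribute sits in a city -> state -> country hierarchy
  hierarchy:
    parent: state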

Describe alternatives you've considered

Metrics: The alternative for metrics is to create mappings directly over all tables and columns. The more tables there are at different aggregation levels (and/or projections for query optimization), the more redundant those metric mappings become.

Documentation: This actually stems from another inquiry I ran into a couple of years ago: how to document all columns in all models without having to put the same text description on every single instance. (And then, how to keep them up to date when you want to change how "sales revenue" is calculated on all of them.) As far as I'm aware, there isn't yet a good solution for this documentation problem, so my general guidance has been "don't worry about descriptions on columns" - it's just too much work, with no single source of truth to keep them in sync across tables.

Adding attributes would hopefully change that, since in the many places the column exists, it would always carry (or link to) the same text description for users of the project.

Who will this benefit?

This would benefit teams who want column-level descriptions for themselves and their users.

This benefits users, because they get better documentation on columns and can better understand the equivalency (or lack thereof) of similarly named columns across a project.

Are you interested in contributing this feature?

Sure!

Anything else?

Prior art

Inspired by OLAP platforms and BI layers which support ROLAP capabilities:

  • MicroStrategy (called "facts" there)
  • Mondrian ("attributes" / "dimensions")
  • SSAS ("attributes" / "properties" / "dimensions")
@tnightengale

At my current company I have a script that generates yaml for a table which has been created via dbt run and adds in any existing descriptions that match the column name from the latest local manifest.json. The ethos is: if a column is called the same thing between models then it means the same thing.

I feel like if we want to solve consistent documentation, we should perhaps consider a new built-in CLI command? Something that could generate docs with existing descriptions, point out inconsistent definitions, and perhaps edit the yml directly with any updates.

I'm leaning that way because I think attributes as they are proposed above are sort of trying to be too many things: are we solving metric definitions or consistent docs? Let's be explicit and narrow in on the problem we want to solve 😄

@david-kubecka

I would like to offer another PoV from GoodData (I'm an employee there), a company that has implemented its own ROLAP engine (historically inspired by MicroStrategy). Like the issue author, we (and our customers) see great value in building metrics on top of an abstraction over the physical DB (instead of directly over DB columns). We would like to offer our long-term experience in this area to help shape the universal semantic layer design. Let me briefly outline our approach.

What you outline here is basically what we at GoodData call a Logical Data Model (LDM). The LDM consists of datasets (similar to the outputs of dbt models, or simply DB tables/views), which in turn consist of attributes (dimension_key in your proposal) and facts (an umbrella term for numeric, aggregatable values - like amount or entity_count). So what you call attributes here we would probably label as LDM objects (if there is ever a need for an umbrella term).

Once an LDM is set up, end users can then define their metrics using facts and attributes via a custom language called MAQL. A typical simple metric could look like:

SUM(Price) WHERE Product = "Shoe"

Such a metric could then be used in many different reports, in each of them broken down by any attributes "compatible" with the Price fact, e.g. Product, Customer, Campaign. (For more info about LDM you can read the intro docs).

I hope that our approach can be of some inspiration in your proposed design, which I generally like quite a lot, especially the overall direction of your thinking. I may be biased, though :-) A couple of questions:

  • I know this is just a very early design, but do you have an idea about the possible values of attribute_type? Also, in your example you have dimension_key and dimension_property. What use case do you see behind distinguishing between those with respect to the metric layer?
  • You propose the ability to define a hierarchy of attributes. How would that work with respect to models? Must the hierarchy always be within a single model, or could it cross model boundaries?
  • Regarding hierarchies again: what use case do you see for them? At GoodData we indeed utilize this concept quite heavily, e.g. for reusable ratio metrics, which can be broken down exactly along the hierarchy corresponding to the denominator attribute. I've not seen this much elsewhere, though, so I'm interested in your ideas.

Anyway, thanks for a great kick-off!

@jtcohen6 jtcohen6 self-assigned this Jan 25, 2022
@jtcohen6 jtcohen6 removed the triage label Jan 25, 2022
@jtcohen6
Contributor

Big thank you @aaronsteers for the clear and thorough write-up, and @tnightengale @david-kubecka for the thoughtful replies!

I'm way overdue for responding to this one, in part because it sent me down a flurry of mental paths, and it's taken me a while to gather my thoughts together into anything serviceable. I believe the move into defining metrics in dbt-core necessitates a move from defining how models are created to defining how models ought to be queried. This integrates a lot of different thoughts that are still half-baked—so read the following as a mix of inspiration and guiding principle, without expectation of any concrete vision or immediate action.

Semantically meaningful column properties ought to be inherited, and not re-typed every single time. That inheritance can run in two ways:

  • Via configuration, with consistent project-level baselines, plus clear overrides where appropriate (!marts.the_weird_one in your example). This would work similarly to how functional model configuration takes place in dbt today: very hierarchical, very top-down, very YAML (see the sketch after this list). To Teghan's point, there's a degree of enforced consistency that we want here. When two columns have the same name, they should represent the same thing (data type, concept, ...). I would be happy for dbt-core or companion tooling to enforce that expectation, in the same way we can enforce expectations around test + documentation coverage. Making it possible to configure these things, DRY-ly, in all the places users expect, makes it more reasonable, in turn, to expect them to be applied everywhere.
  • Via lineage: Today, models implicitly inherit all column values/types/etc from upstream models—but that lineage is not known to dbt, or stated anywhere explicitly. This is a big ask from folks who find themselves typing the same descriptions, tests, and meta properties over and over again (Doc (and potentially, Test) Inheritance #2995). I don't believe we're going to be able to do the good version of this without first building capabilities for inferring column-level lineage (Column level lineage #4458) and extracting as much useful information as possible out of the SQL that folks are already writing.
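
To make the configuration path concrete, here is a purely hypothetical, dbt_project.yml-style sketch (none of this syntax exists in dbt today; the structure just mirrors how model configs cascade top-down):

# hypothetical: column properties configured once at the folder level, with an explicit opt-out
columns:
  marts:
    sales_revenue:
      +description: "{{ doc('sales_revenue') }}"   # reuse one docs block everywhere
      +tests: [not_null]
  marts.the_weird_one:
    sales_revenue:
      +enabled: false   # the "!marts.the_weird_one" override case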

The separation/combination I described above is something I want to pursue, and it's something we can do with existing constructs—models and columns—plus a few new capabilities. So, do we need another construct here? What does a notion of an attribute get us?

I've become increasingly convinced that there's real value in defining column "types." The closest thing to column "types" in dbt today is a mix of descriptions (achieved with reusable docs blocks) and generic tests (bear with me). Those tests imply a certain set of commonly understood properties—uniqueness, never null, only certain values, never negative, never overlapping—and those implied properties are, to human eyes, semantically meaningful attributes.

So, imagine a revenue "type", which defines:

  • database column type (int)
  • tags, meta properties
  • standard set of metrics, appropriate to this column type (e.g. total_revenue), including behavior around timestamps
  • set of tests/expectations — never null, never more than a too-big number
  • standard set of grants, policy tags, column masking behavior, ...?
  • standard description — could it be templated and take arguments?
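
As a purely hypothetical sketch (no column_types construct exists in dbt today), a revenue "type" along those lines might be declared something like:

column_types:
  - name: revenue
    data_type: int                     # the database column type
    tags: [finance]
    meta:
      pii: false
    tests:
      - not_null
      - accepted_range:                # dbt_utils-style expectation: never negative, never more than a too-big number
          min_value: 0
          max_value: 1000000000
    metrics:
      - name: total_revenue            # a standard metric appropriate to this type
        type: sum
    description: "{{ doc('revenue') }}"   # shared description via a reusable docs block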

Then we work toward a synthesized approach whereby:

  • The sales_revenue column is an instance of the revenue type. In the first model where sales_revenue is calculated, we can override or customize each of those type-inherited properties as appropriate / specific to sales (sketched after this list)
  • Every model with sales_revenue in it derives its properties from upstream in the DAG (left-right), via lineage; and/or the consistency of those properties is enforced across the DAG (up-down), via metadata
  • Via metadata, we can surface a centralized glossary (Add the ability to define a terms section or glossary #2860) that combines types, all the columns of that type, and all the models in which each column appears
  • (To put the pie all the way into the sky) dbt could run "static type checks," before runtime, based on metadata/summary queries against source tables and analysis of model SQL
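
Equally hypothetically, the first model that calculates sales_revenue might then declare the column as an instance of that type and override only what is sales-specific:

models:
  - name: fct_sales                     # hypothetical model name
    columns:
      - name: sales_revenue
        type: revenue                   # inherit tests, metrics, and description from the revenue "type"
        description: Revenue recognized at the point of sale   # sales-specific override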

Lots of questions:

  • Should we try to make some types available out of the box, to aid in standardizing? Or trust that the standardization can happen via popular packages?
  • Should "types" be able to inherit from each other? What if we want both revenue and sales_revenue to be types, with the latter as a child type of the former?
  • Should "types" be domain-specific only? Or should they also include functionally meaningful designations, that are less domain-specific—think primary_key + foreign_key—that would give dbt enhanced capabilities around materialization + downstream querying? This is what you propose in the (optional) dimension_key
  • Is this meaningfully different from the original proposal? Or am I proposing a half-baked middle way, a compromise between strong centralization of all important column definitions and the free-for-all that we see today, a center which does not hold?

Ok! This is just the beginning. I'm going to turn this into a discussion, in the hope that more folks join in :)

@dbt-labs dbt-labs locked and limited conversation to collaborators Jan 25, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #4621 Jan 25, 2022
@jtcohen6 jtcohen6 removed their assignment Nov 14, 2022
