Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] dbt should know about metrics #4071

Closed
1 task done
drewbanin opened this issue Oct 15, 2021 · 48 comments · Fixed by #4235
Closed
1 task done

[Feature] dbt should know about metrics #4071

drewbanin opened this issue Oct 15, 2021 · 48 comments · Fixed by #4235
Labels
enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

drewbanin commented Oct 15, 2021

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

dbt should know about metrics. A metric is a timeseries aggregation over a table that supports zero or more dimensions. Some examples of metrics include:

  • active users
  • churn rate
  • mrr (monthly recurring revenue)

dbt should support metric definitions as a new node type. Like exposures, metrics participate in the dbt DAG and can be expressed in yaml files. By defining metrics in dbt projects, analytics engineers can encode crucial business logic in tested, version controlled code. Further, these metrics definitions can be exposed to downstream tooling to drive consistency and precision in metric reporting.

The ecosystem

There is some prior art for defining metrics in dbt projects. In particular, see

While these two implementations differ (measures vs. metrics), more on that below, there exists a need in the community for a first-class way to define these metrics in dbt code. It is really neat to see that some folks have already made these definitions possible with dbt, but it would better if metrics were treated as well-defined nodes with field validation and helpful utilities inside of dbt Core.

Specification

A metric is a timeseries aggregation over a table that supports zero or more dimensions. These metrics can be encoded in schema.yml files. In the example below, a new_customers metric is defined as a count of customer records created in a given time grain.

# models/marts/product/schema.yml

version: 2

models:
 - name: dim_customers
   ...

metrics:
  - name: new_customers
    label: New Customers
    model: dim_customers
    description: "The number of paid customers who are using the product"

    type: count
    sql: user_id # superflous here, but shown as an example

    timestamp: signup_date
    time_grains: [day, week, month]

    dimensions:
      - plan
      - country
    
    filters:
      - field: is_paying
        value: true

    meta: {}

Given this information, a downstream process (or a dbt macro!) can generate a sql SELECT statement that correctly calculates this metric with a specified time grain and set of dimensions. Here is a breakdown of supported fields:

Field Description Example Required?
name A unique identifier for the metric new_customers yes
model The dbt model that powers this metric dim_customers yes
label A short for name / label for the metric New Customers no
description Long form, human-readable description for the metric The number of customers who.... no
type The type of calculation to perform when evaluating a metric count_distinct yes
sql The expression to aggregate/calculate over user_id yes
timestamp The time-based component of the metric signup_date yes
time_grains One or more "grains" at which the metric can be evaluated [day, week, month] yes
dimensions A list of dimensions to group or filter the metric by [plan, country] no
filters A list of filters to apply before calculating the metric See below no
meta Arbitrary key/value store {team: Finance} no

Model reference

A reference to a dbt model. This model may be any "materialized" model, or a reference to an ephemeral model. Direct table references are not allowed, and alternate node types (seeds, snapshots) are not supported.

Metric types

The following metric types should be supported:

  • count
  • count_distinct
  • sum
  • average
  • min
  • max

In the future, alternative metric types (ratios, deltas, etc) should be supported in this model.

Filters

Filters should be defined as a list of dictionaries that define predicates for the metric. Filters are ANDed together. If more complex filtering is required, users can (and should) push that logic down into the underlying model.

filters:
  - field: is_paying
    value: true

Functional requirements

  • Metrics should participate in the dbt DAG as a distinct node type
  • Metric nodes should be accessible in the dbt Core compilation context via:
    • the graph.metrics variable
    • one or more accessor functions like metrics.find_by_name('...') (exact mechanism TBD)
  • Metric nodes should be emitted into the manifest.json artifact
  • Metrics should work with partial parsing
  • Metric nodes should be supported in node selection and should be selectable with the metric: selector
    • When listing nodes, existing graph operators (+, &, etc) should be supported
  • (in a different issue) Metrics should be surfaced in the dbt Docs website

dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis. In this way, it is critical that dbt Core provides hooks into metrics that can be leveraged in both macro code, or by processes that consume dbt Core manifest.json files.

Describe alternatives you've considered

Don't implement metrics as distinct node types and keep encoding them in meta properties:

  • This information is untyped and semantically unrepresented in dbt, so it would be a net-improvement to instead create a first-class node type in dbt Core for these logical DAG nodes

Metrics vs. Measures

Metrics are strongly-typed objects. It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average, or adding two distinct counts together. You get a number back... but it's not a useful or meaningful result.

To that end, I think we should start with metrics instead of measures. The difference here (and maybe a strawperson of my own creation - tell me if you think so) is that measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets. The creation of metrics does not preclude us from teaching dbt about more generic types of aggregations in the future, but I'd prefer to start with a narrow set of functionality and expand over time. It is easy to remove constraints, but it is hard to add them 🙂

Include support for joins

  • Joins make metric calculations really complicated. dbt should absolutely know about foreign key relationships (outside of the existing relationships test) in the future, but this would be a meaningful expansion of scope for our first cut of this feature
  • While these joins would be semantically useful, they are not a blocker to defining metrics today. Join logic can be pushed down into model code (whether materialized or ephemeral). We should experiment with this single table paradigm, see how it feels, and then consider the best approaching for teaching dbt about semantic joins in the future.

Where metrics are defined

Should metrics be a property of a model? While that could be functional today, I think this would make it hard to extend metrics to work with joins (see above). Instead, declaring metrics as independent nodes that participate in the DAG is a more future-proof idea, and we'd probably do well to avoid the "patching" flow required to get schema tests (properties of models today) translated into their own independent nodes in the DAG.

Inheriting configuration from models

Should metrics be namespaced under a model? This would make it possible to define some "shared" properties for all of the metrics derived from a model (eg. valid dimensions, the time field, supported time grain). This would be good for ergonomics, but not a huge value-add IMO. I'd like to keep this simple for the initial implementation and then make decisions like this with some more information from the community around example use-cases.

Example:

metrics:
  - model: dim_customers
    # dimensions are shared for all metrics defined in terms of this model
    dimensions:
      - country
      - plan

    definitions:
      - name: new_customers
      - name: churned_customers

SQL calculations in dimensions

Should dimensions be allowed to provide arbitrary SQL expressions? I don't think so — that SQL is best encoded in model code, and it would be confusing and dissonant to break up dimension definitions across SQL and yaml files.

Example:

metrics:
  - name: dim_customers

    # This logic should be represented in the underlying model
    dimensions:
      - field: plan_type
        sql: case when plan in ('pro', 'extra pro') then 'paid' else 'free' end
        

Who will this benefit?

Analytics engineers
- As with models, AEs will be able to define metric logic under version control. By colocating model and metric code, new metrics or changes to existing metrics can be made in a tested, versioned, documented, code reviewed environment
- Further, dbt Core's built in lineage can surface information about how changes to an upstream model may impact a downstream metric

BI/Analytics tooling (and therein, data consumers)
- Organizations use metrics to understand performance and make decisions. To that end, the correctness and precision of these metrics is really paramount! By defining metrics rigorously under version control, and then exposing their definitions globally, dbt Core can help ensure consistency in reporting.

The data ecosystem
- There are so many tools, both existing and yet to be created, that can benefit from an open source mechanism for defining a semantic model on top of the data warehouse. I believe that this business logic is just too valuable and strategically important for end-users to be locked up in proprietary tooling. To that end, this feature, and future types of semantic logic like this, should be addressable in an open source way

Are you interested in contributing this feature?

I sure am :)

Anything else?

From around the internet:

@drewbanin drewbanin added enhancement New feature or request triage labels Oct 15, 2021
@jtcohen6 jtcohen6 removed the triage label Oct 15, 2021
@sisu-callum
Copy link

@drewbanin what is the default null behavior for dimensions? I'll confess I'm thinking about this from Sisu's perspective but would the behavior be the equivalent of a * for all model fields or would it just not have any dimensions?

For our platform, we'd probably want all (or a majority of the columns) so I'm wondering if users would have to manually list them all out. Honestly I can see the perspective of requiring definition for every column and some sort of * so I'm wondering what your thoughts on the subject are.

@willweld
Copy link

willweld commented Oct 15, 2021

Hey @drewbanin! Incoming novel for ya ;)

Additional Metric Properties:

  • format: the presentation format of the metric (e.g. $, %)
  • precision: the number of decimal points to round to
  • metric_source: the source platform for the metric
  • is_source_of_truth: boolean indicating if this metric comes from the source of truth
  • pop_increase: indicates whether a period over period increase is good or bad
  • alert_threshold: if the metric value is above/below a certain number for a time frame and/or if the metric goes +/- X% PoP

re: dimensionality defined at the metric level
Metrics are often consumed at multiple levels of dimensionality (e.g. revenue all-up, revenue by state, revenue by customer status, etc.)

If we're including dimensions as part of the metric definition, then the metric property should be dimension_sets:
dimension_sets:

  • [] (empty array --> no dimensions)
  • [plan] --> metric by plan only
  • [country] --> metric by country only
  • [plan, country] --> metric by plan, country

But this raises the question -- should dimensions be defined at the metric level??
There's a difference between how a metric is defined vs. how a metric is consumed. Part of me feels like the metrics should be defined at 0 dimension granularity. When you start applying different sets of dimensionality to a metric, those are ways in which the metric is consumed, which, can and will be different for different users. Ultimately, what I'm getting at is, if we're building a metrics layer, we should also think about building two additional layers:

  • (1) the insight/answer/widget layer and
  • (2) the report layer

Insight/Answer/Widget Layer:
This layer would be a separate layer that invokes/consumes the metric layer. The metric layer could act as an API and insight/answer/widget layer would essentially pass different dimensions to apply to the metric layer. I suppose this also begs the question if the different date range granularities should be defined in the metric layer or elsewhere. Here's an ugly example:

revenue_by_plan:

  • source_metric: revenue
  • dimensions: [plan]
  • time_periods: [7d, 30d, 90d]

revenue_by_state:

  • source_metric: revenue
  • dimensions: [state]
  • time_periods: [7d, 30d, 90d]

revenue_by_plan_state:

  • source_metric: revenue
  • dimensions: [plan, state]
  • time_periods: [7d, 30d, 90d]

Report Layer:

Consumes the insights/answers/widget layer to include the various metrics/insights. Essentially defines which insights/metrics are part of a report.


Drew: dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis.

Por que no los dos? I think there's a lot of value that can be found here on the pre-caching front, particularly in producing reports that don't require/allow users to query with specific date ranges. For example, I created a macro that populates a table with metrics defined at their various date granularities, such that I have a table like:

  • date
  • metric_value (for this date)
  • metric_value_7d (from this date)
  • metric_value_30d (from this date)
  • metric_value_60d (from this date)

Here's an actual example using a barebones metric layer I defined:

{% macro get_windowed_data() %}
    {% set 
        metric_dict = [
            {
                'column': 'transaction_amount',
                'alias': 'Revenue',
                'method': 'sum',
                'precision': 2,
                'format': '$'           
            },
            {
                'column': 'transaction_id',
                'alias': 'Transactions',
                'method': 'count distinct',
                'precision': 0,
                'format': '#'
            },
            {
                'column': 'customer_id',
                'alias': 'Customers',
                'method': 'count distinct',
                'precision': 0,
                'format': '#'
            }
        ]
    %}

    {% set date_field = 'transaction_date' %}
    {% set date_periods = [7, 30, 60] %}

    SELECT 
    {% for period in date_periods %}    
        {% for metric in metric_dict %}
            {%- if metric.method == 'count distinct' -%}
            , ROUND(COUNT(DISTINCT {{metric.column}}) OVER (ORDER BY {{date_field}} ROWS BETWEEN {{period - 1}} PRECEDING AND CURRENT ROW), {{metric.precision}}) AS {{metric.alias}}_{{period}}d
            {%- else -%}
            , ROUND({{metric.method}}({{metric.column}}), {{metric.precision}}) AS {{metric.alias}}
            {% endif %}
        {% endfor %}
    {% endfor %}
    FROM
        {{ref('mart_customer_transactions')}}
    GROUP BY
        1
    ORDER BY
        1 DESC
{% endmacro %}

Drew: Should dimensions be allowed to provide arbitrary SQL expressions? I don't think so — that SQL is best encoded in model code, and it would be confusing and dissonant to break up dimension definitions across SQL and yaml files.

Should they be allowed? I think so! Should this be the best practice? Absolutely not. Here's why I think it might be a good idea to allow this:

  • I hope this statement resonates well with you but, to me, dbt should be about flexibility and enabling people to do things in ways that work for them and their teams, even if it's not considered the right/best practice way
  • The metrics layer is a new layer in the analytics stack and, with the advent of ALL THE THINGS (to be concise) in this area of practice, the scope of the role of an analytics engineer is growing a little bit out of control. Additionally, you have analytics engineers that are different shades of purple depending on where they are on the red vs. blue spectrum. I think the metrics layer provides an opportunity for those who are less technically inclined to get more involved, particularly in enabling them to do faster data exploration and analysis by tweaking definitions at a metric level rather than the underlying data level.

@owlas
Copy link

owlas commented Oct 15, 2021

This is awesome! And completely agree that it'll be a huge boon to the community and tooling to move toward a standard. It's something we've been really excited about at Lightdash.

I think the proposal looks like a great direction. The metric definitions look familiar to the community too: adding to your list of comparisons for the implementation:

Joins

Agreed these can get confusing. Fortunately, they're also becoming less important, we're seeing teams move to big wide tables (where all models are prejoined for a given analysis). So seems like a reasonable omission

Time-dimension requirement (aka metrics vs measures)

I was interested to see that a time-dimension is a requirement, this excludes a lot of useful aggregates, for example:

  • Which user has purchased the most products? (metric: count purchases, dimension: user id)
  • What was the clickthrough rate on our most recent campaign (metric: click % rate, dimension: campaign name)

Does this relate to your dimension of metrics vs measures:

measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets

I found this quote a bit confusing 🤔. Not sure if I understood correctly, is the difference that a metric always has a time dimension and evaluates to a float? (e.g revenue over time, user count over time), hence giving a more rigorous "type"? Whereas a measure is any aggregation giving any type (date, string, timestamp, number)?

Metrics on models

There is definitely some benefit earned because for analysis tables there tends to be a subset of columns that are destined to be dimensions and some destined to become metrics in the semantic layer. The ability to just go through all columns and tag them with dimension / metric-type can make it really easy to bootstap metrics from a dbt model. Just as @sisu-callum mentioned, there are productivity gains to be had.

While joins aren't supported, metrics only map to a single model so I think there's a strong argument to keep them on models. However, if dbt needs to retain flexility to define metrics across models it would make more sense to define them on exposures / a new resource.

SQL implementation

One noticeable omission here is the implementation of how these metrics definitions compile to SQL. I think it would be even more powerful if the metrics definitions were paired with an expected SQL output for each target warehouse. Is that in scope for the project?

It would be troublesome for example if I have my centralised metric definitions but downstream tools interpret those metrics differently. I imagine something like a reference implementation of the complied SQL statements coming out of metrics (e.g. a dbt macro / package).

Or perhaps the idea here (as with dbt models) is that it doesn't take an opinion on how a metric definition becomes SQL and people are free to essentially build their own metrics layers using a common set of metrics definitions.

Last one...

  • Heck no to complex sql definitions for dimensions, this definitely duplicates the holy work of dbt models.

@owlas
Copy link

owlas commented Oct 15, 2021

@willweld

I think these properties:

format: the presentation format of the metric (e.g. $, %)
precision: the number of decimal points to round to

Seem more related to visualisation and I think would be strong candidates for the meta tag. So if a tool like Lightdash wants to allow users to mark up formats, then we can leverage the meta tag. This has proven a great pattern so far.

Thanks @drewbanin for including it!

The meta is where the party is 🎉

@willweld
Copy link

@owlas

Yeah absolutely agree with you. And the other things like pop_increase: [good/bad] could (should) be stored in the meta as well. As long as these attributes do exist and I can parse them out in dbt, I'm a happy camper :)

@tayloramurphy
Copy link

I would echo @owlas's points about the Time-Demension requirement. It feels overly explicit for no clear benefit.

Include support for joins

or

SQL implementation from @owlas

Many moons ago at Meltano we implemented a SQL:1999 compliant algorithm for properly handling JOINs and generating SQL. We call these "Hyper Dimensional Aggregates". The original issue documenting it is in https://gitlab.com/meltano/meltano/-/issues/286 and the SQL:1999 compliance part is in https://gitlab.com/meltano/meltano/-/issues/344

This code is all MIT licensed and could most likely be ported (famous last words) over to dbt if that makes sense. Perhaps @drewbanin you're thinking of having a similar adapter model to SQL generation that dbt has for everything else (a la https://docs.getdbt.com/docs/available-adapters#dbt-labs-supported)

Maybe this discussion is too much for the specification issue, but I wanted to call it out and put that bug in people's brains 🙂

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 15, 2021

UPDATE: I've logged a separate issue for adding attributes here: #4090

"Metrics on Columns" vs "Metrics on Attributes"

First of all, I completely agree with the value here and general direction, and I agree with much of the great points in the above thread. I do however think we are missing a in important abstraction layer between "tables/columns" and "metrics". A column can only have a data type but an attribute can have a much deeper attribute type. A column is present in a single table, but an attribute has projections throughout the project.

So, I'd propose something along the lines adding both attributes and metrics together, or with attributes arriving slightly before:

  1. Adding support for Attribute metadata
    1. let models provide columns with an attribute_name property (or similar)
    2. let attribute definitions be declared along with attribute type at the top of the project (as sibling to metrics)
    3. let models (and folders of models) optionally auto-map column names to attribute names
  2. Adding support for metrics to be defined on top of attributes:
    1. The attribute type drives which types of metrics can be built, can be statically checked for soundness. (Ex: sum and avg are invalid over a dimension_key attribute.)
    2. The metric may then be mapped to specific models (implicitly or explicitly) and can be statically checked for soundness. (Ex: average_customer_spend is invalid if we don't have both customer_count and total_spend.)
    3. Each metric can have multiple calculation methods, depending on which attributes are available. (Sometimes this will rely on a combination of "fact" attributes and "dimension" attributes.)
    4. Metrics can build on other metrics. (Ex: average_customer_spend could be defined on top of 2 attributes, such as customer_count and total_spend, or on top of 2 metrics with the same names.)

Other notes:

  • Attributes can also have parent-child relationships, such as city->state->region and day->month->(quarter)->year
  • Attributes of dimension_key type would naturally seed distinct counts.
  • Attributes can have default rendering, sorting, and slicing capabilities in the UI.
  • Attributes allow project developers to give an explicit equivalency between columns of the same or similar names, such as two tables that both have sales_region or customer_id.
  • The word attribute is intentionally ambiguous as to whether it drives metrics or dimensions, by design. As alluded to above, customer_id is an attribute that drives both dimensions and metrics, and sales_revenue can similarly be the driver of the sales_revenue metrics as well as as bucketization dimension called sales_tier.

Related conversation here in the Lightdash repo, as we recently discussed swapping the terms dimensions for attributes there as well: lightdash/lightdash#457

Example with Column->Attribute->Metric mappings

Expanding the above with a specific strawman example:

attributes:
- name: customer_id
  label: Customer ID
  attribute_type: dimension_key
  column_names:
  - customer_id
  models:
  # as much or as little constraint to get ontologically the right column mappings
  - marts.*
  - !marts.the_weird_one
- name: customer_count
  label: Customer Count
  description: A pre-aggregated count of customers.
  attribute_type: entity_count
  column_names:
  - customer_id
  models:
  # everywhere we see this column in the `marts` folder, map it to this attribute
  - marts.*
- name: country
  label: Country Name
  attribute_type: dimension_key
  models:
  # everywhere we see this column in the `marts` folder, map it to this attribute
  - marts.*
  column_names:
  - country
  - country_name
- name: sales_revenue
  label: Sales Revenue
  attribute_type: amount
  column_names:
  - sales_revenue
  models:
  - marts.*
metrics:
- name: Number of Customers
  calculations:
  - expression: count(distinct {{ attribute('customer_id') }})
    default: true
    models:
    - dim_customers
  - expression: sum({{ attribute('customer_count') }})
    models:
    - fact_customers
    - fact_customers_by_region
- name: Number of Countries
  calculations:
  # `attribute('country')` resolves to either `country` or `country_name`, depending on the model
  - expression: count(distinct {{ attribute('country') }})
    default: true
    models:
    - dim_customers
    - dim_countries
- name: Sales Revenue
  calculations:
  - expression: sum({{ attribute('sales_revenue') }})
    models:
    # Anywhere the attribute is available
    - *
- name: Average Customer Revenue
  calculations:
  - expression: {{ metric('Sales Revenue') }} / {{ metric('Number of Customers') }}
    # Anywhere both metrics are available
    models:
    - *
- name: Predicted Year-End Revenue
  # This calc is built in the BI layer and is not able to be defined in dbt.
  # Instead of `expression`, we provide `depends_on` as a list of attributes.
  depends_on:
  - customer_favorite_color
  - customer_join_date
  - customer_ytd_sales
  models:
  - marts.predictions.fancy_ai_table
- name: Super Fancy AI Prediction
  # This calc is built in the BI layer and is not able to be defined in dbt.
  # Instead of `expression`, we provide `depends_on` as a list of attributes.
  depends_on:
  - customer_favorite_color
  - customer_join_date
  models:
  - marts.predictions.fancy_ai_table

@miike
Copy link

miike commented Oct 15, 2021

Seem more related to visualisation and I think would be strong candidates for the meta tag.

Agreed @owlas format / precision are a good fit for meta as they seem like presentation attributes rather than core to the metric. That said I'd love to see units as a first class citizen as this is something critical to the metric definition (kilometers / meters, seconds / milliseconds etc).

Not sure if I understood correctly, is the difference that a metric always has a time dimension and evaluates to a float?

Not sure I fully grasped this either. I think there are certainly instances (and some you've mentioned) where a metric (or measure?) may not necessarily evaluate to a float, but may still be quite useful (e.g., the mode of a categorical variable).

@georgewfraser
Copy link

georgewfraser commented Oct 16, 2021

I would love for dbt to support metrics, mostly so that we could include metric definitions in Fivetran dbt packages. I take issue with your definition of metrics though: metrics can exist without time series. For example, if we have a database of species, “countif(kingdom = ‘animalia’)” is a metric which counts the number of animal species. This isn’t just a technical issue: there are lots of business metrics which are time-series in principle but in practice we choose to calculate at just the present time, because it’s simpler.

@buremba
Copy link
Contributor

buremba commented Oct 16, 2021

Great to see that dbt will support metric definitions! The proposal looks great and we're specifically interested in exposing these metrics to the downstream tools including SQL clients, BI tools, and data science tools. There are a couple of limitations to the proposed features:

dimensions

Unfortunately, most of the BI tools don't support this feature so most of the BI tools might be missing this information. We can ideally throw a user-friendly exception in the implementation as a workaround for now. I see the use case and am happy to implement it in our spec in case dbt has this feature but I'm not sure if it's a common use case and I believe that the name might be confusing for some users. I would suggest using something like allowed_dimensions or so.

time_grains and timestamp

Not all the metrics are time-series metrics and I believe that both of the properties should be optional. If time_grains is defined and timestamp is not, we can use all the available time frames, that's what we're doing at Metriql. I'm in favor of defining them under models rather than explicitly for metrics even though it's not flexible compared to the suggested approach. Most BI tools support time-series datasets but it's not at the metrics level and requiring the users to define them for each metric might be verbose in most cases. What about having a mapping property (example in Metriql) for dbt resources (models, sources, seeds) so that we can define them at dataset level and provide an option for the users to override it for individual metrics?

Features

filters

It looks good and we already support filters. I see two different implementations for the filters though:

  1. Having an additional property such as operator to define the filter operation. (Metriql follow this approach)
  2. Or let the user define the value and figure out the operation with some special characters such as -. (Looker example)

Even though we follow 1, we're happy to implement 2 for better readability but we should probably think more about the coverage and user experience.

BI-tool specific properties under meta

Each BI tool has its own naming and semantic types and we believe that they should also be defined in YML files. It's one of the most used features of Metriql (see example) but it's hard to standardize so they can fit under meta property I guess.

approx_distinct aggregation type

Modern cloud data warehouses support approximation already and it's the preferred approach for big datasets. It's also additive unlike count_distinct and I believe that we should support both aggregation types. We can either have something like approximate of Looker and use approximation if the aggregation type is count_distinct but we're in favor of adding as an additional aggregation type for convenience since some of the databases don't support it.

Join relations

I believe that metric stores should support joins but I see the reasoning here. We aim to help analytics engineers automate creating dbt models for dimensional modeling and denormalizing the datasets into wide tables might be confusing for some use-cases. We can extend the standard in our implementation and denormalize the dimensions & metrics when exposing them to the downstream tools as we have customers using join relations at the moment.

@alexanderdean
Copy link

Supportive, this would mean Snowplow dbt packages can have the metrics built-in. As package authors who already have multiple data stores/SQL flavors to support 💸, it would be great to see market standardization at the metrics definition layer 💯!

@carlineng
Copy link

Some things that popped into my mind while reading this --

  • Will the metric node type always be a leaf node? I can imagine wanting to materialize a specific cut of a metric as a model, or define another metric such as a ratio that uses two metrics as inputs.
  • I agree with previous comments that viz/format related properties shouldn't necessarily be top-level keys.
  • I believe that implementing some kind of * functionality to include all model fields as dimensions would require dbt to know quite a bit more about models than it currently does. From my understanding, dbt doesn't actually know much about the DB table underlying a model outside of what is listed in the schema.yml file. Resolving * to a list of actual columns would require compiling the query itself. There are also many cases where columns don't make sense as dimensions, e.g., a "total_revenue" column.
  • I'm interested to see more about the interface for querying metrics.

@joellabes
Copy link
Contributor

🎩 Written without my dbt Labs employee hat on, not that it changes anything 🎩

dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis. In this way, it is critical that dbt Core provides hooks into metrics that can be leveraged in both macro code, or by processes that consume dbt Core manifest.json files.

This seems to imply outsourcing the SQL generation that actually calculates the metric to each tool. Does that mean that (best case) the engineers at $BI tool 1 might write more efficient SQL than $BI tool 2 or (worst case) $BI tool 1 might generate results that disagree with the same metric as rendered in $BI tool 2?

Or does

packages

above imply a reference implementation that vendors deviate from at their own risk?

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 18, 2021

dbt Core should not, itself, evaluate or calculate metrics

I think this goes (only slightly) too far. How about this instead?:

dbt Core may or may not be able to evaluate metrics, depending upon the nature of the metric.

For simple things like a "Number of Customers" or "Sales Revenue" metric, unless we intentionally sabotage the metric definition to be overly vague and non-prescriptive, there are only artificial barriers preventing dbt (or a dbt macro/extension/fork) from providing basic calculation results - and more valuably, metric-based unit tests.

Rather than say "dbt Core should not" evaluate metrics, I think it's preferrable to just keep with "dbt Core may or may not be able to evaluate any given metric".

In my suggestions above, I give examples of metrics of three types:

  1. Metrics which are basically aliasing sql columns, but with a specific aggregation function applied. (Sum(num_customers), sum(sales_revenue), etc.
    • This first tier may and probably should be testable by dbt (eventually).
  2. Metrics which cannot meaningfully be stored in tabular format, especially ratios, because they reference a complex correlation and are non-additive in nature. (Avg. Revenue per customer, Avg. Purchase Price, Distinct Customer Count, etc.)
    • This second tier, also, may be calculated by dbt but not meaningfully stored in a tabular format. If these are to be tested, it may only by at specific dimension levels, such as "for each sales region", or "total level", or "for the month of July 2021"
  3. Metrics which are "black box" in nature due to algorithmic complexity. (Examples: "Predicted Year-End Revenue", "Churn Risk Score", etc.)
    • This third tier is completely opaque to dbt, and has no meaningful declaration except "this metric exists and it depends on x, y, and z"

@jpmont
Copy link

jpmont commented Oct 18, 2021

$BI tool 1 might generate results that disagree with the same metric as rendered in $BI tool 2?

Regarding tests for metrics, it could still be valuable to have expectations defined in dbt (even if not evaluated in dbt) and used as a reference for the BI tools to guarantee correctness.

@owlas
Copy link

owlas commented Oct 18, 2021

@jpmont exactly. This is how we have always imagined a metric standard, the combination of:

  1. A standard metrics definition (as yaml/json)
  2. A reference implementation of a metric -> SQL interpreter
  3. A test suite or benchmark for proprietary implementations of the interpreter (2) to ensure consistency across BI tools

Alternatively I like @aaronsteers point, which I took to be, if the metrics definitions are simple enough such that the SQL mapping is trivial, then we could reduce errors.

For simple things like a "Number of Customers" or "Sales Revenue" metric, unless we intentionally sabotage the metric definition to be overly vague and non-prescriptive, there are only artificial barriers preventing dbt (or a dbt macro/extension/fork) from providing basic calculation results - and more valuably, metric-based unit tests.

@buremba
Copy link
Contributor

buremba commented Oct 18, 2021

🎩 Written without my dbt Labs employee hat on, not that it changes anything 🎩

dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis. In this way, it is critical that dbt Core provides hooks into metrics that can be leveraged in both macro code, or by processes that consume dbt Core manifest.json files.

This seems to imply outsourcing the SQL generation that actually calculates the metric to each tool. Does that mean that (best case) the engineers at $BI tool 1 might write more efficient SQL than $BI tool 2 or (worst case) $BI tool 1 might generate results that disagree with the same metric as rendered in $BI tool 2?

Or does

packages

above imply a reference implementation that vendors deviate from at their own risk?

I believe that synchronizing the metrics into downstream data tools is not that easy without a middleware layer. All the BI tools that support metrics out of the box has their own way to define the metrics and often there is no one-to-one mapping to SQL expressions:

  1. PowerBI has DAX.
  2. Tableau has calculations.
  3. Metabase has expressions. (dbt-metabase has a pending PR for similar use-case though)
  4. GDS has custom expressions.

Our approach was to build a middleware layer in between the BI tools and the dbt adapter (data warehouse) because we couldn't find a way to support all the measure types by compiling SQL to native metric language for each BI tool. The BI tools see both metrics & dimensions as database columns and dbt resources (seeds, sources, and models) as database tables with Metriql's MQL feature. We use the Trino interface so all the BI tools that support Trino / Prestodb also support Metriql out of the box. Since the databases don't implement metrics, we also maintain an additional integration (example for Superset) for mapping metrics, formatting, etc. in a native way.

That way, the tools can just SQL queries similar to this one:

select o_clerk, total_orders FROM "ref('orders')"

and the engine compiles it to the following SQL query in an ad-hoc way:

SELECT o_clerk, "total_orders" AS "total_orders" FROM (
SELECT 
    "model_my_new_project_orders"."o_clerk" AS "o_clerk",
    count(*) AS "total_orders"
FROM "orders" AS "model_my_new_project_orders"

    GROUP BY
    1 
) AS "model_my_new_project_orders"

If dbt standardizes the metric definitions, the BI vendors and metric stores can just integrate dbt via its Metadata API and generate the SQL in the serving layer. That way, dbt can handle documentation, testing, etc. for the metrics and the data tools can build different interfaces for querying the metrics and just have their own SQL compiler for metric queries.

@fabrice-etanchaud
Copy link

Hi all !

Following @aaronsteers 's comments concerning attributes,

Julian Hyde (now working at looker) did a great job with mondrian, the open source ROLAP engine, trying to bind the relational and analytical worlds, and ended up with cube specifications which are imho worth the reading :

Mondrian Schemas

Anyway, it would be very cool to automatically generate a mondrian schema ontop the future dbt's metrics and dimensions.

@buremba
Copy link
Contributor

buremba commented Oct 18, 2021

@fabrice-etanchaud that's a great idea! I contacted Julian a while ago for a call for help but it looks like it might be a conflict of interest since he's working for Looker. The idea was to parse MDX queries with Olap4j of Mondrian and compile it to SQL for the underlying data warehouse without any caching / storage mechanism, unlike Mondrian. We can generate the dimensional tables as dbt incremental / table models instead of using an additional caching mechanism and reference these roll-up tables in the compiled SQL query.

We have pending work but MDX seems to be much more complex than we imagined. I would love to collaborate if there's any interest.

@fabrice-etanchaud
Copy link

@fabrice-etanchaud that's a great idea! I contacted Julian a while ago for a call for help but it looks like it might be a conflict of interest since he's working for Looker. The idea was to parse MDX queries with Olap4j of Mondrian and compile it to SQL for the underlying data warehouse without any caching / storage mechanism, unlike Mondrian. We can generate the dimensional tables as dbt incremental / table models instead of using an additional caching mechanism and reference these roll-up tables in the compiled SQL query.

We have pending work but MDX seems to be much more complex than we imagined. I would love to collaborate if there's any interest.

Hi @buremba , Glad it makes sense to you ! You are right MDX has (for me) a huge learning curve, and I usually rely on tools like Saiku to build queries.

Did you have a look at dremio ? You can leverage its reflections (kind of materialized lattices, in calcite's world) to seamlessly accelerate requests without setting up mondrian's aggregate tables. dremio can even pre-compute metrics based on the dimensions distinct values.

@buremba
Copy link
Contributor

buremba commented Oct 18, 2021

@fabrice-etanchaud I didn't know about dremio's reflections but it looks like they store the data in external storage rather than in the same data warehouse, which is the main point of dbt. The idea is to create dbt models and utilize them rather than moving data from your data warehouse to an OLAP engine. I believe that it's more dbt-native and it aligns with the modern data stack since it's we will be pushing all the work into the data warehouse. dbt and the metrics layers can potentially replace the OLAP solutions with less overhead, I wrote an opinionated blog post about this topic recently.

@carlineng
Copy link

For simple things like a "Number of Customers" or "Sales Revenue" metric, unless we intentionally sabotage the metric definition to be overly vague and non-prescriptive, there are only artificial barriers preventing dbt (or a dbt macro/extension/fork) from providing basic calculation results - and more valuably, metric-based unit tests.

Rather than say "dbt Core should not" evaluate metrics, I think it's preferrable to just keep with "dbt Core may or may not be able to evaluate any given metric".

Metrics are always generated with respect to a particular time period and dimension set, so in order for dbt to evaluate arbitrary metrics, it would need a server component that can listen for requests, compile the request to SQL, and return the SQL. Right now, dbt Core itself does not have a long-lived server component that can respond to requests like these.

I like the idea of allowing dbt Core to materialize a static parameterization of a metric as a model, e.g., specify "Sales Revenue by day, by territory". If someone is interested in running unit tests on a metric, they probably are trying to do so over a well-understood parameterization of that metric, and won't need dynamic cuts.

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Oct 18, 2021

@fabrice-etanchaud I didn't know about dremio's reflections but it looks like they store the data in external storage rather than in the same data warehouse, which is the main point of dbt. The idea is to create dbt models and utilize them rather than moving data from your data warehouse to an OLAP engine. I believe that it's more dbt-native and it aligns with the modern data stack since it's we will be pushing all the work into the data warehouse. dbt and the metrics layers can potentially replace the OLAP solutions with less overhead, I wrote an opinionated blog post about this topic recently.

Hi @buremba , dremio can be used as a datawarehouse (it stores its data in parquet files or even in iceberg format ). As it connects to your enterprise sources, there is little if no need for pipelines. You are right, if you already have a dwh, dremio would add an extra layer.

I just had a look at Metriql ! In fact it seems you are using trino's federation as a proxy to rewrite queries, just like dremio is acting , great !

Best

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 18, 2021

A couple quick updates on recent threads:

  1. I've opened a new issue to discuss attributes here: [Feature] dbt should know about Attributes (foundation for Metrics/Dimensions) #4090
  2. If attributes define each table's time grain and intersectionality, this makes the metric definition much clearer and more easy to use across multiple aggregate tables.
    • Meaning, timestamp-grain tables can use the same metric definitions as day/month/year aggregates. Only the time-grain attribute needs to vary from table to table - and we can let the BI calculation engine figure out how to handle it.
  3. @buremba and @fabrice-etanchaud (dodging the MDX expression language topic for now 😄 ), I do think attributes gets us much closer to a mondrian-esque metadata or "schema".

While we're not trying to build a ROLAP query or caching engine, we are trying to establish a ROLAP framing - just by nature of ROLAP or "Relational OLAP" being the (somewhat dated now) name for what happens when you calculate analytical metrics directly over tabular sources.

@acossta
Copy link

acossta commented Oct 18, 2021

Love the direction this is going. Two thoughts on the extensibility:

Metric extensibility

You should be able to define a metric in terms of another. This helps keep definitions DRY. A total_sales_germany should be defined on top of total_sales, not as a different metric.

There is a significant benefit for this: the sales metrics will use the same root, so you won't end up in a situation where total_sales_germany includes coupons, but total_sales_italy does not.

You could imagine something like:

metrics:
  - name: total_sales_germany
    type: sum
    parent_metric: total_sales
    dimensions:
      - product
      - plan
    filters:
      - field: country
        value: Germany

The filter field has to be a dimension on the parent metric and the dimensions present in the parent metric.

You could also imagine defining ratios from two existing metrics. You'd be able to visualize the DAG of metrics.

+1 for including an operator in the filter.

Metric type extensibility

The different metric types will need different settings. For example, a count metric does not take any setting, while a count_distinct has to define the field to apply the distinct to. Similarly, the approx_distinct could be a setting of the count_distinct metric.

You could imagine something like:

metrics:
  - name: unique_customers
    type: count_distinct
    settings:
      - distinct_field: customer_id
      - approximate: true
    dimensions:
      - segment

A strongly-typed settings object for each metric type reduces the error surface area significantly.

@drewbanin
Copy link
Contributor Author

I am just loving this thread! Thank you all so much for contributing your thoughts and expertise in here -- I am wondering if I should have started with a public google doc instead of an issue 😂 .This is what I'm hearing and what I'm thinking:

Timestamps should be optional
There are legitimate reasons why users might not want to define a metric with a timeseries component, and we should make the timestamp and time_grains fields optional in the specification. Specifically, timestamp is optional, and time_grains may be provided if a timestamp field is specified.

Behavior for dimensions ✅ ❓

I am picturing the dimensions property as a set of allowable dimensions for a metric. There are cases where some columns in a model might not be suitable or sensible to group or filter by. This information can be used by downstream BI/reporting/analytics tools to present UX to consumers of these metrics. Here's a quick example:

analytics.fct_orders

order_id country ordered_at order_total_usd
1 US 2021-10-23 42
1 CA 2021-10-23 37

Here, country is a valid dimension for a revenue metric, but order_id is not.

@sisu-callum I am open to the idea of syntactic sugar for including all columns from a model as dimensions, but my preference would be to leave this out of the first cut of this implementation. For one, the set of all columns in the underlying model is going to include the column that's being aggregated over (eg. order_total_usd in the example above), so you would probably not actually want all dimensions (unless your metric type is a count, maybe).

@carlineng correctly notes that dbt doesn't know all of the columns in a model a priori, but we can fetch the list of columns defined for a model in a schema.yml file. This feels like a reasonable tradeoff, as we really would want to make sure that the dimensions which power metrics are well-documented and well-tested. Either way, I'd like to leave that consideration out of scope for this ticket, and possibly pursue it in the future if it remains desirable!

@aaronsteers thanks for opening a new issue to discuss attributes -- we'll follow up with you in that ticket! I really like the concept that you're describing and would be excited to discuss it further.

Meta properties

I think we're all aligned here - information around presentation/formatting, units, 📈 / 📉 , and more can live in arbitrary configs of the meta: property of a metric. I can totally imagine folding in some common parameters into the metric definition itself in the future, but I think we should start with a small set of fields in the specification and grow from there.

Metrics defined by other metrics ⚠️

I love the idea of defining metrics in terms of other metrics. This would be important and helpful for a metric like "LTV to CAC", or more broadly: ratios & deltas. I don't have a good answer for how this could work in the near-term, but I think it's a pretty essential piece of the metrics puzzle.

I am less swayed by @acossta's suggestion of "derived" metrics like total_sales --> total_sales_germany. @acossta - what's the benefit of defining two different metrics, rather than a single metric with a country dimension?

You could also imagine defining ratios from two existing metrics. You'd be able to visualize the DAG of metrics.

Just exceptionally with you on this part though :)

Metrics vs macros ✅ ❓

Seeing/hearing some questions/confusion around this line:

measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets

I think @miike calls this out well above -- I do not think that a mode of a categorical variable, or a listagg, or similar, are metrics per se. I think they're valuable, but I don't believe they should be represented in this specification, at least not initially. I'd definitely be interested in finding ways to support these kinds of aggregations in the future, but think that numeric-based aggregations are going to help keep things simple for our first cut of the implementation.

Filters

Ok, I'm with you all - filters should have an operator field. I can tell you that I don't love the idea of enumerating filters in the metric definition and will strongly suggest in the eventual documentation that non-trivial filters are pushed down into an underlying model.

On that note - I am excited to see us all develop best practices for what goes into a model definition vs a metric definition. @jthandy - can I sign you up for v2 of this blog post 😂

Wrap up

I think we're in good shape on all of the topics raised in this thread. If anything feels unresolved to anyone, please feel very encouraged to follow up below! For my part.... next steps here involve putting up a PR :)

@acossta
Copy link

acossta commented Oct 28, 2021

I am interested in developing good heuristics about what ought to be defined in a metric vs. a model. My instinct is that we want thin metric definitions and thick models, but I'm super open to discussion :)

@drewbanin - it could be defined as a model and I agree with you on the thick model, thin metrics mantra. There might be other solutions, but the underlying goal is to be able to show the DAG of metrics (just metrics) to a broader audience (business users) that don't necessarily operate at the model level but do at the metric level and care about the metric lineage.

@mistercrunch
Copy link

mistercrunch commented Oct 28, 2021

Hey, great to see this initiative, and reporting in from the Apache Superset community.

Thinking about DBT's information architecture, and how it is very model-centric, I'm really interested in how we can add metrics (namely as "aggregate expressions" as in SUM(unit_sold) or COUNT(DISTINCT userid)) and their associated semantics (label, description, type) to existing DBT models. Superset, and most BI tools for that matter, have a place to store and expose those, and we'd like to be able to let our users manage those in source-control/DBT when they desire. When your models are defined in DBT, that's clearly where the semantics around that model belong, and we want to enable people to store it there and sync them.

For reference, here's where metrics are defined in the Superset GUI.
Screen Shot 2021-10-27 at 5 32 42 PM

From our perspective a "metric" is an any aggregate expression that is coupled with a given "dataset" (a Superset dataset is essentially the same as a DBT model). This seems to align with the way that MetriQL thinks about them, and in what Lightdash calls "measures" I believe.

Clearly there's a need for this as there's at least 3 projects interested in fitting this under the meta: catchall


I'm also interest in "dimensions" (god isn't that term loaded!?), but here I'd like to call them "virtual columns" as simple valid expressions that people may not want to materialize for whatever reason. Good examples of that may be expressions pointing to key locations in JSON-blobs or MAP type fields (oh aren't they common nowadays!?), or a regex on an email address to extract the domain. There's clearly value in extending a DBT model with an array of valid expression, with pretty labels and markdown-rich descriptions. For us that maps with our concept of "calculated dimensions".


I'd love for DBT to be opinionated enough about these two concept and allow us to enrich models in DBT with things that are relevant not only to BI tools, but to anyone consuming these datasets. Basically declaring a bunch of SQL expressions (aggregate or not) that are super useful and relevant and documented around the model. Yes!

What we don't want is for something like

models:
  - banana_orders
    meta:
      superset:
        metrics:
          {superset's view on how this should be expressed}
      metriql:
        {some slightly different way to express this}
      lightdash:
        measures:
        {some slightly different way to express this}

Now about bringing metrics as a new "super-object" in the DBT information architecture is an interesting idea too, and clearly conceptually metrics do transcend models, but how they do this is intricate to say the least. For me I think I'm thinking about how this idea of "super-metrics" (as a concept that transcend models) relates to the idea of model-specific metrics as I mentioned above.

Also the idea of "cubing" data (as in performing aggregations / summary models) declaratively is interesting to me but a bit out of what I think as core DBT scope. Could cubing logic be expressed less natively and as a reusable macro or macro-like constructs (say a multi-model generator) instead?

# this would spit out many models
cubify(ref('banana_orders'), grouping_sets=[['dim1'], ['dim1', 'dim2']], metrics=['sales'])

I've built a few cubing frameworks in the past and they are intricate more intricate than they may seem at first. The notion of "time granularity" is a rabbit hole, and non-additive metrics can be a significant challenge as sometimes they are additive across certain dimensions only (distinct users is aggregatable across user attributes for instance).


Anyhow, I'll stay away from the cubing conversation and anchor on the first portion of this: "model-centric expressions" is what I came here to advocate for!

One question here: should we decouple the cubing conversation from the one about "model-centric expressions" and start another discussion? To me "model-centric expressions" is probably much simpler, and I'm much more interested in that one. Trying not to hijack the current thread's intent too much...

I also do think that "model-centric expressions" is a VERY natural first step to enable DBT's models to grow towards becoming a [universal] semantic layer.

@fivetran-joemarkiewicz
Copy link
Contributor

This thread is awesome and sparks joy ✨

To echo @georgewfraser point above

I would love for dbt to support metrics, mostly so that we could include metric definitions in Fivetran dbt packages.

Ever since I started seeing the infamous "Metrics Layer" pop up in the community I have been wondering how the Fivetran dbt packages (and dbt packages in general) can support metric definitions, and this feature can make that a reality! I have already been thinking about other specific package use cases for metrics as well and just created a Feature Request detailing some of my thoughts around how metrics can become more versatile within dbt packages. TLDR on the FR: I am wondering how metric definitions could be added or overridden in dbt packages by users.

Super excited to stay up to date on this thread and feature request!

@buremba
Copy link
Contributor

buremba commented Oct 28, 2021

From @mistercrunch:

I also do think that "model-centric expressions" is a VERY natural first step to enable DBT's models to grow towards becoming a [universal] semantic layer.

I agree with this approach because most of the downstream tools don't talk in terms of metrics, they connect to databases so the user flow is usually database -> schema -> table -> dimensions/metrics. Even though it seems more natural for the metrics layer to treat metrics as first-class citizens, I believe that the metrics layer can only make sense if it can offer integrations with all the downstream tools (BI tools, SQL clients, notebooks, etc.) and it's really hard to integrate these tools without model-centric approach.

Our approach is to let users define the cubes as dimension / metric pairs as part of the model definitions and hide them from the downstream tools. We automatically create dbt models from each cube and make use of them by parsing the queries and re-writing them to point the roll-up tables under the hood. The dbt resources (models/seeds/sources) are exposed to BI tools as the semantic layer.

@owlas
Copy link

owlas commented Oct 29, 2021

Thanks @mistercrunch for bringing such clarity! Some questions:

Basically declaring a bunch of SQL expressions (aggregate or not) that are super useful and relevant and documented around the model

Absolutely agree here and I took from @drewbanin 's first comment that this was the intention of the original design. Although the metrics are outside of models: and have their own metrics: tag, each metric does map to one (and only one) model. What was the motivation for having the metrics declared outside the model @drewbanin ? Is it to enable the possibility for them to later be declared across models?

metrics:
  - name: new_customers
    label: New Customers
    model: dim_customers
    description: "The number of paid customers who are using the product"

There's clearly value in extending a DBT model with an array of valid expression, with pretty labels and markdown-rich descriptions.

I would challenge the idea of having virtual columns calculated in dbt .yml files. Is this sufficiently different from having a model of type view, which contains all my dimensions and columns needed for metrics? I think virtual columns already have a home in dbt .sql models.

This would be different for calculated metrics

  • calculated dimensions: transformations that take place prior to rollups should live in the dbt model itself
  • calculated metrics: transformations that take place after rollups should live in a metric/semantic layer and are declared in yml

@buremba
Copy link
Contributor

buremba commented Oct 29, 2021

From @owlas:

I would challenge the idea of having virtual columns calculated in dbt .yml files. Is this sufficiently different from having a model of type view, which contains all my dimensions and columns needed for metrics? I think virtual columns already have a home in dbt .sql models.

I believe that there are some good use-cases of virtual columns / dimensions because of the dynamic nature of it.

  1. .sql files are compiled once so we won't be able to change the expression in a dynamic way at runtime in serving layer. For example, Looker supports Liquid parameters in dimension definitions and the data analysts reference variables in their dimension such as the user timezone, the report fields (see in_query) to build complex reports such as cohorts, funnels, etc.

  2. In some cases, we might not want to materialize the columns in the database due to performance reasons especially when we're using row-oriented databases such as Postgresql.

@mistercrunch
Copy link

I would challenge the idea of having virtual columns calculated in dbt .yml files. Is this sufficiently different from having a model of type view, which contains all my dimensions and columns needed for metrics? I think virtual columns already have a home in dbt .sql models.

I'm not convinced as you then would have to systematically duplicate models for the sake of these "virtual columns" (which are pretty common imho), potentially doubling the size of your DAGs creating more indirections. Maybe DBT compiles in that form physically in the database (with an extra view on top), but hopefully we don't have to create a new file and carry the list of fields forward (or whatever DRY version of that) for that sake.

@bashyroger
Copy link

On the topic of Joins:
While this indeed can get complicated, I think support is vital as it makes it possible do build a proper Dimensional Model / Kimball star model with conformed dimensions.

A first implementation of join logic could be made easier in the following way:

  • Joins are only valid when the are made between a Fact table and Dimension table, They therefore should be many to one always!
  • Metric calculations can only be created on the central fact table

As in: you should NOT want to implement logic to resolve casm / fan traps! (https://biguru.wordpress.com/2008/05/01/its-all-in-the-universe-handling-chasm-and-fan-traps/)

@nhandel
Copy link

nhandel commented Nov 4, 2021

Nick here from transform.co. I want to share some learnings that we're building off of at Transform from working on metrics tools at Airbnb. We would be eager consumers of these definitions and support the development of a more universal spec. A snippet of our current spec can be seen here. We'll do a better job of pushing out the full spec and the opinions we're building on soon.

At a high level, I'm excited about this direction for dbt. Centralizing this information and creating strong ties with the data transformation pipelines is an exciting direction. The opportunity to connect metrics to their data lineage through the entire data modeling process and then downstream is compelling and will benefit end-users. This also presents an exciting opportunity to solve one of the more significant challenges with metrics frameworks, building out a dependency graph.

Some good ideas in this thread will push this in the right direction, but some gaps are fairly similar to the gaps that Airbnb ran into with early iterations. I'm skeptical of the idea that starting simply and then evolving this spec will work. Metric specs, in particular, are complicated because of the number of end applications that consume from them and the flexibility required to serve them all. There have been multiple backward-incompatible evolutions of metrics frameworks across Airbnb and other companies, and the ones I participated in were all painful. This is a lot of logic to migrate, and it's not always 1-1.

Philosophical Level

Before diving in too deep, I'm sensing a philosophical gap between some of the writing in this thread and how larger companies who have built this internally have used this type of technology. Specifically, Airbnb's framework (and the older version) evolved to be thought of as a "denormalization engine" and I get the sense that there are advocates here for a different direction (no virtualization, push to rely on manual modeling for complexities, dimensions and joins as a second step in spec design, etc.). I fear that the impact will be limited if end-users are expected to define these metrics on top of denormalized tables. That's not to say that these ideas are right or wrong, but they do deviate from a path that has been tested. If the context on the evolution of Airbnb's spec is helpful, I'm happy to share more.

Here is a good explanation of why you should pass flexibility to frameworks like this. The virtualization of logic is seen as a plus for some of the reasons @mistercrunch mentioned above. If the abstractions are chosen thoughtfully, you can end up with DRYer code and a more flexible framework that can handle more of the boiler-plate logic associated with metric construction.

Joins / Identifiers

Joins are by far the most complicated component of this concept. The core difference between the way we have thought about this vs. others in this space and the posts I've seen here is that we build the joins for the end-user rather than the framework built on top of this spec needing to be robust enough to handle fan-out and chasm joins. I don't think symmetric aggregates in SQL construction are necessary. We get around this by using "identifiers," which are the join keys and their types (primary, foreign, unique) as joinable entities rather than defining join paths one by one.

The framework determines the join that the requested data set requires. This is a much more flexible and DRY abstraction that enables the framework to manage the complexity around joins. This pushes the optimization to the framework with all of the above benefits around declarative languages with flexible planning engines.

It's probably not worth digging in too deep here since this appears to be a future discussion point, but this is one of the more significant differences of how we've approached building Transform relative to frameworks like LookML or most other semantic layers.

Dimensions

First, since this is a loaded and ambiguous term, I define a dimension as an attribute of an entity/identifier... i.e., user's country, product's type, etc., or that entity itself. I have pretty strong opinions on this one -- A metrics spec should treat a dimension as a first-class object. Treating a dimension as an attribute of the metric will become quite repetitive as companies adopt this at any form of scale. This goes back to the philosophical idea above, but if you are required to pull dimensions into every metric, business consumers will become frustrated, and the analytics engineer will become a blocker to the outcomes this layer aims to enable.

Hierarchies: Folks have shared some interesting ideas here. This is an important concept and shouldn't be ignored. The expression of time granularities is a dimension hierarchy. There are many more types of dimension hierarchies, and these should be expressible by the user. Everyone's business has unique dimension hierarchies(e.g., Sales Region → Territory → Market), and this needs to be an abstraction of its own that looks similar to a nestable case statement to get real DRY. One way to think about a dimensional hierarchy is that it is a discretely aggregatable object rather than a measure that is a continuously aggregatable object.

Time granularity: Except for a point-in-time metric (e.g., 2021 revenue or 2021 revenue by country), metrics only make sense in the context of time. We've had a lot of debate about representing time and whether time itself should be a component of the metric. Some metrics require time in their definition (i.e., a cumulative metric), and some may make sense in the context of a single timestamp (transactions by last updated vs. transactions by created at)

Measure and Metrics

The differentiation you identified in your first post between measures and metrics is close to how we defined these two objects, although we use both. Put simply, metrics are functions of measures. The measure & metric abstraction allows us to maintain a flatter dependency structure and increase the reuse of objects.

We define measures as aggregatable expressions (columns or SQL statements that fit in a select) and an aggregation type. What is defined in the original spec as a metric would be measure and a measure metric type in our framework. We define metrics as functions that take measures and other parameters as inputs. They can be as simple as a proxy of a measure or more complicated like cumulative or rolling window metrics.

Reusing metrics or deriving metrics from other metrics is another idea that seems appealing, but it can become quite a mess, both technically and organizationally. There are a few reasons, but the most challenging part is generating nested metrics and joining in dimensions at different aggregation levels.

Summary

The goal of a spec in my mind should be to create a sufficient semantic understanding of the warehouse that tools can build the logic necessary to serve metrics to various applications. There are many applications for metrics, and that is a tall order and ultimately the challenge of defining a universal spec. That spec should also probably be exchangeable between data applications for them to do their analytical work.

I would love to help with this but also don't want to grandstand on these beliefs if there is a desire to go in a different direction. That was a very high level and possibly too short to fully articulate these points. There was so much pain and frustration learning how to make these configs organized and simple, and I would love to help shortcut that.

@gshank gshank mentioned this issue Nov 8, 2021
4 tasks
@jtcohen6
Copy link
Contributor

jtcohen6 commented Nov 8, 2021

We've got a first stab at teaching dbt about metrics (#4235), which we're going to be including in v1.0.0-rc1. This is an intentionally limited schema, just enough for this to be an experimental feature. We're reserving the right to add more over time.

Major-version one isn't usually the time for brand-new experimental features. We want to keep innovating, even as we lock in the foundation; and it's because of all our other efforts—toward stability, performance, and maturity—that we feel able to take this swing.

Needless to say, this conversation is far from over. I'm so appreciative of all your time and thoughtfulness in the threads above, and I'm excited to see our thinking on this topic continue to develop over the next weeks and months. I know there are already some exciting follow-on issues, to which I owe responses :)

gshank added a commit that referenced this issue Nov 8, 2021
* first cut at supporting metrics definitions

* teach dbt about metrics

* wip

* support partial parsing for metrics

* working on tests

* Fix some tests

* Add partial parsing metrics test

* Fix some more tests

* Update CHANGELOG.md

* Fix partial parsing yaml file to correct model syntax

Co-authored-by: Drew Banin <drew@fishtownanalytics.com>
jtcohen6 pushed a commit that referenced this issue Nov 9, 2021
* first cut at supporting metrics definitions

* teach dbt about metrics

* wip

* support partial parsing for metrics

* working on tests

* Fix some tests

* Add partial parsing metrics test

* Fix some more tests

* Update CHANGELOG.md

* Fix partial parsing yaml file to correct model syntax

Co-authored-by: Drew Banin <drew@fishtownanalytics.com>
@fabrice-etanchaud
Copy link

Concerning the implementation of "cubing/metrics" whatever the way they will be specified in dbt, it could be interesting to see how the unified star schema paradigm based on SQL UNIONing could help.
dbt's DAG and puppini's Oriented Data Model notion superpose and that could systematically prevent data traps.

@bashyroger
Copy link

One note on the naming of this functionality @jtcohen6 , I would propose to call this feature the (BI) semantic layer or Information semantic layer instead of the metrics layer.
A bit more abstract indeed, but a better fit.
Why? When I think about metrics, _I don't necessarily think about this functionality also incorporating 'logic' for joins, dimension hierarchy's and all the other more complex logic that is discussed in this thread... And I think most agree it should

@mike-weinberg
Copy link

Just putting this out there:

we cannot avoid joins forever

To weigh in a little more concretely, I strongly favor the semantics defined by @owlas in which dimensions/measures are defined on the model-column, and relationships are defined in exposures, albeit a first class semantics tag in place of the meta tag would be nice.

While the INTP in me would like to see an explicit syntax for asserting foreign keys in model-column properties so that lattices can be generated, the reality of most businesses is this:

just because a lattice is possible that does not mean that lattice is useful,

and so that pretty strongly suggests that using exposures to explicitly define intentional explores is a pragmatic solution

iknox-fa pushed a commit that referenced this issue Feb 8, 2022
* first cut at supporting metrics definitions

* teach dbt about metrics

* wip

* support partial parsing for metrics

* working on tests

* Fix some tests

* Add partial parsing metrics test

* Fix some more tests

* Update CHANGELOG.md

* Fix partial parsing yaml file to correct model syntax

Co-authored-by: Drew Banin <drew@fishtownanalytics.com>

automatic commit by git-black, original commits:
  855419d
iknox-fa pushed a commit that referenced this issue Feb 8, 2022
* first cut at supporting metrics definitions

* teach dbt about metrics

* wip

* support partial parsing for metrics

* working on tests

* Fix some tests

* Add partial parsing metrics test

* Fix some more tests

* Update CHANGELOG.md

* Fix partial parsing yaml file to correct model syntax

Co-authored-by: Drew Banin <drew@fishtownanalytics.com>

automatic commit by git-black, original commits:
  31c88f9
  6c6649f
  726004b
  855419d
  c69f28e
  d60f6bc
iknox-fa pushed a commit that referenced this issue Feb 8, 2022
* first cut at supporting metrics definitions

* teach dbt about metrics

* wip

* support partial parsing for metrics

* working on tests

* Fix some tests

* Add partial parsing metrics test

* Fix some more tests

* Update CHANGELOG.md

* Fix partial parsing yaml file to correct model syntax

Co-authored-by: Drew Banin <drew@fishtownanalytics.com>

automatic commit by git-black, original commits:
  2142e52
  56ae206
  855419d
@igorlukanin
Copy link

Integrations with dbt Metrics incoming!

I think that @owlas was the first one to mention Cube in this thread (thanks!), and I'm happy to share that now Cube is able to read metrics definitions from dbt and provide caching, access control, and a variety of APIs for downstream data consumers like BI tools, notebooks, custom apps, embedded analytics, etc.

Here's some details on this integration (https://cube.dev/blog/dbt-metrics-meet-cube/) and there's also a topic on dbt's forum where we can discuss this further (https://discourse.getdbt.com/t/combining-dbt-metrics-with-api-caching-and-access-control/4033).

@abdel-87
Copy link

Very happy with the integration released by Cube JS!!

To get back on the core of the dbt metrics feature.
When you look at metrics from a Star Schema perspective, something that is very widely used as a presentation layer for data warehouses, Joins are inevitable to have in scope.

Take just this simple example (where I appreciate the way Cube JS implemented support for joins), when you want to use dbt metrics as the source of truth.

metrics:

  • Total Revenue
  • Average Revenue

Dimensions:

  • Customer Country (ref('dim_customer'))
  • Product Category (ref('dim_product'))

You would want to calculate your metrics, and break them down by the specified dimensions that are not part of the current model. Dbt Metrics, should not force users to denormalize the star schema into one big table to be able to use the dbt metrics feature, what is now unfurtunately the case.

I think it would be very helpful to look on the Cube JS implementation, and take the lessons learned from there to be able to provide the same support for Joins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.