Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dbt metrics submodule #28

Merged
merged 6 commits into from
May 26, 2023
Merged

Add dbt metrics submodule #28

merged 6 commits into from
May 26, 2023

Conversation

damian3031
Copy link
Member

@damian3031 damian3031 commented May 18, 2023

resolves #26
Add dbt metrics in dbt-trino-utils:

  • add dbt-metrics submodule
  • bump Trino and SEP versions
  • add integration tests. Lots of models from dbt-metrics needs to be overridden with models pointing to trino__fact_orders_source.csv, which have date format that is understood by Trino
  • add trino shim for dbt-metrics gen_calendar_join macro

@damian3031 damian3031 requested review from aalbu, mdesmet and hovaesco May 18, 2023 08:41
@damian3031 damian3031 linked an issue May 18, 2023 that may be closed by this pull request
Copy link

@hovaesco hovaesco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % some comments

integration_tests/dbt_metrics/dbt_project.yml Show resolved Hide resolved
metric_testing_models:
+materialized: table

# no median function in Trino

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could use other function to calculate median WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use approx_percentile function. But in order to calculate median for odd and even number of values in the column, we need to add workaround, as using just approx_percentile(col_name, 0.50) for even number of values in the column is not working (workaround source):

Create sample table with even nr of rows:

CREATE TABLE memory.default.test_table AS (
select 1 AS col_name
union all
select 2 AS col_name
);

Calculate median:

SELECT
IF (
    COUNT(*) % 2 = 0,
    (approx_percentile(col_name, 0.499999) + approx_percentile(col_name, 0.50))/2.0,
    approx_percentile(col_name, 0.50)
) AS median_value
FROM memory.default.test_table;

So, it is a bit of code to calculate median in Trino. For sure the best way would be to implement MEDIAN in Trino. There is issue for that, but it is very stale and gained no attention trinodb/trino#6309
WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just create a follow-up issue and include this code.

Copy link

@SeungHuLee SeungHuLee Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything wrong with getting median by value_at_quantile(tdigest_agg(col_name), 0.5)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeungHuLee your solution has the same downside as using approx_percentile alone - it doesn't work if number of rows is even, and two middle values are not the same numbers. In this case, function should return arithmetic mean of these middle values. In your solution, it returns the bigger value.

So, running this query on table which I defined in my previous comment:

SELECT
value_at_quantile(tdigest_agg(col_name), 0.5) as median_value
FROM memory.default.test_table;

returns 2.0, but correct result is 1.5.

That's the reason why above mentioned workaround is needed.

Copy link

@SeungHuLee SeungHuLee Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@damian3031 Thanks for detailed answer! I'm not really a statistics expert, so I was just using tdigest methods according to following:
trinodb/trino#5158
trinodb/trino#4975
https://arxiv.org/pdf/1902.04023.pdf
By the way, when I use the workaround suggested above, for me the median result changes slightly everytime I execute the same query.
Is there more stable or accurate solution suggested?

damian3031 and others added 3 commits May 18, 2023 17:09
@damian3031 damian3031 force-pushed the add-dbt-metrics-package-support branch from 9f7f4f8 to 87ae9d7 Compare May 18, 2023 15:09
@hovaesco
Copy link

Tests are red.

@damian3031 damian3031 force-pushed the add-dbt-metrics-package-support branch 2 times, most recently from 1b85a3b to 3969d67 Compare May 19, 2023 09:30
@damian3031 damian3031 force-pushed the add-dbt-metrics-package-support branch from 3969d67 to 1efeddd Compare May 19, 2023 09:43
@hovaesco hovaesco merged commit d281977 into main May 26, 2023
@hovaesco hovaesco deleted the add-dbt-metrics-package-support branch May 26, 2023 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add dbt metrics package support
4 participants