-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dbt metrics submodule #28
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % some comments
integration_tests/dbt_metrics/models/trino__custom_calendar.sql
Outdated
Show resolved
Hide resolved
metric_testing_models: | ||
+materialized: table | ||
|
||
# no median function in Trino |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could use other function to calculate median
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use approx_percentile function. But in order to calculate median for odd and even number of values in the column, we need to add workaround, as using just approx_percentile(col_name, 0.50)
for even number of values in the column is not working (workaround source):
Create sample table with even nr of rows:
CREATE TABLE memory.default.test_table AS (
select 1 AS col_name
union all
select 2 AS col_name
);
Calculate median:
SELECT
IF (
COUNT(*) % 2 = 0,
(approx_percentile(col_name, 0.499999) + approx_percentile(col_name, 0.50))/2.0,
approx_percentile(col_name, 0.50)
) AS median_value
FROM memory.default.test_table;
So, it is a bit of code to calculate median in Trino. For sure the best way would be to implement MEDIAN
in Trino. There is issue for that, but it is very stale and gained no attention trinodb/trino#6309
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just create a follow-up issue and include this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything wrong with getting median by value_at_quantile(tdigest_agg(col_name), 0.5)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @damian3031
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SeungHuLee your solution has the same downside as using approx_percentile alone - it doesn't work if number of rows is even, and two middle values are not the same numbers. In this case, function should return arithmetic mean of these middle values. In your solution, it returns the bigger value.
So, running this query on table which I defined in my previous comment:
SELECT
value_at_quantile(tdigest_agg(col_name), 0.5) as median_value
FROM memory.default.test_table;
returns 2.0
, but correct result is 1.5
.
That's the reason why above mentioned workaround is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@damian3031 Thanks for detailed answer! I'm not really a statistics expert, so I was just using tdigest methods according to following:
trinodb/trino#5158
trinodb/trino#4975
https://arxiv.org/pdf/1902.04023.pdf
By the way, when I use the workaround suggested above, for me the median result changes slightly everytime I execute the same query.
Is there more stable or accurate solution suggested?
It is needed to override fact_orders_source seed, as Trino can't cast string formatted 'mm/dd/yyyy' to date.
9f7f4f8
to
87ae9d7
Compare
Tests are red. |
1b85a3b
to
3969d67
Compare
3969d67
to
1efeddd
Compare
resolves #26
Add dbt metrics in dbt-trino-utils:
trino__fact_orders_source.csv
, which have date format that is understood by Trino