-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Derived metric can result in null if one of the components is null #764
Comments
Thank you for providing all of the context behind the issue as you've experienced it, it's very helpful! I agree zero-substitution would be the correct thing to do in most of these cases. The main issue we face with defaulting to zero-substitution, even for cases like this, is there are scenarios where a NULL indicates a data error and zero-substitution would mask it with a return value that might be within a range where the final metric appears to be correct even though it is actually wrong. These scenarios are not always detectable from semantic analysis of the expr value itself (which we don't do anyway), and so we're kind of stuck with returning NULL by default in order to communicate to end consumers that something is, indeed, broken. Of course our proposed solution in #759 adds configuration overhead for the many cases where returning 0 is more correct than returning NULL, and it puts people in an awkward place where they're getting results that are clearly wrong from some queries but not others until they add the relevant configuration parameter. Such annoying configuration quirks are things we can and should address. To that end, I think there is a feature request in here which we should consider once #759 is available, which is to find a way to simplify these more complex derived metric configurations for exprs that are basically just adding things together. The idea is something similar to RATIO, but for other common operations. That metric type (or types) could conceivably allow for a default override for zero-substitution, making it much more parsimonious than tagging every input metric separately. If this all seems reasonable I'll be happy to update the description here - with the excellent context you've provided as an example of where having to opt in to zero substitution on every input metric gets painful and perhaps a bit silly - as a feature request for better ways to configure derived metric expressions for common operations, including support for more natural zero-substitution behavior. Please let us know what you think! |
So you are proposing a generic feature to define null behaviour + dedicated derived metric types for known operations where the null behaviour has already been configured appropriately. Sounds good to me. Feel free to update the issue, but it would be nice if the original content did not get completely lost :) |
The first of these (defining null behavior as specified in #759) has been released with MetricFlow 0.203.0. Just to preserve history here I'm going to close this issue and open a new backlog feature request for the follow-up of making it easier to define metric types that do zero-coalescing when it makes sense. |
Is this a new bug in metricflow?
Current Behavior
There is a derived metric:
derived_metric = metric1 + metric2
After some debugging I noticed that if metric2 is null then derived_metric is also null. Which makes sense technically, but in case of this particular metric does not at all make sense business wise. The example I'm working with is something like
total_price = base_price + extra_fees
and obviously I don't want the extra fees being null make the total price null.
Probably related #759
Expected Behavior
A derived metric that adds two or more metrics should not result in null if one of the components is null.
Steps To Reproduce
Generating sample data:
Notice that test_order_id = 3 does not have a row for delivery price.
Semantic model and metrics:
Query without group by:
Returns the expected output:
Query with group by:
Returns total_price = null for 2023-08-03
although it should be 15.
I have reviewed the compiled SQL (only for total_price, less code to review).
This is the original code:
I have a modified version that would output the expected result.
I have made two changes:
I am not 100% sure that this would be a good solution for all types of derived metrics, but for
sum
type metrics this seems like a good approach.Relevant log output
No response
Environment
Which database are you using?
No response
Additional Context
Databricks
The text was updated successfully, but these errors were encountered: