Describe the feature
Many of the data checks offered by dbt-utils could benefit from an optional group argument. This would allow checks to be run separately for each subgroup of data, so that success or failure could be analyzed at the group level.
Specifically, I believe that this could be relevant to the following checks:
The motivation for this is two-fold:
Some checks can only be expressed by group (e.g. subscription dates are only sequential within a user).
Some checks are more rigorous by group (e.g. a field could be non-constant globally but still be erroneously constant within a group).
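To make the second motivation concrete, here is a minimal sketch in Python using the standard-library sqlite3 module rather than dbt itself. The table `turnstile`, its columns, and the helper `not_constant_failures` are hypothetical, and the SQL only approximates what a grouped "not constant" check might look like; it is not the dbt-utils implementation.

```python
import sqlite3


def not_constant_failures(conn, table, column, group_by=None):
    """Return one row per failing group; an empty list means the check passes.

    A group fails when `column` has exactly one distinct value within it.
    With group_by=None the whole table is treated as a single group, which
    mirrors the ungrouped behavior and would be the default.
    """
    # Identifiers are interpolated directly. That is acceptable for a sketch
    # but should not be done with untrusted input.
    group_cols = f"{group_by}, " if group_by else ""
    group_clause = f"group by {group_by}" if group_by else ""
    sql = f"""
        select * from (
            select {group_cols}count(distinct {column}) as n_distinct
            from {table}
            {group_clause}
        ) as counts
        where n_distinct = 1
    """
    return conn.execute(sql).fetchall()


# Hypothetical data: station A varies, station B is stuck at one value.
conn = sqlite3.connect(":memory:")
conn.execute("create table turnstile (station text, entries integer)")
conn.executemany(
    "insert into turnstile values (?, ?)",
    [("A", 10), ("A", 25), ("A", 40), ("B", 7), ("B", 7), ("B", 7)],
)

# Ungrouped: four distinct values overall, so nothing is flagged.
print(not_constant_failures(conn, "turnstile", "entries"))
# -> []

# Grouped: station B is constant and is reported as a failure.
print(not_constant_failures(conn, "turnstile", "entries", group_by="station"))
# -> [('B', 1)]
```

The only difference between the two calls is the optional `group_by` argument, so leaving it out keeps the existing behavior.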
This issue is explored at greater length in my blog post. The post includes an example (NYC subway turnstile data) where these checks add incremental value. This is a follow-up issue encouraged by @joellabes after reading the post.
Describe alternatives you've considered
I recognize there are downsides to this approach, but I believe these can be easily mitigated. These include:
Computational cost. Grouped checks could be expensive to run, which could be annoying for DataOps-style testing. However, this could be mitigated by labeling long-running tests so that they can be excluded from dbt test in certain settings.
API bloat. Any new feature makes an overall tool more complex and less navigable. However, I think any SQL user is very comfortable thinking in terms of groups/partitions so this does not seem overly burdensome.
Backwards compatibility. The reasonable default for these checks would be "no grouping", so this should not cause any backwards compatibility issues for existing users of the test modules.
Alternatives to this approach, to me, seem unappealing:
Users could write their own custom macros. However, these custom macros would be 90% the same as existing dbt-utils functionality, and users would not benefit from the superior testing and maintenance of dbt-utils.
Non-grouped checks could be added at an "upstream" point in the data pipeline. In theory, at some point in the data generation process, data may already be created within a group and could be tested before being unioned into the whole. However, this is not true in many cases, due either to the data-generating process or to the limits on how far upstream an analytics engineer can move.
We could simply not do grouped checks. This is the most likely outcome if this feature is not implemented, because hard things tend not to get done. Of course, no suite of data checks can test everything, and at some point one has to draw the line. However, I believe this feature adds enough value and is sufficiently general-purpose that grouped checks can and should be encouraged (per the "pit of success" philosophy).
Additional context
This feature is not platform-specific.
Who will this benefit?
I believe a wide variety of users can benefit because most data has some sort of inherent grouping structure. My blog post acknowledges not all of these groupings are equally likely to cause novel, group-specific data errors.
For more social proof, there's some evidence that this feature is needed across data quality tools. It has been implemented in the R package pointblank (biased sample, though, because that was also my feature request) and requested in multiple issues (example) for Great Expectations (not by me this time!).
Are you interested in contributing this feature?
I'd be happy to make a PR modifying the SQL in the schema test macros. If changes are needed in other places as well, I might need a push in the right direction.
Thanks @emilyriederer for not one but two great writeups on this!
Totally on board with the concept that you should be able to test the quality of each group. I'm almost certain that I've had to do this in the past, but I don't remember how I solved it.
I'm also fine with enabling larger/slower queries, and leaving it to the good judgement of the analytics engineer as to whether it's worth paying the performance penalty to do more in-depth testing.
If you wanted to have a stab at one of the above macros, I'd be happy to work with you to get the integration tests etc. up to scratch (there's some basic info here and here, I've got to put it all in a one-stop-shop soon 😬 ). After we've got one working end-to-end, scaling to the others should be pretty straightforward.
To set realistic expectations, Coalesce and dbt v1 mean that the next couple of weeks are going to be chaotic over here and I probably won't be able to be much help until the dust settles from that! But reach out either on this issue or in the Community Slack if you need any help and I'll get back to you 📬