Add grouping/partitioning to relevant schema tests #450

emilyriederer · 2021-11-28T14:52:44Z

Describe the feature

Many of the data checks offered by dbt-utils could benefit from an optional group argument. This would run each check separately for each subgroup of the data, so that success or failure could be analyzed at the group level.

Specifically, I believe that this could be relevant to the following checks:

equal_rowcount()
recency()
at_least_one()
not_constant()
sequential_values() (mentioned in Add partition argument to sequential_values test #447)
non_null_proportion()
cardinality_equality()
unique_where() - Note in 2022: moved to dbt-core

The motivation for this is twofold:

Some checks can only be expressed by group (e.g. subscription dates are only sequential within a user)
Some checks are more rigorous by group (e.g. globally a field could be non-constant but it could still be errantly constant within a group)
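As a minimal sketch of the second point, assume a hypothetical `readings` table with a `station` grouping column and an `entries` value column (both names are made up for illustration, and this is plain SQL run through Python's `sqlite3`, not the dbt-utils implementation). The ungrouped query finds nothing wrong, while the grouped query flags the one station whose values never change:

```python
import sqlite3

# Hypothetical data: station A varies, station B is stuck at a single value.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table readings (station text, entries integer);
    insert into readings values
        ('A', 10), ('A', 25), ('A', 40),
        ('B', 7),  ('B', 7),  ('B', 7);
""")

# Ungrouped check: returns a row only if the column is constant table-wide.
global_failures = conn.execute("""
    select n_values
    from (select count(distinct entries) as n_values from readings) as counts
    where n_values = 1
""").fetchall()

# Grouped check: returns one row per group in which the column is constant.
grouped_failures = conn.execute("""
    select station, n_values
    from (
        select station, count(distinct entries) as n_values
        from readings
        group by station
    ) as counts
    where n_values = 1
""").fetchall()

print(global_failures)   # no failing rows: the column varies globally
print(grouped_failures)  # station B is reported as constant
```

Both queries follow the usual "return the failing rows" convention, so an empty result means the check passed.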

This issue is explored at greater length in my blog post. The post includes an example (NYC subway turnstile data) where these checks add incremental value. This is a follow up issue encouraged by @joellabes after reading the post.

Describe alternatives you've considered

I recognize there are downsides to this approach, but I believe these can be easily mitigated. These include:

Computational cost. Grouped checks could be exceedingly expensive to run, which could be annoying for DataOps-style testing purposes. However, this could be mitigated by labeling long-running tests so that they can be excluded from dbt test in certain settings.
API bloat. Any new feature makes an overall tool more complex and less navigable. However, I think any SQL user is very comfortable thinking in terms of groups/partitions so this does not seem overly burdensome.
Backwards compatibility. The reasonable default for these checks would be "no grouping", so this should not cause any backwards compatibility issues for existing users of the test modules.
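To illustrate the backwards-compatibility point, here is a rough sketch of a query builder whose grouping argument defaults to "no grouping". The function name `not_constant_sql` and the table and column names are hypothetical, and the actual change would live in the Jinja/SQL test macros rather than in Python. With the default, the generated SQL is the same ungrouped query an existing user would already get:

```python
def not_constant_sql(table, column, group_by_columns=()):
    """Build a query that returns failing rows for a not-constant check.

    With the default empty group_by_columns, the whole table is treated
    as a single group, so existing (ungrouped) usage is unchanged.
    """
    groups = ", ".join(group_by_columns)
    select_cols = f"{groups}, " if groups else ""
    group_clause = f" group by {groups}" if groups else ""
    inner = (
        f"select {select_cols}count(distinct {column}) as n_values "
        f"from {table}{group_clause}"
    )
    # A row is returned only where the column takes exactly one value.
    return f"select * from ({inner}) as counts where n_values = 1"


# Existing call sites keep working without passing the new argument.
ungrouped = not_constant_sql("readings", "entries")

# Opting in to grouping only adds the group columns and a group by clause.
grouped = not_constant_sql("readings", "entries", group_by_columns=["station"])
```

Because grouping is opt-in, the extra computational cost is also only paid by users who ask for it.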

Alternatives to this approach, to me, seem unappealing:

Users could write their own custom macros. However, these custom macros would be 90% the same as existing dbt-utils functionality, and users would not benefit from the superior testing and maintenance of dbt-utils.
Non-grouped checks could be added at an "upstream" point in the data pipeline. In theory, at some point in the data generation process, the data may already be created within a group and could be tested there before being unioned into the whole. However, this is not true in many cases, due either to the data-generating process or to limits on how far upstream the analytics engineer can move.
We could not do grouped checks. This is the most likely outcome if this feature is not implemented because hard things tend not to get done. Of course, no suite of data checks can test everything, and at some point one has to draw the line. However, I believe this feature adds enough value and is sufficiently general purpose that grouped checks can and should be encouraged (per the "pit of success" philosophy.)

Additional context

No. This feature is not platform-specific.

Who will this benefit?

I believe a wide variety of users can benefit because most data has some sort of inherent grouping structure. My blog post acknowledges not all of these groupings are equally likely to cause novel, group-specific data errors.

For more social proof, there's some evidence of this being a needed feature across data quality tools. It has been implemented in the R package pointblank (biased sample, though, because that was also my feature request) and requested in multiple issues (example) for Great Expectations (not by me this time!)

Are you interested in contributing this feature?

I'd be happy to make a PR modifying the SQL in the schema test macros. If changes are needed other places as well, I might need a push in the right direction.

joellabes · 2021-11-28T20:51:44Z

Thanks @emilyriederer for not one but two great writeups on this!

Totally on board with the concept that you should be able to test the quality of each group. I'm almost certain that I've had to do this in the past, but I don't remember how I solved it.

I'm also fine with enabling larger/slower queries, and leaving it to the good judgement of the analytics engineer as to whether it's worth paying the performance penalty to do more in-depth testing.

If you wanted to have a stab at one of the above macros, I'd be happy to work with you to get the integration tests etc up to scratch (there's some basic info here and here, I've got to put it all in a one-stop-shop soon 😬 ). After we've got one working end-to-end, scaling to the others should be pretty straightforward.

To set realistic expectations, Coalesce and dbt v1 mean that the next couple of weeks are going to be chaotic over here and I probably won't be able to be much help until the dust settles from that! But reach out either on this issue or in the Community Slack if you need any help and I'll get back to you 📬

emilyriederer added enhancement New feature or request triage labels Nov 28, 2021

joellabes added testing and removed triage labels Nov 28, 2021

joellabes mentioned this issue Nov 28, 2021

Add partition argument to sequential_values test #447

Closed

emilyriederer mentioned this issue Nov 29, 2021

Implement schema tests by group/partition (WIP - not ready for review) #451

Closed

15 tasks

emilyriederer mentioned this issue Aug 8, 2022

Implement group_by_columns argument for relevant tests #633

Merged

16 tasks

joellabes closed this as completed in #633 Aug 26, 2022
