Add grouping/partitioning to relevant schema tests #450

Closed
6 of 8 tasks
emilyriederer opened this issue Nov 28, 2021 · 1 comment · Fixed by #633
Labels
enhancement New feature or request testing

Comments

@emilyriederer
Contributor

emilyriederer commented Nov 28, 2021

Describe the feature

Many of the data checks offered by dbt-utils could benefit by having an optional group argument. This would allow checks to be done separately for each subgroup of data so that success/failure could be analyzed at the group level.

Specifically, I believe that this could be relevant to the following checks:

The motivation for this is two-fold:

  • Some checks can only be expressed by group (e.g. subscription dates are only sequential within a user)
  • Some checks are more rigorous by group (e.g. globally a field could be non-constant but it could still be errantly constant within a group)
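
To make the second point concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than dbt itself. The `turnstile` table, its columns, and the data are hypothetical and only loosely inspired by the turnstile example; the queries illustrate the idea of a "not constant" check and are not the dbt-utils implementation.

```python
import sqlite3

# Hypothetical toy data: station_b reports a stuck (constant) entry count.
rows = [
    ("station_a", 1, 10),
    ("station_a", 2, 25),
    ("station_a", 3, 40),
    ("station_b", 1, 7),
    ("station_b", 2, 7),
    ("station_b", 3, 7),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table turnstile (station text, reading_id integer, entries integer)"
)
conn.executemany("insert into turnstile values (?, ?, ?)", rows)

# Global check: returns a row (a failure) only if the column is constant overall.
global_failures = conn.execute(
    """
    select n_distinct
    from (select count(distinct entries) as n_distinct from turnstile)
    where n_distinct = 1
    """
).fetchall()

# Grouped check: returns one row per station whose values never change.
grouped_failures = conn.execute(
    """
    select station, n_distinct
    from (
        select station, count(distinct entries) as n_distinct
        from turnstile
        group by station
    )
    where n_distinct = 1
    order by station
    """
).fetchall()

print("global failures:", global_failures)
print("grouped failures:", grouped_failures)
```

In this toy data the global check passes because the column takes several values overall, while the grouped check flags the one station whose readings never change.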

This issue is explored at greater length in my blog post. The post includes an example (NYC subway turnstile data) where these checks add incremental value. This is a follow up issue encouraged by @joellabes after reading the post.

Describe alternatives you've considered

I recognize there are downsides to this approach, but I believe these can be easily mitigated. These include:

  • Computational cost of grouped checks could be exceedingly high. This could be annoying for DataOps-style testing purposes. However, this could be mitigated by labeling long-running tests so that they can be excluded from dbt test in certain settings.
  • API bloat. Any new feature makes an overall tool more complex and less navigable. However, I think any SQL user is very comfortable thinking in terms of groups/partitions so this does not seem overly burdensome.
  • Backwards compatibility. The reasonable default for these checks would be "no grouping", so this should not cause any backwards compatibility issues for existing users of the test modules.
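
The backwards-compatibility point can be sketched as a query builder whose grouping argument defaults to "no grouping". This is an illustration in Python with sqlite3, not the dbt-utils macro code; the function name `not_constant_sql`, the argument name `group_by_columns`, and the `readings` table are all assumptions made for the example.

```python
import sqlite3


def not_constant_sql(table, column, group_by_columns=()):
    """Build a failure query for a 'not constant' check.

    `group_by_columns` is a hypothetical argument name. With the default
    (empty) value the query is the plain, ungrouped check.
    Identifiers are interpolated directly, so this is for trusted input only.
    """
    group_cols = ", ".join(group_by_columns)
    select_cols = f"{group_cols}, " if group_cols else ""
    group_clause = f"group by {group_cols}" if group_cols else ""
    return f"""
        select * from (
            select {select_cols}count(distinct {column}) as n_distinct
            from {table}
            {group_clause}
        ) where n_distinct = 1
    """


conn = sqlite3.connect(":memory:")
conn.execute("create table readings (grp text, val integer)")
conn.executemany(
    "insert into readings values (?, ?)",
    [("a", 1), ("a", 2), ("b", 5), ("b", 5)],
)

# Existing callers pass no grouping and get the same behavior as before.
ungrouped = conn.execute(not_constant_sql("readings", "val")).fetchall()

# Callers who opt in get one failure row per constant group.
grouped = conn.execute(
    not_constant_sql("readings", "val", group_by_columns=("grp",))
).fetchall()

print("ungrouped failures:", ungrouped)
print("grouped failures:", grouped)
```

Because the grouping argument is optional and empty by default, a call that omits it produces the same query as a check with no grouping support at all.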

Alternatives to this approach, to me, seem unappealing:

  • Users could write their own custom macros. However, the functionality of these custom macros would be 90% the same as existing dbt-utils functionality, and users would not benefit from the superior testing and maintenance of dbt-utils.
  • Non-grouped checks could be added at an "upstream" point in the data pipeline. In theory, at some point in the data generation process, the data may already exist separately for each group and could be tested before being unioned into the whole. However, this is not true in many cases, due either to the data-generating process or to limits on the analytics engineer's ability to move upstream.
  • We could not do grouped checks. This is the most likely outcome if this feature is not implemented because hard things tend not to get done. Of course, no suite of data checks can test everything, and at some point one has to draw the line. However, I believe this feature adds enough value and is sufficiently general purpose that grouped checks can and should be encouraged (per the "pit of success" philosophy).

Additional context

No. This feature is not platform-specific.

Who will this benefit?

I believe a wide variety of users can benefit because most data has some sort of inherent grouping structure. My blog post acknowledges not all of these groupings are equally likely to cause novel, group-specific data errors.

For more social proof, there's some evidence of this being a needed feature across data quality tools. It has been implemented in the R package pointblank (biased sample, though, because that was also my feature request) and requested in multiple issues (example) for Great Expectations (not by me this time!)

Are you interested in contributing this feature?

I'd be happy to make a PR modifying the SQL in the schema test macros. If changes are needed other places as well, I might need a push in the right direction.

@emilyriederer emilyriederer added enhancement New feature or request triage labels Nov 28, 2021
@joellabes
Contributor

Thanks @emilyriederer for not one but two great writeups on this!

Totally on board with the concept that you should be able to test the quality of each group. I'm almost certain that I've had to do this in the past, but I don't remember how I solved it.

I'm also fine with enabling larger/slower queries, and leaving it to the good judgement of the analytics engineer as to whether it's worth paying the performance penalty to do more in-depth testing.

If you wanted to have a stab at one of the above macros, I'd be happy to work with you to get the integration tests etc up to scratch (there's some basic info here and here, I've got to put it all in a one-stop-shop soon 😬 ). After we've got one working end-to-end, scaling to the others should be pretty straightforward.

To set realistic expectations, Coalesce and dbt v1 mean that the next couple of weeks are going to be chaotic over here and I probably won't be able to be much help until the dust settles from that! But reach out either on this issue or in the Community Slack if you need any help and I'll get back to you 📬
