This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Unit Testing SQL in DBT #2354
Comments
This indeed would be useful to have for all orgs where a focus on data quality is of the utmost importance. AFAIK this is a hard problem to solve on the people / processes side of things (as you mention) and not something that has been done before for DATA unit testing.
Happy to find this issue! This is also an enhancement that would be useful to my team. I think this type of testing falls outside of the two existing kinds of dbt tests: schema tests and data tests. I've implemented a form of unit testing in my company's codebase. It currently executes via pytest to test pl/pgsql transformations on Postgres, but I think the technique could be adapted to other databases and dbt.
Implementation sketch
My test suite folder looks like this:
The algorithm looks as follows:
The approach is probably similar to what @MichelleArk has reported with CSVs. These tests are cumbersome to set up, and I haven't been able to convince my team to do this kind of testing yet. :-) I see that Dataform has unit testing. I guess one advantage of their implementation is that they are generating the test dataset in the database. Since I am defining the data in YAML, there could be issues translating data types from YAML into the database under test.
Hello! I would love to see this feature as part of dbt core. I created a small proof of concept: https://discourse.getdbt.com/t/dbt-model-think-unit-tests-poc/2160 One of the core design constraints was the ability to exercise models one at a time. This means that the framework needed to provide some mechanism for stubbing out ref/source. The approach I took with the MVP listed above was to namespace the stubbed tables with a prefix, which is set as an environment variable. The following describes the logical steps the MVP test harness takes to stub out ref/source and provide test-defined data:
This allows very focused model ("unit") tests. Tests configure a couple of rows of stub data, exercise the model, and then assert on the output using a Python dataframe. This allows for targeted, fast testing of model transformation code. I'm most likely going to move forward with this approach at my day job. If anyone is interested, it should be relatively easy to convert this Python approach to a "configuration" YAML approach. I would love to hear your thoughts.
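The prefix-based stubbing described in this comment can be sketched roughly as follows. This is not the code from the linked proof of concept: the environment variable name DBT_STUB_PREFIX, the helper names, and the use of an in-memory SQLite database are all assumptions made for illustration, and the output is checked as plain tuples rather than a dataframe.

```python
import os
import re
import sqlite3

# Hypothetical variable name; the comment above does not say what it is called.
PREFIX_ENV_VAR = "DBT_STUB_PREFIX"

# Matches the simple form {{ ref('model_name') }} only.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def render_with_stubs(model_sql):
    """Replace each {{ ref('name') }} with '<prefix>name' using the env prefix."""
    prefix = os.environ.get(PREFIX_ENV_VAR, "")
    return REF_PATTERN.sub(lambda m: prefix + m.group(1), model_sql)


def run_model_against_stubs(model_sql, stub_tables):
    """Load stub rows into prefixed tables, run the model, return its rows."""
    prefix = os.environ.get(PREFIX_ENV_VAR, "")
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in stub_tables.items():
        conn.execute(f"CREATE TABLE {prefix}{name} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {prefix}{name} VALUES ({placeholders})", rows)
    return conn.execute(render_with_stubs(model_sql)).fetchall()


os.environ[PREFIX_ENV_VAR] = "test_"
model = "SELECT id, amount * 2 AS doubled FROM {{ ref('orders') }} ORDER BY id"
result = run_model_against_stubs(
    model, {"orders": (["id", "amount"], [(1, 10), (2, 15)])}
)
```

Because only the prefixed tables exist in the test database, the model is exercised on its own, without building any upstream model.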
Hi! I wanted to continue the conversation from #2740. I have been playing around with a way to automate this and I have a working concept here: https://github.com/jmriego/dbt-bdd The tests are run with behave, a library for BDD testing, which in my opinion is a great fit for DBT as it makes the tests easy for analysts to understand, the same way it already does for ELT.
Scenario: run a sample unit test
Then, it will replace all refs to calendar with abcd124_calendar. This is really the main concept and I didn't find a better solution, but it does so by passing to dbt a var with the following key and value: {calendar: abcd124}. The code that detects the reference is here: https://github.com/jmriego/dbt-bdd/blob/master/macros/ref.sql @dm03514, I see you also created something similar, but with pytest.
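To make the var-based replacement concrete, here is a minimal Python imitation of the idea. The real logic lives in the dbt macro linked above. The function name resolve_refs and the regular expression are hypothetical, and only the {calendar: abcd124} mapping comes from the comment.

```python
import re

# Matches the simple form {{ ref('model_name') }} only.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def resolve_refs(model_sql, test_prefixes):
    """Swap ref('x') for '<prefix>_x' when x has a prefix, else keep 'x'."""

    def replace(match):
        name = match.group(1)
        prefix = test_prefixes.get(name)
        return f"{prefix}_{name}" if prefix else name

    return REF_PATTERN.sub(replace, model_sql)


sql = "SELECT * FROM {{ ref('calendar') }} JOIN {{ ref('sales') }} USING (day)"
resolved = resolve_refs(sql, {"calendar": "abcd124"})
```

Models without an entry in the mapping (sales here) keep their normal name, so a test only has to provide data for the refs it wants to stub.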
We've been experimenting with unit tests. We've decided to (probably) use SQL mocks rather than seeds because they're faster. For any given model we have one or more __source.sql files and an __expected.sql. It took me a while to realise that there's a fundamental paradox: either I have to deploy the model and change its sources, or I have to have different "versions" of the model itself pointing at different sources, because the source needs to be instantiated and the model deployed before it can be tested. Ideally, though, this would be easier to control with config.
Hi @reubster! How are you creating those SQL mocks? Do you mean that people writing the tests need to create fake source models and expected values with a SQL query similar to this?
Getting (static) test data into the database is something that will be test dependent: sometimes you want small test data, and then writing SQL that mocks the data is doable (select ... union all select ...); medium data sets fit well inside YAML files; and large files can be provisioned with regular dbt tooling practices. Since all of them have different approaches / tooling, I would rather see a solution to unit testing where we can get a model's SQL code (parsed) but where refs and sources (potentially variables as well) can be overridden in a test-local scope. Assume such a macro exists (where we can get the compiled SQL code) and that it is called
Contrived example
SELECT
  *,
  a+b AS sum
FROM {{ ref('some_other_model') }}
This example doesn't have ref/source overrides but rather does a simple string replace, but you get the general idea. How the data is sourced in
This is mostly a copy from this thread, where I've tried to get some feedback on this approach: https://discourse.getdbt.com/t/testing-with-fixed-data-set/564/9
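The string-replace idea can be tried outside dbt with a few lines of Python. This is only a sketch under stated assumptions: compile_with_mock is a hypothetical helper (not a dbt macro), the mock rows are made up, and SQLite stands in for the warehouse.

```python
import sqlite3

# The contrived model from the comment above.
MODEL_SQL = """
SELECT
  *,
  a + b AS sum
FROM {{ ref('some_other_model') }}
"""

# Small mock written as SELECT ... UNION ALL SELECT ..., values invented here.
MOCK_SQL = "(SELECT 1 AS a, 2 AS b UNION ALL SELECT 10 AS a, 20 AS b)"


def compile_with_mock(model_sql, ref_name, mock_sql):
    """Plain string replace of one ref call with an inline mock subquery."""
    return model_sql.replace("{{ ref('" + ref_name + "') }}", mock_sql)


compiled = compile_with_mock(MODEL_SQL, "some_other_model", MOCK_SQL)
rows = sqlite3.connect(":memory:").execute(compiled).fetchall()
```

Since the mock is inlined as a subquery, nothing has to be created in the database before the model's logic can be checked.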
Hi @Zatte, I really like that approach. It definitely feels more DBT-onic than what I was proposing. As you say, there might be multiple ways of filling data for testing depending on the size of the tests. Nothing stops the YAML I was proposing from generating these test SQLs automatically, so these two approaches are not mutually exclusive.
I see the preference is to have the mock data in some file in the repo. I am curious as to why not have a different database / schema for the mock data, e.g. raw_mock_data, and replace sources with that, maybe using a var, and add a tag to these tests so you can include/exclude them on a given run.
I personally would like all tests to be able to run using just
I think this approach can work in many situations but not all. If you can only swap out the schema then you are limited to swapping 1:1 between production/mock data. What if you want to test a model using different mocks and/or which depends on 2 or more tables (let's call them A, B)? Testing with mocks A1, A2, B1, B2, B3 and combinations of these would be difficult.
Makes sense, thanks for clarifying.
Hi, I'm doing unit tests in dbt with a couple of custom macro helpers, though the setup has some trade-offs and impractical aspects.
It looks good:
But it has a couple of flaws:
So it's far from a perfect setup. I was having a look at Dataform and they have the concept of unit tests as a feature. In Dataform, for each unit test, we need to define the model that we want to test (as in my approach) and we always need to provide the input data for each model used by the model_to_test. Rewriting my initial test in Dataform looks like this:
That's neat imo. I was impressed with the Dataform approach, and I think an approach like that is the way to go for dbt. I think this is slightly hard to put in a PR as an 'outsider'. Could someone from the dbt team please share what the roadmap for unit tests is?
Just chiming in to link a solid slack thread, prompted by the comment above. This is a topic I'm very interested in — and would be interested in revisiting, in earnest, next year |
Hi.
The unit test is composed of 4 separate parts:
Under the hood the unit_test macro constructs a big SQL query which doesn't depend on models, only on the inputs. That means we don't need to run dbt to refresh the models each time we want to test a new change, so the feedback loop takes seconds. We solved our main problem, which was mocking the sources of a model, and we also improved the feedback loop. Still, we have a couple of ideas for improvement based on our customers' feedback:
We'll share the custom macros in Equal Experts GitHub.
Describe the feature
In addition to the existing data test support DBT provides, it would be great if users had the capability to write unit tests to assert model behaviour generally and in edge cases. These would validate expected behaviour of models for data that isn't yet observed in production.
To do this, DBT would need to provide the ability to run models on a set of static inputs which could either be created at query-time or ahead of time.
We prototyped a solution where users encode static input in CSV files, configure a 'tests.yml' file that provides mappings between source/ref models and CSV files, as well as specifying an expected output (also encoded as a CSV file). Our framework then generated a query that created a CTE for each static input, a CTE that represented the model being tested (replacing source/ref macros with the static input CTE names), and lastly ran a diff between the expected model and the model generated using static inputs. This generated query was then fed to dbt test - if the diff returned 0 results, the test would pass.
Feedback from data scientists was that encoding static inputs in CSV files was cumbersome, readability of tests was poor because of the many disparate files representing a test case, and flexibility to programmatically encode static inputs and write custom expectations beyond equality was also desired.
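A rough sketch of the query generation described in the prototype above (one CTE per static input, one CTE for the model, then a diff against the expected output) might look like this. It is not the prototype's code: the helper names, the example model, and the use of SQLite instead of a warehouse are assumptions, and inputs are given as Python tuples rather than CSV files.

```python
import sqlite3


def values_cte(name, columns, rows):
    """Render one static input as a CTE built from SELECT ... UNION ALL."""
    selects = [
        "SELECT " + ", ".join(f"{value!r} AS {col}" for col, value in zip(columns, row))
        for row in rows
    ]
    return f"{name} AS ({' UNION ALL '.join(selects)})"


def build_test_query(inputs, model_name, model_sql, expected):
    """Build one query whose result is the diff between model and expected."""
    ctes = [values_cte(name, cols, rows) for name, (cols, rows) in inputs.items()]
    # Point each ref at the static input CTE of the same name.
    for ref_name in inputs:
        model_sql = model_sql.replace("{{ ref('" + ref_name + "') }}", ref_name)
    ctes.append(f"{model_name} AS ({model_sql})")
    ctes.append(values_cte("expected", *expected))
    diff = (
        f"SELECT * FROM (SELECT * FROM {model_name} EXCEPT SELECT * FROM expected) "
        "UNION ALL "
        f"SELECT * FROM (SELECT * FROM expected EXCEPT SELECT * FROM {model_name})"
    )
    return "WITH " + ", ".join(ctes) + " " + diff


inputs = {"orders": (["customer", "amount"], [(1, 10), (1, 5), (2, 7)])}
model_sql = (
    "SELECT customer, SUM(amount) AS total FROM {{ ref('orders') }} GROUP BY customer"
)
conn = sqlite3.connect(":memory:")

passing = build_test_query(
    inputs, "order_totals", model_sql, (["customer", "total"], [(1, 15), (2, 7)])
)
failing = build_test_query(
    inputs, "order_totals", model_sql, (["customer", "total"], [(1, 15), (2, 8)])
)
passing_diff = conn.execute(passing).fetchall()
failing_diff = conn.execute(failing).fetchall()
```

An empty diff means the test passes. A non-empty diff lists the rows that appear on only one side, which gives a reviewer a starting point for debugging.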
Wondering if other DBT users have tried to achieve something similar, and how the community feels it's best to approach unit testing in DBT.
Describe alternatives you've considered
We have considered running DBT's built-in data tests and running them on a small sample of production data locally. However, creating a representative sample of data for all edge cases for all downstream models is a challenging task and also bad practice - unit tests should have a single reason to fail. Creating many small tables representing individual test cases could be done to counter this but our main concern was where/how these static datasets were encoded - if they are in separate (let's say CSV) files, this creates a readability issue where reviewers / users have to jump between multiple files to understand a test case.
Another more general issue with this approach is that writing assertions for unit tests feels quite unnatural in SQL - it's tricky even to get the right semantics for an equality check.
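One concrete example of why equality is tricky: a diff built on EXCEPT compares sets of rows, so it cannot see an unexpected duplicate. The sketch below uses SQLite with made-up tables named actual and expected, and compares the plain set diff with a diff over per-row counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE actual (id INTEGER, label TEXT);
    CREATE TABLE expected (id INTEGER, label TEXT);
    -- The row (1, 'a') appears twice in actual but once in expected.
    INSERT INTO actual VALUES (1, 'a'), (1, 'a'), (2, 'b');
    INSERT INTO expected VALUES (1, 'a'), (2, 'b');
    """
)


def symmetric_diff(left_sql, right_sql):
    """Rows present on one side only, using set-based EXCEPT in both directions."""
    return (
        f"SELECT * FROM ({left_sql} EXCEPT {right_sql}) "
        f"UNION ALL SELECT * FROM ({right_sql} EXCEPT {left_sql})"
    )


# Set semantics: duplicates are collapsed before comparison.
set_diff = conn.execute(
    symmetric_diff("SELECT * FROM actual", "SELECT * FROM expected")
).fetchall()

# Multiset semantics: compare each distinct row together with its count.
counted = "SELECT id, label, COUNT(*) AS n FROM {table} GROUP BY id, label"
bag_diff = conn.execute(
    symmetric_diff(counted.format(table="actual"), counted.format(table="expected"))
).fetchall()
```

Here set_diff is empty even though the tables differ, while bag_diff reports the mismatching counts for the duplicated row.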
Additional context
There are definitely aspects of this that are database-specific. For example, in BigQuery, we can create static inputs as CTEs using ARRAYs of STRUCT types. Other databases may need a different syntax or have a preferred method of creating static data for testing. In addition, to create static inputs in BigQuery as ARRAYs of STRUCTs, the data type of each column needs to be specified.
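As an illustration of the BigQuery case, a static input could be rendered as a typed ARRAY of STRUCTs and unnested into rows. The helper below is hypothetical and only builds the SQL string; it assumes at least two columns and values that are Python ints or simple strings without quotes in them, and does not attempt general literal escaping.

```python
def bigquery_static_input(columns, rows):
    """Render rows as a BigQuery UNNEST over a typed ARRAY of STRUCTs.

    columns: list of (name, BigQuery type) pairs, e.g. [("id", "INT64")].
    rows: list of tuples of ints or simple strings (assumption: no escaping needed).
    """
    struct_type = ", ".join(f"{name} {bq_type}" for name, bq_type in columns)
    literals = ", ".join(
        "(" + ", ".join(repr(value) for value in row) + ")" for row in rows
    )
    return f"SELECT * FROM UNNEST(ARRAY<STRUCT<{struct_type}>>[{literals}])"


cte_body = bigquery_static_input(
    [("id", "INT64"), ("name", "STRING")],
    [(1, "x"), (2, "y")],
)
```

The explicit column types in the STRUCT declaration are the part that has to be supplied by whoever writes the test, which is the extra burden mentioned above.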
Who will this benefit?
I think all DBT users would benefit, especially large organizations where there will be frequent updates and many collaborators for a single model. Unit testing will give users more confidence that the changes they are making will not break existing behaviour.