[WIP] Add timeseries / observability benchmarks #9017
Closed
Which issue does this PR close?
Closes #8791
Rationale for this change
Several DataFusion users (such as IOx) store observability data, which is often characterized by low-cardinality string data encoded as dictionaries. While the current parquet_filter pushdown benchmarks (TODO LINK) cover this case, we don't have an end-to-end test that does.
This has caused problems when we have made changes, such as #7647, that should improve the performance of these queries: we had no reproducible way to measure the impact and could not evaluate whether the change was beneficial enough to warrant the additional code complexity.
In systems such as IOx, the data is very often sorted, and the sort order is quite important for performance. However, DataFusion's existing benchmark coverage does not include any pre-sorted data.
What changes are included in this PR?
bench.sh
along with several queries
Are these changes tested?
All tests
Are there any user-facing changes?
No
TODO