[WIP] Add timeseries / observability benchmarks #9017

Closed
wants to merge 8 commits into from

@alamb alamb commented Jan 27, 2024

Which issue does this PR close?

Closes #8791

Rationale for this change

Several DataFusion users (like IOx) store observability data, which is often characterized by low cardinality string data encoded as dictionaries. While the current parquet_filter pushdown benchmarks (TODO LINK) cover this pattern, we don't have an end to end test that does.
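To make "low cardinality string data encoded as dictionaries" concrete, here is a minimal std-only sketch. It is not Arrow's `DictionaryArray` and not code from this PR; `DictColumn`, `dict_encode`, and `count_by_key` are hypothetical names used only for illustration. The idea is that each distinct string is stored once and every row holds a small integer key, so grouping can work on the keys.

```rust
use std::collections::HashMap;

/// A simplified dictionary-encoded string column (illustrative only):
/// each distinct value is stored once in `values`, and every row holds
/// an integer key that indexes into `values`.
#[derive(Debug, Default)]
pub struct DictColumn {
    pub values: Vec<String>,
    pub keys: Vec<u32>,
}

/// Dictionary-encode `rows`, assigning keys in order of first appearance.
pub fn dict_encode(rows: &[&str]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut col = DictColumn::default();
    for &row in rows {
        // Key that a not-yet-seen value would receive.
        let next = col.values.len() as u32;
        let key = *index.entry(row).or_insert_with(|| {
            col.values.push(row.to_string());
            next
        });
        col.keys.push(key);
    }
    col
}

/// Count rows per distinct value by indexing on the integer keys, so no
/// string is hashed or compared during the aggregation loop itself.
pub fn count_by_key(col: &DictColumn) -> Vec<(String, usize)> {
    let mut counts = vec![0usize; col.values.len()];
    for &k in &col.keys {
        counts[k as usize] += 1;
    }
    col.values.iter().cloned().zip(counts).collect()
}
```

With only a handful of distinct values (HTTP methods, host names, service names), `values` stays tiny while `keys` grows with the row count, which is the shape of data this benchmark is meant to exercise.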

This has caused problems when we have made changes such as #7647 that should improve the performance of these queries: we had no reproducible way to measure the impact, and couldn't evaluate whether the change was beneficial enough to warrant the additional code complexity.

In systems such as IOx, the data is very often sorted, and the sort order is quite important for performance. However, DataFusion's existing benchmark coverage does not include any pre-sorted data.
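One general reason sort order matters is that grouping over input already sorted by the group key can be done in a single streaming pass, with no hash table. The sketch below illustrates that idea only. `count_sorted_runs` is a hypothetical helper, not DataFusion's aggregation code, and it assumes its input really is sorted.

```rust
/// Count rows per key for input that is ALREADY sorted by key.
/// Equal keys are adjacent, so only the most recent group needs to be
/// kept "open"; no hash table over all keys is required.
pub fn count_sorted_runs<'a>(sorted_keys: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut out: Vec<(&'a str, usize)> = Vec::new();
    for &key in sorted_keys {
        // Extend the current run if the key matches the last group seen.
        if let Some((last, n)) = out.last_mut() {
            if *last == key {
                *n += 1;
                continue;
            }
        }
        // Otherwise the previous group is finished; start a new one.
        out.push((key, 1));
    }
    out
}
```

On unsorted input the same function would emit a key more than once, which is why a benchmark needs genuinely pre-sorted data to measure this kind of code path.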

What changes are included in this PR?

  1. Add a DataFusion-specific data set to model common patterns in timeseries data, specifically http access logs / metrics and tracing data. This uses the same generator as used in several other parts of DataFusion
  2. Add a XXX benchmark to dfbench, runnable by bench.sh, along with several queries
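As a rough picture of what such a data set might look like, the following std-only sketch generates deterministic, pre-sorted access-log rows with low cardinality string columns. It is an assumption-laden illustration, not the generator this PR uses: the `AccessLogRow` fields, the value lists, and `generate_rows` are all hypothetical.

```rust
/// One row of a hypothetical HTTP access log (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct AccessLogRow {
    pub timestamp_ns: i64,
    pub service: &'static str,
    pub method: &'static str,
    pub status: u16,
}

// Small, fixed value sets give the low cardinality typical of such data.
const SERVICES: [&str; 3] = ["frontend", "backend", "db"];
const METHODS: [&str; 3] = ["GET", "POST", "PUT"];
const STATUSES: [u16; 4] = [200, 204, 404, 500];

/// Generate `n` rows deterministically from `seed`, sorted by
/// (service, timestamp) to mimic pre-sorted storage.
pub fn generate_rows(n: usize, seed: u64) -> Vec<AccessLogRow> {
    let mut state = seed;
    // Tiny linear congruential generator so the sketch needs no external crates.
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) as usize
    };
    let mut rows: Vec<AccessLogRow> = (0..n)
        .map(|i| AccessLogRow {
            timestamp_ns: i as i64 * 1_000_000,
            service: SERVICES[next() % SERVICES.len()],
            method: METHODS[next() % METHODS.len()],
            status: STATUSES[next() % STATUSES.len()],
        })
        .collect();
    rows.sort_by(|a, b| (a.service, a.timestamp_ns).cmp(&(b.service, b.timestamp_ns)));
    rows
}
```

A fixed seed keeps runs reproducible, which is the property the rationale above asks for when comparing the performance impact of a change.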

Are these changes tested?

All tests

Are there any user-facing changes?

No

TODO

  • add ticket / extend to model logging data as well

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024
@github-actions github-actions bot closed this Apr 20, 2024
Development

Successfully merging this pull request may close these issues.

Add test coverage for grouping on dictionary encoded columns