[WIP] Add timeseries / observability benchmarks #9017

alamb · 2024-01-27T09:39:37Z

Which issue does this PR close?

Rationale for this change

Several DataFusion users (such as IOx) store observability data, which is often characterized by low cardinality string data encoded as dictionaries. While the current parquet_filter pushdown benchmarks (TODO LINK) cover this case, we don't have an end-to-end test that does.
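For readers unfamiliar with dictionary encoding, here is a minimal, illustrative sketch of the idea in plain Python. It is not DataFusion or Arrow code; the function names and the sample `hosts` column are hypothetical and exist only for this example. Each distinct string is stored once, and every row holds a small integer key instead of the full string.

```python
def dictionary_encode(values):
    """Return (dictionary, keys).

    dictionary: the distinct values, in first-seen order.
    keys: one integer per input row, indexing into dictionary.
    """
    dictionary = []
    index = {}  # value -> position in dictionary
    keys = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        keys.append(index[value])
    return dictionary, keys


def dictionary_decode(dictionary, keys):
    """Rebuild the original column from the dictionary and the keys."""
    return [dictionary[key] for key in keys]


# Hypothetical low cardinality column: many rows, few distinct values.
hosts = ["web-1", "web-2", "web-1", "web-1", "web-2", "web-3"]
dictionary, keys = dictionary_encode(hosts)
decoded = dictionary_decode(dictionary, keys)
```

With only a handful of distinct values, the dictionary stays tiny no matter how many rows there are, which is why this encoding suits tag-like columns such as host or service names.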

This has caused problems when we have made changes such as #7647 that should improve the performance of these queries: we had no reproducible way to measure the impact, and couldn't evaluate whether the change was beneficial enough to warrant the additional code complexity.

In addition, in systems such as IOx the data is very often sorted, and the sort order is quite important for performance. However, DataFusion's existing benchmark coverage does not include any pre-sorted data.

What changes are included in this PR?

Add a DataFusion-specific data set to model common patterns in timeseries data -- HTTP access logs / metrics and tracing data specifically. This uses the same generator that is used in several other parts of DataFusion
Add a XXX benchmark to dfbench, along with several queries, runnable by bench.sh
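To give a rough idea of the shape of data this targets, here is a small sketch in plain Python. It is not the generator used in this PR. The column names (`service`, `method`, `status`, `time`), the value lists, and the sort key are all assumptions made for illustration. It produces rows with a few low cardinality string columns and sorts them by tag and then time.

```python
import random


def generate_access_logs(num_rows, seed=0):
    """Generate hypothetical access-log rows, sorted by (service, time)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    services = ["frontend", "backend", "database"]  # assumed tag values
    methods = ["GET", "POST", "PUT"]
    statuses = [200, 200, 200, 404, 500]  # skewed toward success

    rows = []
    timestamp_ms = 1_700_000_000_000  # arbitrary starting point
    for _ in range(num_rows):
        timestamp_ms += rng.randint(1, 1000)  # strictly increasing time
        rows.append(
            {
                "service": rng.choice(services),
                "method": rng.choice(methods),
                "status": rng.choice(statuses),
                "time": timestamp_ms,
            }
        )

    # Pre-sort the data, as systems such as IOx commonly do.
    rows.sort(key=lambda row: (row["service"], row["time"]))
    return rows


rows = generate_access_logs(100)
```

A fixed seed keeps the data identical between runs, which matters when comparing benchmark results before and after a change.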

Are these changes tested?

All tests

Are there any user-facing changes?

No

TODO

add ticket / extend to model logging data as well

github-actions · 2024-04-13T01:40:08Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

alamb added 8 commits January 26, 2024 07:31

Add observability_bench

efa34a6

checkpoint

4087cec

work in

38f0119

use dataframe api

64c1c98

Updates

b15de28

parallelize

b96661f

update

d14c224

tweak

b6e064e

github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024

github-actions bot closed this Apr 20, 2024
