[ML] Use random sampler for aggregations for Data Visualizer document count chart #136124

Closed
qn895 opened this issue Jul 11, 2022 · 2 comments
Labels
Feature:File and Index Data Viz ML file and index data visualizer :ml v8.4.0

qn895 commented Jul 11, 2022

Describe the feature:

Currently, ML's Data Visualizer does not use any sampling when running the aggregations for the document count chart. For large indices, enabling the random sampler agg when appropriate would help considerably. To enable the new sampling method, we need to:

  • Determine whether the random sampler is appropriate to use (e.g. only use it for indices/queries with more than 1 million hits) to ensure a sufficient sample size.
  • Account for the speed improvement over vanilla aggregation without sampling (e.g. opt for vanilla aggregation for queries matching fewer than 10 million docs).
  • Clearly indicate that both the total document count and the chart itself are approximate when the random sampler is used.
  • Reconcile the difference in the populated %. Currently we use total hit count / total document count to calculate the % of docs in which a field is populated.
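
For illustration, here is a rough sketch of how the chart's aggregation might be wrapped in Elasticsearch's `random_sampler` aggregation, falling back to the vanilla aggregation when the probability is 1. The function name `buildDocCountRequest` and the aggregation names `sampling` and `doc_count_over_time` are hypothetical and not taken from the Kibana code base.

```typescript
type Aggs = Record<string, unknown>;

// Hypothetical helper: builds a search request body for the document count chart.
// A probability of 1 means "no sampling", so the histogram is sent unwrapped.
function buildDocCountRequest(timeField: string, fixedInterval: string, probability: number) {
  const histogram: Aggs = {
    doc_count_over_time: {
      date_histogram: { field: timeField, fixed_interval: fixedInterval },
    },
  };

  const aggs: Aggs =
    probability === 1
      ? histogram
      : {
          // The sampled sub-aggregation results are approximate.
          sampling: { random_sampler: { probability }, aggs: histogram },
        };

  return { size: 0, aggs };
}
```

Keeping the unsampled path as a plain histogram means small indices pay no sampling overhead and return exact counts.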

Proposed approach:

  • First run a query with a low default probability of 0.0001. From this initial result (which averages around 120 ms), estimate the total number of docs and calculate the next appropriate probability.
  • If the estimated number of total docs < 10 million docs*, use a probability of 1 (i.e. no sampling at all).
  • If the estimated number of total docs >= 10 million docs*, use the closest calculated probability, show this value in the probability slider, and visually indicate that random sampling is in use.
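
The steps above could be sketched roughly as follows. The 0.0001 probe probability and the 10 million threshold come from the proposal; how the "closest probability" is calculated is not specified here, so `TARGET_SAMPLED_DOCS`, `PROBABILITY_STEPS`, and both function names are assumptions made purely for illustration. The sketch also assumes the caller passes in the number of docs the probe actually sampled.

```typescript
const INITIAL_PROBABILITY = 0.0001; // from the proposal
const SAMPLING_THRESHOLD_DOCS = 10_000_000; // from the proposal
const TARGET_SAMPLED_DOCS = 100_000; // assumption: desired sample size
// Assumption: the discrete values offered by the probability slider.
const PROBABILITY_STEPS = [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5];

// Scale the sampled count from the probe query back up to an estimated total.
function estimateTotalDocs(sampledDocCount: number, probability: number = INITIAL_PROBABILITY): number {
  return Math.round(sampledDocCount / probability);
}

// Pick 1 (no sampling) below the threshold, otherwise the slider step
// closest (on a log scale) to the probability that yields the target sample size.
function chooseProbability(estimatedTotalDocs: number): number {
  if (estimatedTotalDocs < SAMPLING_THRESHOLD_DOCS) {
    return 1;
  }
  const ideal = TARGET_SAMPLED_DOCS / estimatedTotalDocs;
  const distance = (p: number) => Math.abs(Math.log(p) - Math.log(ideal));
  return PROBABILITY_STEPS.reduce((best, step) => (distance(step) < distance(best) ? step : best));
}
```

For example, a probe that samples 500 docs implies roughly 5 million docs in total, so `chooseProbability(estimateTotalDocs(500))` falls below the threshold and returns 1.
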
@qn895 qn895 added :ml Feature:File and Index Data Viz ML file and index data visualizer v8.4.0 labels Jul 11, 2022
@qn895 qn895 self-assigned this Jul 11, 2022
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

@qn895 qn895 changed the title [ML] Use random sampler for aggregations for Data visualizer document count chart [ML] Use random sampler for aggregations for Data Visualizer document count chart Jul 11, 2022
qn895 commented Jul 27, 2022

Closing via #136150

@qn895 qn895 closed this as completed Jul 27, 2022