Document Non-Deterministic Source Queries and Data Changing Sources #538

Open
osopardo1 opened this issue Jan 28, 2025 · 0 comments

osopardo1 commented Jan 28, 2025

As a first solution for #466, we need to require users to provide columnStats when indexing tables with any of the following characteristics:

  • Underlying data source changes constantly.
  • DataFrame contains non-deterministic columns to index.
  • DataFrame contains non-deterministic predicates.
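
To illustrate why these cases are a problem, here is a minimal, library-free sketch of the failure mode. It assumes a two-pass process (one load to infer min/max, a second load to scale and write). `ChangingSource` and `linear_scale` are hypothetical names used only for this illustration and are not part of Qbeast.

```python
class ChangingSource:
    """Hypothetical stand-in for a table whose underlying data changes between reads."""

    def __init__(self):
        self._rows = [10.0, 20.0, 30.0]

    def load(self):
        snapshot = list(self._rows)
        # Simulate a concurrent writer appending a larger value after each read.
        self._rows.append(max(self._rows) + 10.0)
        return snapshot


def linear_scale(value, lo, hi):
    """Min/max scaling: maps [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


source = ChangingSource()

# First load: infer the bounds from the data.
first_pass = source.load()
lo, hi = min(first_pass), max(first_pass)

# Second load: the source has changed, so the inferred bounds are stale.
second_pass = source.load()
scaled = [linear_scale(v, lo, hi) for v in second_pass]
out_of_range = [s for s in scaled if not 0.0 <= s <= 1.0]

print(scaled)        # [0.0, 0.5, 1.0, 1.5]
print(out_of_range)  # [1.5]
```

The value that arrived between the two loads scales to 1.5, outside the range implied by the bounds computed in the first load. A non-deterministic column or predicate has the same effect, because each evaluation of the DataFrame can yield different rows.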

There are several ways to make the process succeed:

  1. Add columnStats if you are using a default/linear transformation. Providing columnStats fixes the data's min/max values before the DataFrame analysis, which can otherwise produce inconsistent results when the DataFrame is loaded twice for indexing in any of the cases above.
  2. For versions packaged after main, you can change the transformation type of the indexed columns to quantiles, which is more flexible than the default/linear transformation: it is not bounded by min/max, so it is safe to write.
  3. Materialize the DataFrame before writing to Qbeast, either in memory if the data is small or in the file system.
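
As a rough sketch of option 1, the helper below builds a columnStats JSON string from bounds the user already knows. The `column_stats` helper, the column names, and the bounds are hypothetical. The `<column>_min` / `<column>_max` key naming and the `columnsToIndex` option name are assumptions here and should be checked against the Qbeast documentation for your version.

```python
import json


def column_stats(bounds):
    """Build a columnStats JSON string from {column: (min, max)}.

    Assumption: stats keys follow the "<column>_min" / "<column>_max" pattern.
    """
    stats = {}
    for column, (lo, hi) in bounds.items():
        if lo >= hi:
            raise ValueError(f"min must be below max for column {column!r}")
        stats[f"{column}_min"] = lo
        stats[f"{column}_max"] = hi
    return json.dumps(stats)


# Hypothetical columns and bounds, chosen wide enough to cover future data.
write_options = {
    "columnsToIndex": "price,ts",
    "columnStats": column_stats({"price": (0.0, 1000.0), "ts": (0, 2_000_000_000)}),
}

print(write_options["columnStats"])

# With PySpark, the options could then be passed to the writer, for example:
#   df.write.format("qbeast").options(**write_options).save(path)
```

Because the bounds are supplied up front, they do not depend on what the source returns on any particular load. For option 3, a common Spark approach is to write the DataFrame to a stable location (for example Parquet) and read it back before indexing. `df.persist()` followed by an action also works for small data, with the caveat that evicted partitions may be recomputed.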

These options should be documented in an FAQ.
