[RFC] Flint serverless skipping index #118
Proposal: Flint Serverless Skipping Index Building

Design Options

Next we dive into Option 3, the query-time on-demand build.

Example

Following the item proposed above, here is an example that illustrates the idea:

T-1: Query timestamp after 2023-05-01
T-2: Query timestamp after 2023-04-30, with new files 2023-05-05 and 2023-05-06
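To make the example concrete, below is a minimal sketch in plain Scala (not the actual Flint code) of how a query-time scan could fill in missing file-level min-max entries before applying the skipping predicate. The SkippingEntry type, the in-memory index map, and the collectMinMax helper are hypothetical placeholders for the real skipping index stored in OpenSearch.

```scala
import java.time.LocalDate

// Hypothetical file-level min-max entry for the timestamp column.
case class SkippingEntry(file: String, min: LocalDate, max: LocalDate)

object OnDemandSkippingExample {

  // Placeholder: a real implementation would read file metadata or run an
  // aggregate over the file to compute min/max for the indexed column.
  def collectMinMax(file: String): SkippingEntry = {
    val date = LocalDate.parse(file)
    SkippingEntry(file, date, date)
  }

  def main(args: Array[String]): Unit = {
    // Skipping index state at T-1: every file seen so far has an entry.
    var index: Map[String, SkippingEntry] =
      Seq("2023-04-29", "2023-05-01", "2023-05-02")
        .map(f => f -> collectMinMax(f)).toMap

    // T-2: two new files exist that are not yet covered by the index.
    val allFiles =
      Seq("2023-04-29", "2023-05-01", "2023-05-02", "2023-05-05", "2023-05-06")

    // On-demand build: collect skipping data only for uncovered files,
    // as part of the query rather than a standing streaming refresh job.
    val uncovered = allFiles.filterNot(index.contains)
    index ++= uncovered.map(f => f -> collectMinMax(f))

    // Apply the skipping predicate of the T-2 query: timestamp after 2023-04-30.
    val lowerBound = LocalDate.parse("2023-04-30")
    val filesToScan = allFiles.filter(f => index(f).max.isAfter(lowerBound))
    println(s"Files scanned at T-2: $filesToScan") // 2023-04-29 is skipped
  }
}
```

In this sketch, only the two new files pay the collection cost at T-2; files already covered at T-1 reuse their existing entries.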
Implementation Challenges

Design: Flint Serverless Skipping Index Storage

TODO

Design: Automatic Skipping Algorithm Selection

TODO

Reference

Delta table column stats

Delta collects column statistics automatically. However, it only collects min-max for numeric, date, and string columns. Probably because it stores data as Parquet (which already uses min-max, dictionary encoding, and Bloom filters), Delta only aggregates file-level min-max up to the Delta table level.

Hyperspace Analysis Utility

Hyperspace also provides an analysis utility to help users estimate the effectiveness of Z-Ordering before creation.
Is your feature request related to a problem?
When using a Flint skipping index today, the user first needs to create a Spark table and then has to decide which skipping data structure to use for each column. Afterwards, the freshness of the skipping index is maintained by a long-running Spark streaming job.
As a user, the pain points include these manual index design decisions and the overhead of keeping a dedicated refresh job running.
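For context, here is roughly what that workflow looks like today, assuming the Flint SQL DDL described in the project documentation (table name, columns, and chosen skip types are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object CurrentSkippingIndexWorkflow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flint-skipping-index-workflow")
      .getOrCreate()

    // The user must already have a Spark table (alb_logs here) and must decide
    // up front which skipping data structure fits each column.
    // auto_refresh = true starts a long-running Spark streaming job that keeps
    // the skipping index fresh, which is the maintenance cost called out above.
    spark.sql(
      """CREATE SKIPPING INDEX ON alb_logs
        |(
        |  year PARTITION,
        |  elb_status_code VALUE_SET,
        |  request_processing_time MIN_MAX
        |)
        |WITH (auto_refresh = true)
        |""".stripMargin)
  }
}
```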
What solution would you like?
Proposed ideas are below; each item needs a PoC:

1. Automatic skipping algorithm selection
   a. Develop a component that analyzes column characteristics such as data type, size, and cardinality, and automatically selects the most suitable skipping algorithm (a rough sketch follows this list)
   b. Implement a user-friendly option to enable or disable this feature, so the user only decides whether to turn it on (like Snowflake)
2. Query-time on-demand index build
   a. Rewrite the query plan to wrap the scan operator so that skipping index data is collected on demand
   b. Apply a similar mechanism to the current hybrid scan mode, where new files trigger skipping data collection as necessary
3. Serverless skipping index storage
   a. Skipping data is essentially an aggregated data structure and does not necessarily rely on Lucene
   b. Tier the skipping index storage (like Apache Iceberg) and keep hot tier-1 data in an OpenSearch index
   c. Or write the Flint index format directly during ingestion
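As a rough illustration of item 1a, here is a minimal sketch of what an automatic selection heuristic could look like. The ColumnProfile type, the thresholds, and the mapping from column characteristics to skipping structures are assumptions made for illustration, not the proposed design; the PoC would determine the actual rules.

```scala
// Hypothetical column profile gathered by an analysis step (not part of Flint today).
case class ColumnProfile(
    name: String,
    dataType: String,      // Spark SQL type name
    distinctCount: Long,   // estimated cardinality
    rowCount: Long,
    isPartitionColumn: Boolean)

// Candidate skipping data structures considered in this sketch.
sealed trait SkipType
case object Partition extends SkipType
case object ValueSet extends SkipType
case object MinMax extends SkipType
case object BloomFilter extends SkipType

object SkippingAlgorithmSelector {
  // Illustrative threshold only; a real component would tune this and likely
  // consider column size and query patterns as well.
  private val LowCardinality = 100L

  def select(profile: ColumnProfile): SkipType = profile match {
    case p if p.isPartitionColumn               => Partition
    case p if p.distinctCount <= LowCardinality => ValueSet
    case p if Set("int", "bigint", "double", "date", "timestamp").contains(p.dataType) => MinMax
    case _                                      => BloomFilter
  }
}

// Usage sketch:
// SkippingAlgorithmSelector.select(
//   ColumnProfile("elb_status_code", "int", 50, 1000000L, isPartitionColumn = false))
//   => ValueSet
```

With a component like this, the user-facing knob in item 1b reduces to a single enable/disable option, and the per-column decision moves out of the user's hands.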