[RFC] Flint serverless skipping index #118
Proposal: Flint Serverless Skipping Index Building

Design Options

Next we dive into Option 3, the query-time on-demand build.

Example

Following the item proposed above, here is an example that illustrates the idea:

T-1: Query timestamp after 2023-05-01
T-2: Query timestamp after 2023-04-30, with new files 2023-05-05 and 2023-05-06
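To make the example concrete, below is a minimal sketch in plain Scala (not the actual Flint code) of how a query-time scan could fill in missing file-level min-max entries before applying the skipping predicate. The SkippingEntry type, the in-memory index map, and the collectMinMax helper are hypothetical placeholders for the real skipping index stored in OpenSearch.

```scala
import java.time.LocalDate

// Hypothetical file-level min-max entry for the timestamp column.
case class SkippingEntry(file: String, min: LocalDate, max: LocalDate)

object OnDemandSkippingExample {

  // Placeholder: a real implementation would read file metadata or run an
  // aggregate over the file to compute min/max for the indexed column.
  def collectMinMax(file: String): SkippingEntry = {
    val date = LocalDate.parse(file)
    SkippingEntry(file, date, date)
  }

  def main(args: Array[String]): Unit = {
    // Skipping index state at T-1: every file seen so far has an entry.
    var index: Map[String, SkippingEntry] =
      Seq("2023-04-29", "2023-05-01", "2023-05-02")
        .map(f => f -> collectMinMax(f)).toMap

    // T-2: two new files exist that are not yet covered by the index.
    val allFiles =
      Seq("2023-04-29", "2023-05-01", "2023-05-02", "2023-05-05", "2023-05-06")

    // On-demand build: collect skipping data only for uncovered files,
    // as part of the query rather than a standing streaming refresh job.
    val uncovered = allFiles.filterNot(index.contains)
    index ++= uncovered.map(f => f -> collectMinMax(f))

    // Apply the skipping predicate of the T-2 query: timestamp after 2023-04-30.
    val lowerBound = LocalDate.parse("2023-04-30")
    val filesToScan = allFiles.filter(f => index(f).max.isAfter(lowerBound))
    println(s"Files scanned at T-2: $filesToScan") // 2023-04-29 is skipped
  }
}
```

In this sketch, only the two new files pay the collection cost at T-2; files already covered at T-1 reuse their existing entries.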
Implementation Challenges

Design: Flint Serverless Skipping Index Storage

TODO

Design: Automatic Skipping Algorithm Selection

TODO

Reference

Delta table column stats

Delta collects column statistics automatically. However, it only collects min-max for numeric, date, and string columns. Probably because it stores data as Parquet (which already uses min-max, dictionary encoding, and Bloom filters), Delta only aggregates file-level min-max up to the Delta table level.

Hyperspace Analysis Utility

Hyperspace also provides an analysis utility to help users estimate the effectiveness of Z-Ordering before creation.
Is your feature request related to a problem?
When using a Flint skipping index today, the user first needs to create a Spark table and then has to decide which skipping data structure to use for each column. Afterwards, the freshness of the skipping index is maintained by a long-running Spark streaming job.
As a user, the pain points include these manual index design decisions and the overhead of keeping a dedicated refresh job running.
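For context, here is roughly what that workflow looks like today, assuming the Flint SQL DDL described in the project documentation (table name, columns, and chosen skip types are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object CurrentSkippingIndexWorkflow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flint-skipping-index-workflow")
      .getOrCreate()

    // The user must already have a Spark table (alb_logs here) and must decide
    // up front which skipping data structure fits each column.
    // auto_refresh = true starts a long-running Spark streaming job that keeps
    // the skipping index fresh, which is the maintenance cost called out above.
    spark.sql(
      """CREATE SKIPPING INDEX ON alb_logs
        |(
        |  year PARTITION,
        |  elb_status_code VALUE_SET,
        |  request_processing_time MIN_MAX
        |)
        |WITH (auto_refresh = true)
        |""".stripMargin)
  }
}
```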
What solution would you like?
Proposed ideas are below; each item needs a PoC:

1. Automatic skipping algorithm selection
   a. Develop a component that analyzes column characteristics such as data type, size, and cardinality, and automatically selects the most suitable skipping algorithm (a rough sketch follows this list)
   b. Implement a user-friendly option to enable or disable this feature, so the user only decides whether to turn it on (like Snowflake)
2. Query-time on-demand index build
   a. Rewrite the query plan to wrap the scan operator so that skipping index data is collected on demand
   b. Apply a similar mechanism to the current hybrid scan mode, where new files trigger skipping data collection as necessary
3. Serverless skipping index storage
   a. Skipping data is essentially an aggregated data structure and does not necessarily rely on Lucene
   b. Tier the skipping index storage (like Apache Iceberg) and keep hot tier-1 data in an OpenSearch index
   c. Or write the Flint index format directly during ingestion
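As a rough illustration of item 1a, here is a minimal sketch of what an automatic selection heuristic could look like. The ColumnProfile type, the thresholds, and the mapping from column characteristics to skipping structures are assumptions made for illustration, not the proposed design; the PoC would determine the actual rules.

```scala
// Hypothetical column profile gathered by an analysis step (not part of Flint today).
case class ColumnProfile(
    name: String,
    dataType: String,      // Spark SQL type name
    distinctCount: Long,   // estimated cardinality
    rowCount: Long,
    isPartitionColumn: Boolean)

// Candidate skipping data structures considered in this sketch.
sealed trait SkipType
case object Partition extends SkipType
case object ValueSet extends SkipType
case object MinMax extends SkipType
case object BloomFilter extends SkipType

object SkippingAlgorithmSelector {
  // Illustrative threshold only; a real component would tune this and likely
  // consider column size and query patterns as well.
  private val LowCardinality = 100L

  def select(profile: ColumnProfile): SkipType = profile match {
    case p if p.isPartitionColumn               => Partition
    case p if p.distinctCount <= LowCardinality => ValueSet
    case p if Set("int", "bigint", "double", "date", "timestamp").contains(p.dataType) => MinMax
    case _                                      => BloomFilter
  }
}

// Usage sketch:
// SkippingAlgorithmSelector.select(
//   ColumnProfile("elb_status_code", "int", 50, 1000000L, isPartitionColumn = false))
//   => ValueSet
```

With a component like this, the user-facing knob in item 1b reduces to a single enable/disable option, and the per-column decision moves out of the user's hands.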