Implement BloomFilter skipping index building logic #242

Merged

@dai-chen dai-chen commented Feb 2, 2024

Description

This is the first PR for BloomFilter skipping index support. It focuses on the bloom filter building side and introduces the core classes shown below. Please read #206 for the big picture, including the final user experience, design decisions, proof of concept, and benchmarks.

Planned PRs

  1. Implement BloomFilter skipping index building logic #242 [Current]
  2. Implement query rewrite logic without pushdown
  3. Implement pushdown optimization using an OpenSearch Painless script
  4. Implement AdaptiveBloomFilter algorithm
  5. Support bloom filter type in Flint SQL

Documentation

Updated user manual: https://github.com/dai-chen/opensearch-spark/blob/add-bloom-filter-building-logic/docs/index.md#feature-highlights

Class Diagram

[Image: class diagram (Screenshot 2024-02-06 at 9 42 34 AM)]

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen added the enhancement and 0.2 labels Feb 2, 2024
@dai-chen dai-chen self-assigned this Feb 2, 2024
@dai-chen dai-chen marked this pull request as ready for review February 6, 2024 18:03
/**
 * Bloom filter interface inspired by [[org.apache.spark.util.sketch.BloomFilter]], adapted for
 * Flint index use with unnecessary APIs removed.
 */
public interface BloomFilter {
Member

Curious: aren't there any open-source library implementations we can leverage?
Is this because we need custom serialization to write to OpenSearch?

@dai-chen dai-chen Feb 7, 2024

Yes, indeed. As the Javadoc says, we're porting Spark's built-in BloomFilter, BitArray, and Murmur3_x86_32 to our flint-core library. We keep only the minimal API we need and implement it within flint-core, with the following future possibilities in mind:

  1. Integration with other query engines: we can implement a bloom filter index in other query engines using the flint-core library
  2. Users creating a Flint index with our library in an ingestion pipeline: we can add a BloomFilter field type so that users can generate a Flint index at ingestion time
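
For readers who want a concrete picture of what such a minimal, engine-independent bloom filter might look like, here is a rough Java sketch. It is not the code in this PR: the class name `SketchBloomFilter`, its method names, and the byte layout are all hypothetical, and a SplitMix64-style mixer stands in for the ported Murmur3_x86_32 hash. It only illustrates the general shape: a bit array, a couple of hash-derived bit positions per item, a merge operation, and a plain byte serialization that does not depend on Spark.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: names and byte layout are illustrative, not the flint-core API.
final class SketchBloomFilter {
    private final long[] words;  // the bit array, 64 bits per long
    private final int numHashes; // how many bit positions each item sets

    SketchBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        // The requested size is rounded up to a whole number of 64-bit words.
        this.words = new long[(numBits - 1) / 64 + 1];
        this.numHashes = numHashes;
    }

    private SketchBloomFilter(long[] words, int numHashes) {
        this.words = words;
        this.numHashes = numHashes;
    }

    // SplitMix64-style mixer; an assumption standing in for Murmur3.
    private static long mix(long x) {
        x += 0x9e3779b97f4a7c15L;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    // Position of the i-th bit for an item, combining the two 32-bit halves of one hash.
    private long bitIndex(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + (long) i * h2, (long) words.length * 64);
    }

    void put(long item) {
        long hash = mix(item);
        for (int i = 1; i <= numHashes; i++) {
            long bit = bitIndex(hash, i);
            words[(int) (bit >>> 6)] |= 1L << (int) (bit & 63);
        }
    }

    // False means the item was definitely never added; true means it may have been.
    boolean mightContain(long item) {
        long hash = mix(item);
        for (int i = 1; i <= numHashes; i++) {
            long bit = bitIndex(hash, i);
            if ((words[(int) (bit >>> 6)] & (1L << (int) (bit & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Union of two filters built with the same size and hash count.
    void merge(SketchBloomFilter other) {
        if (other.words.length != words.length || other.numHashes != numHashes) {
            throw new IllegalArgumentException("incompatible filters");
        }
        for (int i = 0; i < words.length; i++) {
            words[i] |= other.words[i];
        }
    }

    // Engine-independent byte form: hash count, word count, then the words.
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(2 * Integer.BYTES + words.length * Long.BYTES);
        buf.putInt(numHashes).putInt(words.length);
        for (long w : words) {
            buf.putLong(w);
        }
        return buf.array();
    }

    static SketchBloomFilter fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int numHashes = buf.getInt();
        long[] words = new long[buf.getInt()];
        for (int i = 0; i < words.length; i++) {
            words[i] = buf.getLong();
        }
        return new SketchBloomFilter(words, numHashes);
    }

    public static void main(String[] args) {
        SketchBloomFilter filter = new SketchBloomFilter(1024, 3);
        filter.put(42L);
        SketchBloomFilter restored = SketchBloomFilter.fromBytes(filter.toBytes());
        System.out.println("restored filter might contain 42: " + restored.mightContain(42L));
    }
}
```

Because the serialized form is just a byte array, a filter built in one place (for example an ingestion pipeline) could in principle be stored in a binary field and read back by a different engine, which is the kind of reuse the two points above have in mind.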

@dai-chen dai-chen merged commit b1d132e into opensearch-project:main Feb 7, 2024
4 checks passed
@dai-chen dai-chen deleted the add-bloom-filter-building-logic branch February 7, 2024 23:48