Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query rewrite for LogsTable skipping index #154

Merged
merged 2 commits into from
Nov 15, 2023

Conversation

seankao-az
Copy link
Collaborator

@seankao-az seankao-az commented Nov 14, 2023

Description

Query rewrite for LogsTable skipping index

Build skipping index

Process for building skipping index is unchanged due to compatibility of current method with LogsTable.

Query rewrite for skipping index

On query plan optimization time, we construct a new LogsTable with the DataFrame to fetch log file ids from skipping index. These log file ids are then used to build the scan operator.

Dependency

Added a compile time dependency which contains only the interface of LogsTable.

Test

Test is not possible without loading LogsConnectorSpark fat jar as dependency. Instead, manual integration test is done locally.

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az seankao-az changed the title query rewrite for nexus skipping index query rewrite for cloudwatch logs skipping index Nov 14, 2023
indexScan
.filter(new Column(indexFilter.get))
.select(FILE_PATH_COLUMN)
.collect
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect is reduce operation, @dai-chen could you help sean fix this

Copy link
Collaborator

@dai-chen dai-chen Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Sean that this requires changes in LogsTable.

Similar as Flint FileIndex implementation:

  1. It accepts indexScan DataFrame instead of result Set
  2. It triggers the data frame collect at execution time

@seankao-az correct me if I understood wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. Need changes in LogsTable side. Right now LogsTable accepts a list of file ids. Should let it accept DataFrame instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Integration tested locally together with changes from dependency package.

@seankao-az seankao-az changed the title query rewrite for cloudwatch logs skipping index query rewrite for LogsTable skipping index Nov 14, 2023
@dai-chen dai-chen added the enhancement New feature or request label Nov 14, 2023
Signed-off-by: Sean Kao <seankao@amazon.com>
@penghuo penghuo merged commit 0351f40 into opensearch-project:main Nov 15, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants