query rewrite for LogsTable skipping index #154
Signed-off-by: Sean Kao <seankao@amazon.com>
indexScan
  .filter(new Column(indexFilter.get))
  .select(FILE_PATH_COLUMN)
  .collect
collect is a reduce operation. @dai-chen, could you help Sean fix this?
Discussed with Sean that this requires changes in LogsTable. Similar to the Flint FileIndex implementation:
- It accepts the indexScan DataFrame instead of a resultSet
- It triggers the DataFrame collect at execution time
@seankao-az correct me if I understood wrong.
That's right. This needs changes on the LogsTable side. Right now LogsTable accepts a list of file ids. It should accept a DataFrame instead.
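The change discussed above can be sketched as follows. This is a minimal illustration rather than the actual LogsTable code: EagerLogsTable and DeferredLogsTable are hypothetical stand-ins (the real LogsTable interface is not shown in this PR), the toy index and its file_path and year columns are made up, and Spark SQL is assumed to be on the classpath. It shows a table that holds the index scan DataFrame and runs collect only when the file ids are first needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical stand-in: the caller has to collect before constructing the table.
final class EagerLogsTable(val fileIds: Seq[String])

// Hypothetical stand-in: keeps the index scan and defers the collect.
final class DeferredLogsTable(indexScan: DataFrame) {
  // lazy val: the Spark job runs on first access, not at construction time.
  lazy val fileIds: Seq[String] = indexScan.collect().map(_.getString(0)).toSeq
}

object DeferredCollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("deferred-collect-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy skipping index: one row per log file (assumed columns).
    val index = Seq(("f1", 2023), ("f2", 2024)).toDF("file_path", "year")
    val indexScan = index.filter(col("year") === 2024).select("file_path")

    val table = new DeferredLogsTable(indexScan) // no Spark job has run yet
    println(table.fileIds.mkString(","))         // the collect happens here
    spark.stop()
  }
}
```

With the eager variant, the collect would run while the query plan is still being rewritten. The deferred variant moves that cost to execution time, which is the behavior described for the Flint FileIndex implementation.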
Done. Integration tested locally together with changes from dependency package.
...ration/src/main/scala/org/opensearch/flint/spark/skipping/ApplyFlintSparkSkippingIndex.scala
Signed-off-by: Sean Kao <seankao@amazon.com>
2c33440 to deb5fe8
Description
Query rewrite for LogsTable skipping index
Build skipping index
The process for building the skipping index is unchanged, because the current method is already compatible with LogsTable.
Query rewrite for skipping index
At query plan optimization time, we construct a new LogsTable with the DataFrame that fetches log file ids from the skipping index. These log file ids are then used to build the scan operator.
Dependency
Added a compile-time dependency that contains only the LogsTable interface.
Test
Automated testing is not possible without loading the LogsConnectorSpark fat jar as a dependency. Instead, a manual integration test was done locally.
Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.