Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Iceberg query performance with covering index support #325

Conversation

dai-chen
Copy link
Collaborator

@dai-chen dai-chen commented Apr 30, 2024

Description

PR #318 disabled covering index IT for Iceberg temporarily. This PR adds the missing support and re-enable the IT.

PR Planned

Changes

Added new FlintSparkSourceRelation and FlintSparkSourceRelationProvider abstraction. See Scala doc for its responsibility. Will refactor ApplyFlintSparkSkippingIndex and FlintSparkValidationHelper.isTableProviderSupported based on these later.

Screenshot 2024-04-30 at 11 55 56 AM

Testing

Spark Table

spark-sql> CREATE INDEX all ON myglue.ds_tables.http_logs
         > (
         >   `@timestamp`,
         >   clientip,
         >   request,
         >   status,
         >   size
         > );

scala> sc.setLogLevel("INFO")
scala> sql("EXPLAIN SELECT clientip FROM myglue.ds_tables.http_logs WHERE status != 200").show

# Logging explains whether and why the index is applied
24/05/03 17:51:17 INFO FlintSparkSourceRelationProvider: Loaded source relation providers [file]
24/05/03 17:51:17 INFO ApplyFlintSparkCoveringIndex: Provider [file] can match sub plan LogicalRelation
24/05/03 17:51:18 INFO ApplyFlintSparkCoveringIndex: Found covering index 
[flint_myglue_ds_tables_http_logs_all_index] on table myglue.ds_tables.http_logs
24/05/03 17:51:18 INFO ApplyFlintSparkCoveringIndex:
 Is covering index flint_myglue_ds_tables_http_logs_all_index applicable: true
   Index state: Some(active)
   Index filter condition: None
   Columns required: Set(clientip, status)
   Columns indexed: Set(@timestamp, request, size, clientip, status)

Iceberg Table

With Iceberg runtime, create Iceberg table and covering index successfully:

$ spark-sql --jars /home/hadoop/flint-spark-integration-assembly-0.4.0-SNAPSHOT.jar,/home/hadoop/sql-job-assembly-0.1.0-SNAPSHOT.jar,/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.opensearch.flint.spark.FlintSparkExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://.../iceberg-warehouse \
...

CREATE DATABASE myglue.iceberg;

spark-sql> CREATE TABLE myglue.iceberg.http_logs_iceberg (
         >   `@timestamp` TIMESTAMP,
         >   clientip STRING,
         >   request STRING,
         >   status INT,
         >   size INT
         > )
         > USING iceberg
         > LOCATION 's3://daichen-benchmark/iceberg-warehouse/httplogs_iceberg';
Time taken: 1.688 seconds

Test query acceleration with covering index:

spark-sql> EXPLAIN SELECT clientip FROM myglue.iceberg.http_logs_iceberg WHERE status != 200;
== Physical Plan ==
*(1) Project [clientip#39]
+- BatchScan[@timestamp#38, request#40, size#42, clientip#39, status#41]
class org.apache.spark.sql.flint.FlintScan, PushedPredicates: [status IS NOT NULL, NOT (status = 200)]
RuntimeFilters: []

Issues Resolved

#298

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen added feature New feature 0.4 labels Apr 30, 2024
@dai-chen dai-chen self-assigned this Apr 30, 2024
@dai-chen dai-chen changed the title Support covering index acceleration for Iceberg query Enhance Iceberg query performance with covering index support Apr 30, 2024
dai-chen added 4 commits May 1, 2024 11:04
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen marked this pull request as ready for review May 3, 2024 23:16
@dai-chen dai-chen added 0.5 and removed 0.4 labels May 20, 2024
@dai-chen
Copy link
Collaborator Author

Will publish new PR with refactoring changes unrelated to Iceberg alone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.5 feature New feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant