Support HDFS and S3 path as warehouse directory for Hive file catalog #24660

@hantangwangd (Member) commented Mar 2, 2025

Description

The Hive file metastore is a file-based metastore intended for testing and development; its documentation was introduced in PR #24511. This PR enables the file-based Hive metastore to use HDFS or S3 locations as the warehouse directory, as described in issue #19112.

An example configuration to use HDFS path includes:

    connector.name=iceberg
    iceberg.catalog.type=HIVE
    hive.metastore=file
    hive.metastore.catalog.dir=hdfs://hostaddr:9000/warehouse

In addition, by configuring the S3 properties described in https://prestodb.io/docs/current/connector/hive.html#amazon-s3-configuration, an S3 location can be specified as the warehouse directory of the Hive file catalog. In that case, both the metadata and the data of Iceberg tables are stored on S3.

An example configuration to use S3 path includes:

    connector.name=iceberg
    iceberg.catalog.type=HIVE
    hive.metastore=file
    hive.metastore.catalog.dir=s3://iceberg_bucket/warehouse

    hive.s3.use-instance-credentials=false
    hive.s3.aws-access-key=accesskey
    hive.s3.aws-secret-key=secretkey
    hive.s3.endpoint=http://192.168.0.103:9878
    hive.s3.path-style-access=true
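
The two example configurations above differ mainly in the URI scheme of `hive.metastore.catalog.dir`. As a minimal sketch of that difference, the snippet below loads a catalog properties text with `java.util.Properties` and reports the scheme of the warehouse directory. The class `WarehouseDirCheck` and its method `warehouseScheme` are hypothetical helpers written for this illustration; they are not part of Presto and do not reflect how Presto itself reads the setting. Treating a value without a scheme as a local `file` path is likewise an assumption of this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Presto.
public class WarehouseDirCheck
{
    // Returns the URI scheme of hive.metastore.catalog.dir.
    // Assumption of this sketch: a value without a scheme is reported as "file".
    public static String warehouseScheme(String catalogProperties)
    {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(catalogProperties));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String dir = properties.getProperty("hive.metastore.catalog.dir");
        if (dir == null) {
            throw new IllegalArgumentException("hive.metastore.catalog.dir is not set");
        }
        // Only the scheme is inspected; the host or bucket part is not validated here.
        String scheme = URI.create(dir.trim()).getScheme();
        return scheme == null ? "file" : scheme;
    }

    public static void main(String[] args)
    {
        String hdfs = "hive.metastore=file\n"
                + "hive.metastore.catalog.dir=hdfs://hostaddr:9000/warehouse\n";
        String s3 = "hive.metastore=file\n"
                + "hive.metastore.catalog.dir=s3://iceberg_bucket/warehouse\n";
        System.out.println(warehouseScheme(hdfs));
        System.out.println(warehouseScheme(s3));
    }
}
```

A check like this could be run against a catalog file before starting a test cluster, to confirm which storage system the warehouse directory points at.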

Motivation and Context

Support specifying an HDFS or S3 location directly as the warehouse directory for the file-based Hive metastore.

Impact

Lakehouses configured with the Hive file metastore can now specify an HDFS or S3 location directly as the warehouse directory.

Test Plan

  • Manually tested IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries on the Iceberg connector configured with the Hive file catalog on locally deployed MinIO object storage
  • Manually tested IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries on the Iceberg connector configured with the Hive file catalog on locally deployed HDFS

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Enable the file-based Hive metastore to use an HDFS or S3 location as the warehouse directory.

@hantangwangd hantangwangd force-pushed the support_hdfs_and_s3_for_hive_file_metastore branch 2 times, most recently from 6bca326 to f49a0f0 Compare March 2, 2025 13:59
@hantangwangd hantangwangd marked this pull request as ready for review March 2, 2025 16:12
Contributor

@steveburnett steveburnett left a comment

Thanks for the doc update! Nits for readability suggested.

@hantangwangd hantangwangd force-pushed the support_hdfs_and_s3_for_hive_file_metastore branch from f49a0f0 to f3b07ae Compare March 3, 2025 17:24
@hantangwangd
Member Author

Thank you @steveburnett for the suggestion, fixed! Please take a look when available.

steveburnett previously approved these changes Mar 3, 2025
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

    @@ -1617,6 +1625,25 @@ public void testRegisterTableWithFileName()
            dropTable(getSession(), tableName);
        }

        protected HdfsEnvironment getHdfsEnvironment()
Contributor

nit: Ideally this should come after all public methods, but the class did not fully comply with that before either. We can leave it as is for now, but it would be nice to move it.

Member Author

Sure, it has been moved to the bottom.

            return IcebergDistributedTestBase.getHdfsEnvironment(hiveClientConfig, metastoreClientConfig, hiveS3Config);
        }

        public static String getMetadataFileLocation(ConnectorSession session, HdfsEnvironment hdfsEnvironment, String schema, String table, String metadataLocation)
Contributor

Can it be private? It's only used above

Member Author

Thanks for pointing this out, fixed! I have also moved this method to the bottom, since it is now private. Please take a look when convenient.

@hantangwangd hantangwangd force-pushed the support_hdfs_and_s3_for_hive_file_metastore branch from f3b07ae to 48a7b6d Compare March 6, 2025 09:04
@yingsu00 yingsu00 merged commit 348e991 into prestodb:master Mar 7, 2025
54 checks passed
@hantangwangd hantangwangd deleted the support_hdfs_and_s3_for_hive_file_metastore branch March 7, 2025 07:54