feat: Direct iceberg table reading #5880
base: main
Conversation
This is missing documentation, as I want to make sure there's some agreement on the interfaces before proceeding.
This is very interesting and would be a great and fast way to load static Iceberg tables.
@NotNull final Schema schema,
@NotNull final org.apache.iceberg.Table table,
@NotNull final IcebergInstructions userInstructions) {
    return TableTools.newTable(tableDefinition(schema, table, userInstructions, -1));
This works for static tables, but for refreshing tables I believe we'll need to return PartitionAwareSourceTable with zero partitions (or the IcebergTable equivalent) in order to populate data with discovered files. A minimal sketch of that suggestion follows.
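A minimal sketch of this suggestion, assuming the PartitionAwareSourceTable constructor shape from deephaven-core internals; the helper name and collaborator parameters are illustrative, not this PR's code:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.table.TableDefinition;
import io.deephaven.engine.table.impl.PartitionAwareSourceTable;
import io.deephaven.engine.table.impl.SourceTableComponentFactory;
import io.deephaven.engine.table.impl.locations.TableLocationProvider;
import io.deephaven.engine.updategraph.UpdateSourceRegistrar;
import io.deephaven.engine.util.TableTools;

// Sketch only: the PartitionAwareSourceTable constructor shape is assumed from
// deephaven-core internals; the helper name and wiring are illustrative.
static Table makeIcebergTable(
        final TableDefinition definition,
        final boolean refreshing,
        final SourceTableComponentFactory componentFactory,
        final TableLocationProvider locationProvider,
        final UpdateSourceRegistrar registrar) {
    if (!refreshing) {
        // Static case, as in this PR: an empty in-memory table with the derived definition.
        return TableTools.newTable(definition);
    }
    // Refreshing case: start with zero partitions; rows are populated as the
    // location provider discovers new Iceberg data files.
    return new PartitionAwareSourceTable(
            definition, "IcebergTable", componentFactory, locationProvider, registrar);
}
```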
// final HadoopTables tables = new HadoopTables(hadoopConf);
// final org.apache.iceberg.Table table = tables.load(uri);
Leftover debugging code?
IcebergInstructions instructions,
Map<String, String> properties,
Configuration hadoopConf) {
    // final HadoopTables tables = new HadoopTables(hadoopConf);
Leftover debug code...
These adapters are interesting; in #5754 we are creating data instruction objects (S3Instructions etc.) from the properties maps. This is doing the inverse, right?
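For illustration, a hedged sketch of the two directions; the method names are hypothetical, while `client.region` and `s3.endpoint` are standard Iceberg FileIO property keys:

```java
import java.util.HashMap;
import java.util.Map;

import io.deephaven.extensions.s3.S3Instructions;

// #5754 direction: derive a data-instructions object from an Iceberg properties map.
static S3Instructions fromProperties(final Map<String, String> properties) {
    return S3Instructions.builder()
            .regionName(properties.get("client.region"))
            .build();
}

// The inverse direction discussed here: derive a properties map suitable for
// Iceberg's FileIO from already-configured instruction values.
static Map<String, String> toProperties(final String region, final String endpoint) {
    final Map<String, String> properties = new HashMap<>();
    properties.put("client.region", region);
    properties.put("s3.endpoint", endpoint);
    return properties;
}
```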
This is partially related to #5868, at least for providing a refactoring of the TableDefinition logic and exposing it to end users for the static entrypoints.
This adds support for reading Iceberg tables directly from a specific metadata file without the need for a catalog (although a catalog may be present).
At a minimum, this should be a very useful tool for debugging Iceberg issues. In some cases, it may be the best way to read Iceberg data, as a catalog may not be supported. For example, the ClickHouse Iceberg integration uses direct access without catalog support (no table name, no namespace, etc.):
https://clickhouse.com/blog/exploring-global-internet-speeds-with-apache-iceberg-clickhouse
https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg
With this PR, the equivalent in Deephaven would be:
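As a hypothetical sketch of that call (the `readStatic` entrypoint name and the exact metadata file path are assumptions, not necessarily the PR's actual API):

```java
import io.deephaven.engine.table.Table;

// Hypothetical: a static read of an Iceberg table from one metadata file,
// no catalog involved. The entry point and path are illustrative.
final Table ookla = IcebergTools.readStatic(
        "s3://datasets-documentation/ookla/iceberg/metadata/v2.metadata.json");
```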
(For ease of use, a slightly more verbose version works out-of-the-box without relying on implicit AWS credentials:)
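Again as a sketch: `S3Instructions` and `Credentials` are real Deephaven classes, but the wiring into the hypothetical `readStatic` overload is assumed; anonymous credentials avoid relying on an implicit AWS credential chain for this public bucket:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.extensions.s3.Credentials;
import io.deephaven.extensions.s3.S3Instructions;

// Explicit region and anonymous credentials: works out-of-the-box against the
// public bucket without any implicit AWS credentials.
final S3Instructions s3Instructions = S3Instructions.builder()
        .regionName("us-east-1") // assumed region for the public dataset bucket
        .credentials(Credentials.anonymous())
        .build();

// Hypothetical overload accepting data instructions alongside the metadata path.
final Table ookla = IcebergTools.readStatic(
        "s3://datasets-documentation/ookla/iceberg/metadata/v2.metadata.json",
        s3Instructions);
```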
There's potential to extend this support to point at the root of the table location (as ClickHouse supports), i.e., s3://datasets-documentation/ookla/iceberg/, as opposed to a specific metadata file, but that would take some additional logic.