feat: Direct iceberg table reading #5880
base: main
Conversation
This is missing documentation, as I want to make sure there's some agreement on the interfaces before proceeding.
This is very interesting and would be a great and fast way to load static Iceberg tables.
@NotNull final Schema schema,
@NotNull final org.apache.iceberg.Table table,
@NotNull final IcebergInstructions userInstructions) {
    return TableTools.newTable(tableDefinition(schema, table, userInstructions, -1));
This works for static tables, but for refreshing tables I believe we'll need to return PartitionAwareSourceTable with zero partitions (or the IcebergTable equivalent) in order to populate data with discovered files. A minimal sketch of that suggestion follows.
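A minimal sketch of this suggestion, assuming the PartitionAwareSourceTable constructor shape from deephaven-core internals; the helper name and collaborator parameters are illustrative, not this PR's code:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.engine.table.TableDefinition;
import io.deephaven.engine.table.impl.PartitionAwareSourceTable;
import io.deephaven.engine.table.impl.SourceTableComponentFactory;
import io.deephaven.engine.table.impl.locations.TableLocationProvider;
import io.deephaven.engine.updategraph.UpdateSourceRegistrar;
import io.deephaven.engine.util.TableTools;

// Sketch only: the PartitionAwareSourceTable constructor shape is assumed from
// deephaven-core internals; the helper name and wiring are illustrative.
static Table makeIcebergTable(
        final TableDefinition definition,
        final boolean refreshing,
        final SourceTableComponentFactory componentFactory,
        final TableLocationProvider locationProvider,
        final UpdateSourceRegistrar registrar) {
    if (!refreshing) {
        // Static case, as in this PR: an empty in-memory table with the derived definition.
        return TableTools.newTable(definition);
    }
    // Refreshing case: start with zero partitions; rows are populated as the
    // location provider discovers new Iceberg data files.
    return new PartitionAwareSourceTable(
            definition, "IcebergTable", componentFactory, locationProvider, registrar);
}
```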
// final HadoopTables tables = new HadoopTables(hadoopConf);
// final org.apache.iceberg.Table table = tables.load(uri);
Leftover debugging code?
IcebergInstructions instructions,
Map<String, String> properties,
Configuration hadoopConf) {
    // final HadoopTables tables = new HadoopTables(hadoopConf);
Leftover debug code...
These adapters are interesting; in #5754 we are creating data instruction objects (S3Instructions etc.) from the properties maps. This is doing the inverse, right?
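For illustration, a hedged sketch of the two directions; the method names are hypothetical, while `client.region` and `s3.endpoint` are standard Iceberg FileIO property keys:

```java
import java.util.HashMap;
import java.util.Map;

import io.deephaven.extensions.s3.S3Instructions;

// #5754 direction: derive a data-instructions object from an Iceberg properties map.
static S3Instructions fromProperties(final Map<String, String> properties) {
    return S3Instructions.builder()
            .regionName(properties.get("client.region"))
            .build();
}

// The inverse direction discussed here: derive a properties map suitable for
// Iceberg's FileIO from already-configured instruction values.
static Map<String, String> toProperties(final String region, final String endpoint) {
    final Map<String, String> properties = new HashMap<>();
    properties.put("client.region", region);
    properties.put("s3.endpoint", endpoint);
    return properties;
}
```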
This is partially related to #5868, at least for providing a refactoring of the TableDefinition logic and exposing it to end users for the static entrypoints.
This adds support for reading Iceberg tables directly from a specific metadata file without the need for a catalog (although a catalog may be present).
At a minimum, this should be a very useful tool for debugging Iceberg issues. In some cases, it may be the best way to read Iceberg data, as a catalog may not be supported. For example, the ClickHouse Iceberg integration uses direct access without catalog support (no table name, no namespace, etc.):
https://clickhouse.com/blog/exploring-global-internet-speeds-with-apache-iceberg-clickhouse
https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg
With this PR, the equivalent in Deephaven would be:
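As a hypothetical sketch of that call (the `readStatic` entrypoint name and the exact metadata file path are assumptions, not necessarily the PR's actual API):

```java
import io.deephaven.engine.table.Table;

// Hypothetical: a static read of an Iceberg table from one metadata file,
// no catalog involved. The entry point and path are illustrative.
final Table ookla = IcebergTools.readStatic(
        "s3://datasets-documentation/ookla/iceberg/metadata/v2.metadata.json");
```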
(For ease of use, a slightly more verbose version works out-of-the-box without relying on implicit AWS credentials:)
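Again as a sketch: `S3Instructions` and `Credentials` are real Deephaven classes, but the wiring into the hypothetical `readStatic` overload is assumed; anonymous credentials avoid relying on an implicit AWS credential chain for this public bucket:

```java
import io.deephaven.engine.table.Table;
import io.deephaven.extensions.s3.Credentials;
import io.deephaven.extensions.s3.S3Instructions;

// Explicit region and anonymous credentials: works out-of-the-box against the
// public bucket without any implicit AWS credentials.
final S3Instructions s3Instructions = S3Instructions.builder()
        .regionName("us-east-1") // assumed region for the public dataset bucket
        .credentials(Credentials.anonymous())
        .build();

// Hypothetical overload accepting data instructions alongside the metadata path.
final Table ookla = IcebergTools.readStatic(
        "s3://datasets-documentation/ookla/iceberg/metadata/v2.metadata.json",
        s3Instructions);
```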
There's potential to extend this support to point at the root of the table location (as ClickHouse supports), i.e., s3://datasets-documentation/ookla/iceberg/, as opposed to a specific metadata file, but that would take some additional logic.