-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Iceberg] Add table and session property for split size #24417
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -357,9 +357,9 @@ connector using a WITH clause: | |
|
||
The following table properties are available, which are specific to the Presto Iceberg connector: | ||
|
||
======================================= =============================================================== ============ | ||
======================================= =============================================================== ========================= | ||
Property Name Description Default | ||
======================================= =============================================================== ============ | ||
======================================= =============================================================== ========================= | ||
``format`` Optionally specifies the format of table data files, ``PARQUET`` | ||
either ``PARQUET`` or ``ORC``. | ||
|
||
|
@@ -388,7 +388,11 @@ Property Name Description | |
|
||
``metrics_max_inferred_column`` Optionally specifies the maximum number of columns for which ``100`` | ||
metrics are collected. | ||
======================================= =============================================================== ============ | ||
|
||
``read.split.target-size`` The target size for an individual split when generating splits ``134217728`` (128MB) | ||
for a table scan. Generated splits may still be larger or | ||
smaller than this value. Must be specified in bytes. | ||
======================================= =============================================================== ========================= | ||
|
||
The table definition below specifies format ``ORC``, partitioning by columns ``c1`` and ``c2``, | ||
and a file system location of ``s3://test_bucket/test_schema/test_table``: | ||
|
@@ -421,6 +425,9 @@ Property Name Description | |
``iceberg.rows_for_metadata_optimization_threshold`` Overrides the behavior of the connector property | ||
``iceberg.rows-for-metadata-optimization-threshold`` in the current | ||
session. | ||
``iceberg.target_split_size`` Overrides the target split size for all tables in a query in bytes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In hive, the counter part is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Iceberg and Hive function differently. Splits are not calculated in the same way. As such, the target size property does not refer to the absolute maximum size of the split. Either way, these are implementation details. If the property meant "max split size" it would be documented as such. This property does not represent the max split size and does not guarantee that splits will be below this size There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok. In this case, can we add some clarification here: "Unlike hive.max_split_size, this can be smaller or greater than the actual split size" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've added this to the description for the table property. The session property refers to the table property, so I think it should be clear enough |
||
Set to 0 to use the value in each Iceberg table's | ||
``read.split.target-size`` property. | ||
===================================================== ====================================================================== | ||
|
||
Caching Support | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems Presto convention does not use the prefixes from the Iceberg lib. E.g. "write.metadata.metrics.max-inferred-column-defaults" was just named "metrics_max_inferred_column" table property. While I do think the notion with prefixes is clearer, I think it's better to name it "split_target_size" for now. Please also add explanation here this correspond to
read.split.target-size
table property in Iceberg library.Later on, we can send a proposal to use full iceberg property names since it will affect the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd and I have discussed this in another PR and decided that moving forward for new properties we will use the iceberg property names. We will introduce backwards-compatible property names for the properties which were already introduced and slowly phase the old names out over the next few releases. I have filed this issue to address it: #24483 It includes the relevant context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that using Iceberg names is clearer. But ideally we should make changes for #24483 before directly using the new name. When and who will be working on it? Do you think you can send PR for that sooner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can try to get that PR out this week. I can definitely have it in soon, but since most users won't see any changes until the 292 release, I am in the camp that it would be fine to not hold up this PR to align the old property names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! It should be good as long as it's before the 292 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a draft PR for introducing the deprecation of table property names: #24581
It is still WIP. Just needs some tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Ping me when it's ready.