-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Iceberg] Add table and session property for split size #24417
[Iceberg] Add table and session property for split size #24417
Conversation
Thanks for the release note! Nit formatting suggestion.
Consider documenting the new session property, perhaps in |
d42702a
to
958ceec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull branch, local doc build, looks good. Thanks!
abe3332
to
4a07851
Compare
4a07851
to
a9f8332
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bring one thing for discussion. From a high level perspective, considering that a complex sql (like tpcds sqls) may involve multiple tables, is it a suitable way to set a unified session level target split size to override their own split size? Or should we respect each table's own split size (if exist) more than the session level uniform value?
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
Outdated
Show resolved
Hide resolved
...eberg/src/test/java/com/facebook/presto/iceberg/procedure/TestSetTablePropertyProcedure.java
Outdated
Show resolved
Hide resolved
a9f8332
to
04f8aaa
Compare
@@ -421,6 +424,9 @@ Property Name Description | |||
``iceberg.rows_for_metadata_optimization_threshold`` Overrides the behavior of the connector property | |||
``iceberg.rows-for-metadata-optimization-threshold`` in the current | |||
session. | |||
``iceberg.target_split_size`` Overrides the target split size for all tables in a query in bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In hive, the counter part is MAX_SPLIT_SIZE = "max_split_size";
. Can we rename this one the same as Hive? Introducing different names for the same thing would make the users confused. We could also name both as "target_split_size", and the good thing about the "target_split_size" name is that it conforms with the Iceberg library, but "hive.max_split_size" is more accurate that its split sizes cannot exceed this number. For Iceberg, it is also the max split size even though it's named as "target" split size. (See org/apache/iceberg/FixedSizeSplitScanTaskIterator.java)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg and Hive function differently. Splits are not calculated in the same way. As such, the target size property does not refer to the absolute maximum size of the split. FixedSizeSplitScanTaskIterator
only applies to files which don't support offsets. Parquet and ORC do support offsets, so they most likely won't use that class, but instead the OffsetsWareSplitScanTaskIterator
from the iceberg library.
Either way, these are implementation details. If the property meant "max split size" it would be documented as such. This property does not represent the max split size and does not guarantee that splits will be below this size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. In this case, can we add some clarification here: "Unlike hive.max_split_size, this can be smaller or greater than the actual split size"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this to the description for the table property. The session property refers to the table property, so I think it should be clear enough
@@ -388,7 +388,10 @@ Property Name Description | |||
|
|||
``metrics_max_inferred_column`` Optionally specifies the maximum number of columns for which ``100`` | |||
metrics are collected. | |||
======================================= =============================================================== ============ | |||
|
|||
``read.split.target-size`` The target size for an individual split when generating splits ``134217728`` (128MB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems Presto convention does not use the prefixes from the Iceberg lib. E.g. "write.metadata.metrics.max-inferred-column-defaults" was just named "metrics_max_inferred_column" table property. While I do think the notion with prefixes is clearer, I think it's better to name it "split_target_size" for now. Please also add explanation here this correspond to read.split.target-size
table property in Iceberg library.
Later on, we can send a proposal to use full iceberg property names since it will affect the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd and I have discussed this in another PR and decided that moving forward for new properties we will use the iceberg property names. We will introduce backwards-compatible property names for the properties which were already introduced and slowly phase the old names out over the next few releases. I have filed this issue to address it: #24483 It includes the relevant context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that using Iceberg names is clearer. But ideally we should make changes for #24483 before directly using the new name. When and who will be working on it? Do you think you can send PR for that sooner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can try to get that PR out this week. I can definitely have it in soon, but since most users won't see any changes until the 292 release, I am in the camp that it would be fine to not hold up this PR to align the old property names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! It should be good as long as it's before the 292 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a draft PR for introducing the deprecation of table property names: #24581
It is still WIP. Just needs some tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a draft PR for introducing the deprecation of table property names: #24581
It is still WIP. Just needs some tests
Great! Ping me when it's ready.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSplitSource.java
Outdated
Show resolved
Hide resolved
New release note guidelines. Please remove the manual PR link in the following format from the release note entries for this PR.
I have updated the Release Notes Guidelines to remove the examples of manually adding the PR link. |
5030328
to
4a3c4ad
Compare
4a3c4ad
to
20fcd29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZacBlanco @steveburnett Shall we also mention we're adding a Iceberg table property read.split.target-size?
Thanks for the catch @yingsu00! Yes, doc should be added for all new properties introduced in a PR. |
@yingsu00 there is already a new entry in the doc for |
Description
Adds a session property for target split size.
Closes #24419
Motivation and Context
Makes it easier to do performance debugging by setting the desired split size on a per-query basis.
Impact
New configuration property.
Test Plan
unit tests
Contributor checklist
Release Notes