[Iceberg] Add table and session property for split size #24417

Merged
merged 1 commit into from
Feb 18, 2025

@ZacBlanco ZacBlanco commented Jan 23, 2025

Description

Adds a table property and a session property for the target split size.

Closes #24419

Motivation and Context

Makes it easier to do performance debugging by setting the desired split size on a per-query basis.

Impact

New configuration property.

Test Plan

unit tests

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Add ``target_split_size_bytes`` session property
* Add ``read.split.target-size`` table property

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Jan 23, 2025
@steveburnett
Contributor

Thanks for the release note! Nit formatting suggestion.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add ``target_split_size`` session property. :pr:`24417`

Consider documenting the new session property, perhaps in
https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector/iceberg.rst#session-properties .

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch from d42702a to 958ceec Compare January 25, 2025 00:21
steveburnett previously approved these changes Jan 27, 2025
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull branch, local doc build, looks good. Thanks!

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch 3 times, most recently from abe3332 to 4a07851 Compare January 27, 2025 22:14
@ZacBlanco ZacBlanco marked this pull request as ready for review January 28, 2025 01:36
@ZacBlanco ZacBlanco requested review from elharo, hantangwangd and a team as code owners January 28, 2025 01:36
@ZacBlanco ZacBlanco requested a review from presto-oss January 28, 2025 01:36
@ZacBlanco ZacBlanco changed the title [Iceberg] Session property for target split size [Iceberg] Add table and session property for split size Jan 28, 2025
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch from 4a07851 to a9f8332 Compare February 5, 2025 16:57
Member

@hantangwangd hantangwangd left a comment

One point for discussion. From a high-level perspective, a complex query (such as the TPC-DS queries) may involve multiple tables. Is it appropriate for a single session-level target split size to override each table's own split size? Or should each table's own split size (if it exists) take precedence over the uniform session-level value?
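
To make the precedence question concrete, here is a minimal Java sketch of one possible resolution order: a positive session value wins, otherwise the table's `read.split.target-size` is used, otherwise the documented default of 134217728 bytes applies. The class `SplitSizeResolver`, the method `resolve`, and the convention that a non-positive session value means "unset" are assumptions made for illustration; they are not taken from the connector code in this PR.

```java
import java.util.Map;

/** Hypothetical helper illustrating one precedence order; not the connector's actual code. */
final class SplitSizeResolver {
    // Default documented for the table property in this PR: 134217728 bytes (128MB).
    static final long DEFAULT_TARGET_SPLIT_SIZE = 134_217_728L;
    static final String TABLE_PROPERTY = "read.split.target-size";

    private SplitSizeResolver() {}

    /**
     * Resolves the target split size in bytes for a single table.
     *
     * @param sessionValue session-level target; ASSUMPTION: a value <= 0 means "not set"
     * @param tableProperties the table's own properties
     */
    static long resolve(long sessionValue, Map<String, String> tableProperties) {
        if (sessionValue > 0) {
            // A session-level override applies uniformly to every table in the query.
            return sessionValue;
        }
        String tableValue = tableProperties.get(TABLE_PROPERTY);
        return tableValue != null ? Long.parseLong(tableValue) : DEFAULT_TARGET_SPLIT_SIZE;
    }
}
```

Under this ordering, leaving the session value unset preserves each table's own setting, which is the behavior the question above asks about.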

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch from a9f8332 to 04f8aaa Compare February 7, 2025 21:44
@@ -421,6 +424,9 @@ Property Name Description
``iceberg.rows_for_metadata_optimization_threshold``  Overrides the behavior of the connector property
                                                      ``iceberg.rows-for-metadata-optimization-threshold`` in the current
                                                      session.
``iceberg.target_split_size``                         Overrides the target split size for all tables in a query in bytes.
Contributor

In Hive, the counterpart is MAX_SPLIT_SIZE = "max_split_size". Can we give this one the same name as in Hive? Using different names for the same thing would confuse users. We could also name both "target_split_size"; the advantage of that name is that it matches the Iceberg library, but "hive.max_split_size" is more accurate in that split sizes cannot exceed this number. For Iceberg, it is also the maximum split size even though it is named the "target" split size. (See org/apache/iceberg/FixedSizeSplitScanTaskIterator.java)

Contributor Author

Iceberg and Hive function differently. Splits are not calculated in the same way. As such, the target size property does not refer to the absolute maximum size of the split. FixedSizeSplitScanTaskIterator only applies to files which don't support offsets. Parquet and ORC do support offsets, so they most likely won't use that class, but instead the OffsetsAwareSplitScanTaskIterator from the Iceberg library.

Either way, these are implementation details. If the property meant "max split size" it would be documented as such. This property does not represent the max split size and does not guarantee that splits will be below this size.
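
The distinction between a target and a maximum can be illustrated with a small Java sketch. It contrasts fixed-size chunking, where no split exceeds the target, with cutting a file at recorded offsets, where the target is never consulted and a split may be larger or smaller than it. This is a simplified illustration under stated assumptions, not Iceberg's implementation: the class `SplitPlanningSketch`, both method names, and the idea that offsets mark boundaries such as row groups are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of two ways to cut a file into splits. */
final class SplitPlanningSketch {
    private SplitPlanningSketch() {}

    /** Fixed-size chunking: every split length is at most targetSize. */
    static List<Long> fixedSizeSplits(long fileLength, long targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<Long> lengths = new ArrayList<>();
        for (long start = 0; start < fileLength; start += targetSize) {
            lengths.add(Math.min(targetSize, fileLength - start));
        }
        return lengths;
    }

    /**
     * Offset-based cutting: one split per [offset_i, offset_{i+1}), with the
     * last split ending at fileLength. The target size plays no role here,
     * so a split can be larger or smaller than any configured target.
     * ASSUMPTION: offsets are sorted ascending and lie within the file.
     */
    static List<Long> offsetBasedSplits(long fileLength, List<Long> offsets) {
        List<Long> lengths = new ArrayList<>();
        for (int i = 0; i < offsets.size(); i++) {
            long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
            lengths.add(end - offsets.get(i));
        }
        return lengths;
    }
}
```

For a 250-byte file and a target of 100, the fixed-size method yields lengths 100, 100, and 50, while offsets at 0 and 180 yield lengths 180 and 70, so the first split exceeds the target.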

Contributor

Ok. In this case, can we add some clarification here: "Unlike hive.max_split_size, this can be smaller or greater than the actual split size"

Contributor Author

I've added this to the description for the table property. The session property refers to the table property, so I think it should be clear enough.

@@ -388,7 +388,10 @@ Property Name Description

``metrics_max_inferred_column`` Optionally specifies the maximum number of columns for which ``100``
metrics are collected.
======================================= =============================================================== ============

``read.split.target-size`` The target size for an individual split when generating splits ``134217728`` (128MB)
Contributor

It seems the Presto convention does not use the prefixes from the Iceberg library. For example, "write.metadata.metrics.max-inferred-column-defaults" was simply named the "metrics_max_inferred_column" table property. While I do think the notation with prefixes is clearer, I think it's better to name it "split_target_size" for now. Please also add an explanation here that this corresponds to the read.split.target-size table property in the Iceberg library.

Later on, we can send a proposal to use full iceberg property names since it will affect the users.

Contributor Author

@ZacBlanco ZacBlanco Feb 8, 2025

@hantangwangd and I have discussed this in another PR and decided that, moving forward, new properties will use the Iceberg property names. We will introduce backwards-compatible property names for the properties which were already introduced and slowly phase the old names out over the next few releases. I have filed this issue to address it: #24483. It includes the relevant context.
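
One way such a backwards-compatible lookup could work is sketched below in Java: prefer the Iceberg-style name and fall back to the legacy Presto name if only the old one is set. The class `PropertyAliasSketch` and its `lookup` method are hypothetical and are not taken from #24483 or #24581; only the two property names come from this thread.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of resolving a property by its new name with a legacy fallback. */
final class PropertyAliasSketch {
    private PropertyAliasSketch() {}

    /**
     * Returns the value stored under icebergName if present; otherwise the
     * value stored under legacyName; otherwise an empty Optional.
     */
    static Optional<String> lookup(Map<String, String> properties, String icebergName, String legacyName) {
        String value = properties.get(icebergName);
        if (value != null) {
            return Optional.of(value);
        }
        // Fall back to the older Presto-specific name so existing tables keep working.
        return Optional.ofNullable(properties.get(legacyName));
    }
}
```

For example, a table that still sets only `metrics_max_inferred_column` would resolve through the fallback when looked up under `write.metadata.metrics.max-inferred-column-defaults`.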

Contributor

I agree that using Iceberg names is clearer. But ideally we should make the changes for #24483 before directly using the new name. Who will be working on it, and when? Do you think you can send a PR for that sooner?

Contributor Author

Sure, I can try to get that PR out this week. I can definitely have it in soon, but since most users won't see any changes until the 292 release, I think it would be fine not to hold up this PR to align the old property names.

Contributor

Sounds good! It should be good as long as it's before the 292 release.

Contributor Author

Here is a draft PR for introducing the deprecation of table property names: #24581

It is still WIP. Just needs some tests

Contributor

> Here is a draft PR for introducing the deprecation of table property names: #24581
>
> It is still WIP. Just needs some tests

Great! Ping me when it's ready.

@yingsu00 yingsu00 self-requested a review February 12, 2025 01:21
@steveburnett
Contributor

New release note guidelines. Please remove the manual PR link in the following format from the release note entries for this PR.


:pr:`12345`

I have updated the Release Notes Guidelines to remove the examples of manually adding the PR link.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch 3 times, most recently from 5030328 to 4a3c4ad Compare February 13, 2025 00:05
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-split-size branch from 4a3c4ad to 20fcd29 Compare February 13, 2025 07:50
Contributor

@yingsu00 yingsu00 left a comment

@ZacBlanco @steveburnett Shall we also mention we're adding an Iceberg table property read.split.target-size?

@steveburnett
Contributor

> @ZacBlanco @steveburnett Shall we also mention we're adding an Iceberg table property read.split.target-size?

Thanks for the catch @yingsu00! Yes, doc should be added for all new properties introduced in a PR.

@ZacBlanco
Contributor Author

@yingsu00 there is already a new entry in the doc for read.split.target-size in the table properties. I will add it to the release note though.

@ZacBlanco ZacBlanco merged commit c1ec5a7 into prestodb:master Feb 18, 2025
55 checks passed
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make iceberg table target split size configurable
5 participants