-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Iceberg] Collect data size statistics for Iceberg tables #22327
[Iceberg] Collect data size statistics for Iceberg tables #22327
Conversation
adb9ce9
to
1ba75fa
Compare
Codenotify: Notifying subscribers in CODENOTIFY files for diff fdfa3fe...de3852d.
|
1ba75fa
to
06e6bef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc! One suggestion about moving a sentence to improve readability. Let me know what you think, and if you have a better idea.
06e6bef
to
c8ae6a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick response, looks good. One tiny nit that I think I overlooked previously.
c8ae6a7
to
13b66bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local build, reviewed and noticed the new content and formatting fixes, everything looks good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me, some little things for discussion.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java
Show resolved
Hide resolved
c449f36
to
437b48a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Only one little thing. Thanks for the important improvement.
@tdcmeehan Would you like to take a final look?
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
5d9f55c
to
7af17ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reviewing tests, but changes LGTM overall
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Show resolved
Hide resolved
@@ -459,6 +522,10 @@ public static List<ColumnStatisticMetadata> getSupportedColumnStatistics(String | |||
supportedStatistics.add(NUMBER_OF_DISTINCT_VALUES.getColumnStatisticMetadataWithCustomFunction(columnName, "sketch_theta")); | |||
} | |||
|
|||
if (!(type instanceof FixedWidthType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I think this is clearer than before now about what types we track the size for
ba65e51
7af17ae
to
ba65e51
Compare
@aaneja @hantangwangd FYI while testing some of this code again I also noticed that the total table size result was incorrect so I also added a function to calculate it properly: presto/presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java Lines 101 to 115 in ba65e51
|
ba65e51
to
f7ebe67
Compare
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Show resolved
Hide resolved
069e7e5
to
9b37792
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except a few small things
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/StatisticsUtil.java
Show resolved
Hide resolved
f490f07
to
71f56ed
Compare
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java
Outdated
Show resolved
Hide resolved
Previously, the data size statistic was computed by using the Iceberg data manifests data size field. This is value is misleading for Presto because it represents the compressed on-disk size. This change allows ANALYZE to read and write data size statistic values to puffin files. This change also updates the hive-statistics-merge-strategy config value in the Iceberg connector to accept a comma-separated list of valid values to override from the HMS instead of using an independent enum. This allows for a wider variety of combinations using less code.
71f56ed
to
de3852d
Compare
Hive Metastore. The available values are ``NONE``, | ||
``USE_NULLS_FRACTION_AND_NDV``, ``USE_NULLS_FRACTIONS`` | ||
and, ``USE_NDV`` | ||
``iceberg.hive-statistics-merge-strategy`` Comma separated list of statistics to use from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the behavior of this flag is changing from release to release, I think we'd want to introduce a new flag for this behavior, and deprecate the old flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a process for deprecation? I felt this was safe, because very few (if any) users consume these flags except for our internal benchmarks. I can add a release note for it.
Also, since Presto doesn't have a 1.X release, According to SemVer rules, this should be OK to change between releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZacBlanco that is definitely not the case. Presto is over a decade old, and people rely on features not breaking release to release.
I think in this case though, it's fine, because it hasn't made it to our public documentation yet (it didn't seem to make it to 0.286). Please add a release note, and ensure the next release contains the latest documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The release notes have been added to the PR description
Description
Previously, the data size statistic was computed by using the Iceberg data manifests data size field. This is value is misleading for Presto because it represents the compressed on-disk size rather than in-memory size.
This change allows ANALYZE to read and write data size statistic values to puffin files.
There are some other minor changes included in this PR
iceberg.hive-statistics-merge-strategy
configuration to pass a comma-separated list of overridesMotivation and Context
Previous statistics reported by data size are wrong
Closes #22208
Impact
Presto-generated puffin files will now include a blob for data size after running
ANALYZE
Test Plan
New tests added to read+write data size.
Contributor checklist
Release Notes