[Iceberg] Collect data size statistics for Iceberg tables #22327

Merged

Contributor

@ZacBlanco ZacBlanco commented Mar 25, 2024

Description

Previously, the data size statistic was computed from the data size field of the Iceberg data manifests. This value is misleading for Presto because it represents the compressed on-disk size rather than the in-memory size.
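
As a rough illustration of why an on-disk size can understate the in-memory size, the sketch below compresses a repetitive string column and compares the byte counts. It is only an analogy under stated assumptions: DEFLATE stands in for whatever encoding and compression the table's data files actually use, and the class and method names are hypothetical, not part of Presto or Iceberg.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration only; not Presto or Iceberg code.
final class CompressedVsInMemorySize
{
    private CompressedVsInMemorySize() {}

    // Returns the number of bytes produced by DEFLATE-compressing the input.
    static int deflatedSize(byte[] input)
    {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Builds a repetitive VARCHAR-like payload: the same value repeated `rows` times.
    static byte[] repeatedValues(String value, int rows)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            builder.append(value);
        }
        return builder.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        byte[] raw = repeatedValues("us-east-1", 10_000);
        System.out.println("uncompressed bytes: " + raw.length);
        System.out.println("compressed bytes:   " + deflatedSize(raw));
    }
}
```

A planner that sized memory from the compressed figure would underestimate how much data it has to process for columns like this one.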

This change allows ANALYZE to read and write data size statistic values to puffin files.

There are some other minor changes included in this PR

  • Updates the iceberg.hive-statistics-merge-strategy configuration to pass a comma-separated list of overrides
  • Updates the Iceberg changelog test so it doesn't always use port 8080

Motivation and Context

The data size statistics reported previously were wrong.

Closes #22208

Impact

Presto-generated puffin files will now include a blob for data size after running ANALYZE

Test Plan

New tests added to read+write data size.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

General Changes
* The `iceberg.hive-statistics-merge-strategy` flag has been updated to accept a comma-separated list of the following values: NUMBER_OF_DISTINCT_VALUES, TOTAL_SIZE_IN_BYTES
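
A minimal sketch of how a comma-separated value such as `NUMBER_OF_DISTINCT_VALUES,TOTAL_SIZE_IN_BYTES` could be parsed into a set of statistics to override. The enum and parser names are hypothetical stand-ins, not the connector's actual classes, and treating an empty value as "no overrides" is an assumption.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Locale;

// Hypothetical parser; the real connector's config class may differ.
final class MergeStrategyParser
{
    private MergeStrategyParser() {}

    // Parses a value such as "NUMBER_OF_DISTINCT_VALUES,TOTAL_SIZE_IN_BYTES".
    // Assumption: a null, empty, or blank value means no statistics are overridden.
    static EnumSet<MergeableStatistic> parse(String configValue)
    {
        EnumSet<MergeableStatistic> result = EnumSet.noneOf(MergeableStatistic.class);
        if (configValue == null) {
            return result;
        }
        Arrays.stream(configValue.split(","))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                // valueOf throws IllegalArgumentException for an unrecognized name
                .map(token -> MergeableStatistic.valueOf(token.toUpperCase(Locale.ROOT)))
                .forEach(result::add);
        return result;
    }

    public static void main(String[] args)
    {
        System.out.println(parse("NUMBER_OF_DISTINCT_VALUES, TOTAL_SIZE_IN_BYTES"));
    }
}

// Hypothetical stand-in for the statistics named in the release note.
enum MergeableStatistic
{
    NUMBER_OF_DISTINCT_VALUES,
    TOTAL_SIZE_IN_BYTES
}
```

Representing the setting as a set rather than a dedicated enum value per combination means a new overridable statistic needs only one new constant.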

@ZacBlanco ZacBlanco changed the title [ICEBERG] Collect data size statistics for Iceberg tables [Iceberg] Collect data size statistics for Iceberg tables Mar 25, 2024
@ZacBlanco ZacBlanco marked this pull request as ready for review March 25, 2024 19:35
@ZacBlanco ZacBlanco requested a review from a team as a code owner March 25, 2024 19:35
@ZacBlanco ZacBlanco requested a review from presto-oss March 25, 2024 19:35
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch 2 times, most recently from adb9ce9 to 1ba75fa Compare March 26, 2024 16:18
@ZacBlanco ZacBlanco requested a review from steveburnett as a code owner March 26, 2024 16:18

github-actions bot commented Mar 26, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff fdfa3fe...de3852d.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/connector/iceberg.rst

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch from 1ba75fa to 06e6bef Compare March 26, 2024 16:29
Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! One suggestion about moving a sentence to improve readability. Let me know what you think, and if you have a better idea.

presto-docs/src/main/sphinx/connector/iceberg.rst
Contributor

@steveburnett steveburnett left a comment

Thanks for the quick response, looks good. One tiny nit that I think I overlooked previously.

presto-docs/src/main/sphinx/connector/iceberg.rst
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch from c8ae6a7 to 13b66bc Compare March 26, 2024 21:19
steveburnett previously approved these changes Mar 26, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local build, reviewed and noticed the new content and formatting fixes, everything looks good. Thanks!

@tdcmeehan tdcmeehan self-assigned this Mar 27, 2024
Member

@hantangwangd hantangwangd left a comment

Overall looks good to me, some little things for discussion.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch 2 times, most recently from c449f36 to 437b48a Compare April 1, 2024 18:22
Member

@hantangwangd hantangwangd left a comment

LGTM! Only one little thing. Thanks for the important improvement.

@tdcmeehan Would you like to take a final look?

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch 3 times, most recently from 5d9f55c to 7af17ae Compare April 2, 2024 22:11
hantangwangd previously approved these changes Apr 2, 2024
@ZacBlanco ZacBlanco requested a review from steveburnett April 2, 2024 23:19
steveburnett previously approved these changes Apr 3, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local build.

Contributor

@aaneja aaneja left a comment

Still reviewing tests, but changes LGTM overall

@@ -459,6 +522,10 @@ public static List<ColumnStatisticMetadata> getSupportedColumnStatistics(String
supportedStatistics.add(NUMBER_OF_DISTINCT_VALUES.getColumnStatisticMetadataWithCustomFunction(columnName, "sketch_theta"));
}

if (!(type instanceof FixedWidthType)) {
Contributor

Nice, I think this is clearer than before now about what types we track the size for

@ZacBlanco ZacBlanco dismissed stale reviews from steveburnett and hantangwangd via ba65e51 April 4, 2024 19:28
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch from 7af17ae to ba65e51 Compare April 4, 2024 19:28
@ZacBlanco
Contributor Author

@aaneja @hantangwangd FYI while testing some of this code again I also noticed that the total table size result was incorrect so I also added a function to calculate it properly:

public static TableStatistics.Builder calculateAndSetTableSize(TableStatistics.Builder builder)
{
    // Sum the per-column sizes; the row count is needed for the fixed-width fallback.
    return builder.setTotalSize(builder.getRowCount().flatMap(rowCount -> builder.getColumnStatistics().entrySet().stream()
            .map(entry -> {
                IcebergColumnHandle columnHandle = (IcebergColumnHandle) entry.getKey();
                ColumnStatistics stats = entry.getValue();
                // Prefer the collected data size; otherwise derive it for fixed-width types
                // from the row count, the nulls fraction, and the type's fixed size.
                return stats.getDataSize().or(() -> {
                    if (columnHandle.getType() instanceof FixedWidthType) {
                        return stats.getNullsFraction().map(nulls -> rowCount * (1 - nulls) * ((FixedWidthType) columnHandle.getType()).getFixedSize());
                    }
                    else {
                        return Estimate.unknown();
                    }
                });
            })
            .reduce(Estimate.of(0.0), (currentSize, newSize) -> currentSize.flatMap(current -> newSize.map(add -> current + add)))));
}
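
To make the arithmetic concrete, here is a self-contained sketch of the same calculation using `OptionalDouble` in place of Presto's `Estimate`. The `Column` class and method names are hypothetical simplifications, and the sketch assumes that an unknown size for any column makes the total unknown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical simplification of the calculation above; not the Presto classes.
final class TableSizeSketch
{
    private TableSizeSketch() {}

    // Simplified column statistics. fixedWidthBytes is empty for
    // variable-width types such as VARCHAR.
    static final class Column
    {
        final OptionalDouble dataSize;
        final OptionalDouble nullsFraction;
        final OptionalDouble fixedWidthBytes;

        Column(OptionalDouble dataSize, OptionalDouble nullsFraction, OptionalDouble fixedWidthBytes)
        {
            this.dataSize = dataSize;
            this.nullsFraction = nullsFraction;
            this.fixedWidthBytes = fixedWidthBytes;
        }
    }

    // Uses the collected data size if present; otherwise derives it for
    // fixed-width columns as rowCount * (1 - nullsFraction) * width.
    static OptionalDouble columnSize(double rowCount, Column column)
    {
        if (column.dataSize.isPresent()) {
            return column.dataSize;
        }
        if (column.fixedWidthBytes.isPresent() && column.nullsFraction.isPresent()) {
            return OptionalDouble.of(rowCount * (1 - column.nullsFraction.getAsDouble()) * column.fixedWidthBytes.getAsDouble());
        }
        return OptionalDouble.empty();
    }

    // Sums column sizes; the total is unknown if the row count or any column size is unknown.
    static OptionalDouble totalSize(OptionalDouble rowCount, List<Column> columns)
    {
        if (!rowCount.isPresent()) {
            return OptionalDouble.empty();
        }
        double total = 0;
        for (Column column : columns) {
            OptionalDouble size = columnSize(rowCount.getAsDouble(), column);
            if (!size.isPresent()) {
                return OptionalDouble.empty();
            }
            total += size.getAsDouble();
        }
        return OptionalDouble.of(total);
    }

    public static void main(String[] args)
    {
        // An 8-byte fixed-width column with 25% nulls and no collected data size
        Column bigint = new Column(OptionalDouble.empty(), OptionalDouble.of(0.25), OptionalDouble.of(8));
        // A variable-width column whose data size was collected by ANALYZE
        Column varchar = new Column(OptionalDouble.of(50_000), OptionalDouble.of(0.0), OptionalDouble.empty());
        System.out.println(totalSize(OptionalDouble.of(1_000), Arrays.asList(bigint, varchar)));
    }
}
```

With 1,000 rows, the fixed-width column contributes 1000 × 0.75 × 8 = 6,000 bytes and the variable-width column contributes its collected 50,000 bytes, for a total of 56,000.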

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch from ba65e51 to f7ebe67 Compare April 4, 2024 20:58
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch 2 times, most recently from 069e7e5 to 9b37792 Compare April 5, 2024 17:15
@ZacBlanco ZacBlanco requested a review from jaystarshot as a code owner April 5, 2024 17:15
hantangwangd previously approved these changes Apr 6, 2024
Member

@hantangwangd hantangwangd left a comment

LGTM, thanks for the fix!

@aaneja aaneja self-requested a review April 11, 2024 18:33
Contributor

@aaneja aaneja left a comment

LGTM except a few small things

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch 3 times, most recently from f490f07 to 71f56ed Compare April 12, 2024 01:50
hantangwangd previously approved these changes Apr 12, 2024
Previously, the data size statistic was computed by using the
Iceberg data manifests data size field. This value is misleading
for Presto because it represents the compressed on-disk size.
This change allows ANALYZE to read and write data size statistic
values to puffin files.

This change also updates the hive-statistics-merge-strategy
config value in the Iceberg connector to accept a comma-separated
list of valid values to override from the HMS instead of using
an independent enum. This allows for a wider variety of combinations
using less code.
@ZacBlanco ZacBlanco dismissed stale reviews from hantangwangd and aaneja via de3852d April 12, 2024 14:37
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-data-size-analyze branch from 71f56ed to de3852d Compare April 12, 2024 14:37
Hive Metastore. The available values are ``NONE``,
``USE_NULLS_FRACTION_AND_NDV``, ``USE_NULLS_FRACTIONS``
and, ``USE_NDV``
``iceberg.hive-statistics-merge-strategy`` Comma separated list of statistics to use from the
Contributor

If the behavior of this flag is changing from release to release, I think we'd want to introduce a new flag for this behavior, and deprecate the old flag.

Contributor Author

Is there a process for deprecation? I felt this was safe, because very few (if any) users consume these flags except for our internal benchmarks. I can add a release note for it.

Also, since Presto doesn't have a 1.X release, according to SemVer rules this should be OK to change between releases.

Contributor

@tdcmeehan tdcmeehan Apr 12, 2024

@ZacBlanco that is definitely not the case. Presto is over a decade old, and people rely on features not breaking release to release.

I think in this case though, it's fine, because it hasn't made it to our public documentation yet (it didn't seem to make it to 0.286). Please add a release note, and ensure the next release contains the latest documentation.

Contributor Author

The release notes have been added to the PR description

@ZacBlanco ZacBlanco merged commit 457d812 into prestodb:master Apr 12, 2024
57 checks passed
@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks
Development

Successfully merging this pull request may close these issues.

[Iceberg] variable-width column data sizes are generally wrong
5 participants