Improve call performance for sync_partition_metadata utility #18384

Merged
merged 3 commits into from
Jan 7, 2023

Member

@fgwang7w fgwang7w commented Sep 22, 2022

This PR fixes the following issues:

Case 1: The current implementation for getting partition names uses the partition values directly, without taking the actual path into account. For example, escaped characters may be encoded in the partition name stored in HMS (Hive Metastore), so building names with the current naming convention can fetch incorrect partition names.

  • Solution: Retrieve the partition storage location from the metastore using the correct API calls.
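
The mismatch described in Case 1 can be illustrated with a minimal sketch. The class `PartitionNameSketch` and both helper methods are hypothetical and are not taken from the PR; the example also assumes, for illustration only, that the stored path has `:` percent-encoded as `%3A`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code changed by this PR.
class PartitionNameSketch {
    // Naive approach: join column names and raw values,
    // ignoring how the directory was actually written.
    static String nameFromValues(List<String> columns, List<String> values) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            parts.add(columns.get(i) + "=" + values.get(i));
        }
        return String.join("/", parts);
    }

    // Path-based approach: take the partition location relative to the
    // table location, so any escaping already in the path is preserved.
    static String nameFromLocation(String tableLocation, String partitionLocation) {
        String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        if (!partitionLocation.startsWith(base)) {
            throw new IllegalArgumentException("Partition is not under the table location");
        }
        String relative = partitionLocation.substring(base.length());
        // Drop a trailing slash, if any.
        return relative.endsWith("/")
                ? relative.substring(0, relative.length() - 1)
                : relative;
    }
}
```

With columns `ds` and `hr` and values `2022-09-22` and `10:30`, `nameFromValues` yields `ds=2022-09-22/hr=10:30`, while `nameFromLocation` applied to a location ending in `ds=2022-09-22/hr=10%3A30/` yields `ds=2022-09-22/hr=10%3A30`. The two names differ, which is the kind of incorrectness the path-based lookup is meant to avoid.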

Case 2: The sync partition utility lists the file status of each partition twice. When an external table has a large number of partitions, the redundant file status check becomes a performance bottleneck.

  • Solution: Skip the file status check once the sync partition procedure code path already knows the path name from storage.
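
A rough sketch of why the second check is costly, assuming each call to the file system is a remote round trip. `ListingSketch` and `CountingFileSystem` are hypothetical stand-ins, not Presto or Hadoop classes; the fake file system only counts calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code changed by this PR.
class ListingSketch {
    // Stand-in for a remote file system that counts every status call.
    static class CountingFileSystem {
        private final Map<String, List<String>> children;
        int statusCalls;

        CountingFileSystem(Map<String, List<String>> children) {
            this.children = children;
        }

        List<String> listDirectories(String path) {
            statusCalls++;
            return children.getOrDefault(path, List.of());
        }

        boolean isDirectory(String path) {
            statusCalls++;
            if (children.containsKey(path)) {
                return true;
            }
            for (List<String> listed : children.values()) {
                if (listed.contains(path)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Before: every path returned by the listing is checked a second time.
    static List<String> listThenRecheck(CountingFileSystem fs, String tableLocation) {
        List<String> result = new ArrayList<>();
        for (String path : fs.listDirectories(tableLocation)) {
            if (fs.isDirectory(path)) {
                result.add(path);
            }
        }
        return result;
    }

    // After: the listing result is trusted, so there is one call in total.
    static List<String> listOnce(CountingFileSystem fs, String tableLocation) {
        return new ArrayList<>(fs.listDirectories(tableLocation));
    }
}
```

For a table with N partition directories, the first variant in this sketch issues N + 1 calls and the second issues one, while both return the same paths.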

Case 3: For the create partition call, the batch size (the number of partitions in each chunk added to the metastore database) is bounded by a fixed number. This limits metastore update performance when the underlying database can handle a large number of batches and a large update batch size.

  • Solution: Let each metastore system set this commit batch size differently, based on the capacity of its underlying database.
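
The batching idea in Case 3 can be sketched as splitting the partitions to add into chunks of a configurable size, with one metastore call per chunk. `BatchSketch`, `batches`, and `batchSize` are hypothetical names used for illustration, not identifiers from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code changed by this PR.
class BatchSketch {
    // Split items into consecutive chunks of at most batchSize elements.
    // Each chunk would correspond to one "add partitions" call.
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            // Written this way so a very large batchSize cannot overflow the end index.
            int end = start + Math.min(batchSize, items.size() - start);
            result.add(items.subList(start, end));
        }
        return result;
    }
}
```

With 25 partitions, a batch size of 10 produces three calls (10, 10, and 5 partitions), while a batch size of 100 produces a single call. A larger limit therefore means fewer round trips, provided the backing metastore accepts batches of that size.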

== RELEASE NOTES ==

General Changes
* Improve performance of the sync_partition_metadata procedure.

Hive Changes
* Increase the default partition batch size per commit for Hive Metastore up to 10.
* Set the default partition batch size per commit for Glue Metastore up to 10,000.

Detailed troubleshooting notes and the solution design can be found on the linked reference page.

| # of partitions | Before (min) | After (min) | Delta (approx. speedup, ×) |
|---|---|---|---|
| 2,526 | 5 | 1 | 5 |
| 12,000 | 54 | 6 | 9 |
| 39,493 | 92 | 12 | 7 |

@fgwang7w fgwang7w marked this pull request as ready for review September 22, 2022 07:01
@fgwang7w fgwang7w requested a review from a team as a code owner September 22, 2022 07:01
Member

@agrawalreetika agrawalreetika left a comment

Thank you for working on this.
The benchmark numbers look promising; this would be a good improvement for the sync-partition procedure.
I have taken a first pass on the PR and added my comments.
Are the attached benchmark results from Glue or HMS? Should we run them for both?

@fgwang7w fgwang7w force-pushed the syncpart-oss branch 2 times, most recently from 60d9881 to 946716a Compare September 29, 2022 04:08
Member

@imjalpreet imjalpreet left a comment

@fgwang7w Thank you for the optimisation. I did the first pass and I have a question regarding one change added in SyncPartitionMetadataProcedure.

Member

@imjalpreet imjalpreet left a comment

I have one more suggestion; please let me know your views on it.

Member

@agrawalreetika agrawalreetika left a comment

The changes look good to me. Quick question, Are these attached Benchmark results from glue? Could we update the document with the same? Please update the PR release notes too about the partition batch.

@fgwang7w
Member Author

> The changes look good to me. Quick question, Are these attached Benchmark results from glue? Could we update the document with the same? Please update the PR release notes too about the partition batch.

Sure, I will update the release notes. Many thanks for helping with the review.

@ethanyzhang
Contributor

@tdcmeehan @prestodb/committers Hi Presto committers, this PR is reviewed internally already, can we have a second round review?

@fgwang7w fgwang7w force-pushed the syncpart-oss branch 2 times, most recently from d1d4e54 to 12d7d8c Compare December 7, 2022 21:50
@fgwang7w fgwang7w force-pushed the syncpart-oss branch 3 times, most recently from d1c911f to 9513fd6 Compare December 14, 2022 19:00
Contributor

@yingsu00 yingsu00 left a comment

@fgwang7w George, great improvement! In addition to the comments in code, there are several other nits:

  1. The titles of commits 1 and 3 are too long
  2. The first letter of each commit title should be upper case
  3. Can you use full sentences in the PR message?

@fgwang7w fgwang7w force-pushed the syncpart-oss branch 2 times, most recently from ceb14b8 to 98ec731 Compare December 18, 2022 20:50
@fgwang7w fgwang7w changed the title improve sync_partition_metadata performance Improve call performance for sync_partition_metadata utility Dec 18, 2022
@fgwang7w
Member Author

Thank you @yingsu00 for review. All comments are addressed, could you please give another pass? Thanks!
Thank you @pranjalssh @agrawalreetika for approving the PR.

@fgwang7w
Member Author

@prestodb/committers ping!

@fgwang7w fgwang7w force-pushed the syncpart-oss branch 5 times, most recently from a0caa9a to b9a5dc3 Compare December 27, 2022 19:34
@yingsu00 yingsu00 self-requested a review January 4, 2023 01:02
Contributor

@yingsu00 yingsu00 left a comment

@fgwang7w Mostly good, just some nits.
In the PR message and commit messages:

without take into account

-> without taking into account

hms

-> HMS(Hive Meta Store)

Case 2: Sync partition utility lists file status of each partition twice. In case of handling a large of number of partitions for an external table, redundant check file status is a performance bottleneck.

-> Case 2: Sync partition utility lists file status of each partition twice, causing a performance bottleneck when processing an external table with a large number of partitions.

Commit 1 & 3 titles are too long.
Use upper case for the first letter in each sentence, including all commit titles, commit/PR messages, newly added comments.

Current implementation for getting partition names takes partition values directly
without take into account of the actual path.
According to the [hive source code](shorturl.at/evT28), this logic may change in the
future. This commit updates the partition values with the relative path acquired from
the metastore APIs.
Reference to [HMS Repair Utility](shorturl.at/eHIXY), the size for adding partitions
can be in a range between 1 and 2,147,483,647.
For [Glue metastore](shorturl.at/pqDH9), it’s bounded by 100 for write access.
@yingsu00 yingsu00 merged commit 4fdd16e into prestodb:master Jan 7, 2023
@fgwang7w fgwang7w deleted the syncpart-oss branch January 9, 2023 03:19
@wanglinsong wanglinsong mentioned this pull request Jan 12, 2023