-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor multipart download to a more async model #10349
Refactor multipart download to a more async model #10349
Conversation
Compatibility status:Checks if related components are compatible with change a9b8b63 Incompatible componentsSkipped componentsCompatible componentsCompatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git] |
Gradle Check (Jenkins) Run Completed with:
|
5c06336
to
93ca05a
Compare
The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com>
93ca05a
to
a9b8b63
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
|
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #10349 +/- ##
============================================
- Coverage 71.21% 71.11% -0.10%
+ Complexity 58337 58295 -42
============================================
Files 4832 4830 -2
Lines 274828 274840 +12
Branches 40043 40048 +5
============================================
- Hits 195708 195444 -264
- Misses 62728 63071 +343
+ Partials 16392 16325 -67
|
I'm going to merge this to unblock subsequent PRs, but will follow up with @Bukhtawar and @gbbafna and address any comments or concerns. |
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10349-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 28f185b347a3333c8670ca1a7bd7d0a85fed14e9
# Push it to GitHub
git push --set-upstream origin backport/backport-10349-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x Then, create a pull request where the |
…#10349) * Refactor read context streams to async streams Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Refactor multipart download to a more async model The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com> --------- Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Andrew Ross <andrross@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com> (cherry picked from commit 28f185b)
* Refactor read context streams to async streams Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Refactor multipart download to a more async model The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com> --------- Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Andrew Ross <andrross@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com> (cherry picked from commit 28f185b)
Related to opensearch-project/OpenSearch#10349 Signed-off-by: Andrew Ross <andrross@amazon.com>
…#10349) * Refactor read context streams to async streams Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Refactor multipart download to a more async model The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com> --------- Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Andrew Ross <andrross@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com>
…#10349) * Refactor read context streams to async streams Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Refactor multipart download to a more async model The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com> --------- Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Andrew Ross <andrross@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com>
Related to opensearch-project/OpenSearch#10349 Signed-off-by: Andrew Ross <andrross@amazon.com>
Related to opensearch-project/OpenSearch#10349 Signed-off-by: Andrew Ross <andrross@amazon.com> Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Related to opensearch-project/OpenSearch#10349 Signed-off-by: Andrew Ross <andrross@amazon.com>
Related to opensearch-project/OpenSearch#10349 Signed-off-by: Andrew Ross <andrross@amazon.com>
…#10349) * Refactor read context streams to async streams Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Refactor multipart download to a more async model The previous approach of kicking off the stream requests for all parts of a file did not work well for very large files. For example, a 20GiB file uploaded in 16MiB parts will consist of 1200+ parts. When we attempted to initiate streaming for all parts concurrently, some parts would hit a client timeout after 2 minutes without being able to get a connection due to the other parts not having been completed in that time frame. This refactoring adds yet another layer of indirection in order to allow the code that is actually writing the destination file to control the rate at which streams are started. This should allow for downloading files consisting of arbitrarily many parts at any connection speed. This commit also wires in the download rate limiter so that the `indices.recovery.max_bytes_per_sec` is properly honored. Signed-off-by: Andrew Ross <andrross@amazon.com> --------- Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Andrew Ross <andrross@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com> Signed-off-by: Shivansh Arora <hishiv@amazon.com>
The previous approach of kicking off the stream requests for all parts
of a file did not work well for very large files. For example, a 20GiB
file uploaded in 16MiB parts will consist of 1200+ parts. When we
attempted to initiate streaming for all parts concurrently, some parts
would hit a client timeout after 2 minutes without being able to get a
connection due to the other parts not having been completed in that time
frame. This refactoring adds yet another layer of indirection in order
to allow the code that is actually writing the destination file to
control the rate at which streams are started. This should allow for
downloading files consisting of arbitrarily many parts at any connection
speed.
This commit also wires in the download rate limiter so that the
indices.recovery.max_bytes_per_sec
is properly honored.This PR supersedes #10284
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.