Remove Artificially Low Chunk Size Limits from GCS + Azure Blob Stores #59279

Merged 6 commits into elastic:master from the 56018 branch on Jul 14, 2020

Conversation

original-brownbear
Member

Removing these limits as they cause unnecessarily many objects in the blob stores.
We do not have to worry about BwC for this change since we do not support any third-party
implementations of Azure or GCS.
Also, since there is no valid reason to set a chunk size other than the default maximum at this
point, the documentation for the setting (which was incorrect in the case of Azure to begin with)
is removed as well.

Closes #56018
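
For context, `chunk_size` is a repository setting that caps the size of a single blob: any file larger than the chunk size is split across multiple objects in the blob store, which is why a low limit inflates the object count. A minimal sketch of where the setting is applied when registering an Azure repository (the repository name, container name, and the `32mb` value are purely illustrative, not recommendations from this PR):

```
PUT _snapshot/my_azure_repo
{
  "type": "azure",
  "settings": {
    "container": "my-container",
    "chunk_size": "32mb"
  }
}
```

With the limits removed, leaving `chunk_size` unset keeps files in a single blob each, up to the store's maximum blob size.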

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

elasticmachine added the Team:Distributed (Obsolete) label on Jul 9, 2020
@original-brownbear
Member Author

@ywelsch @tlrx I only did a quick manual test for this and it seems everything works just fine for larger blobs still. I guess our benchmarking efforts will stress test this anyway with GB-scale files so we should be on the safe side here?

@@ -162,12 +162,6 @@ The Azure repository supports following settings:
Specifies the path within container to repository data. Defaults to empty
(root directory).

`chunk_size`::
Contributor


I think we should still keep the docs for chunk_size. Perhaps we can mention that lowering this value can result in many more files in the repo. Still, it's a valid option and should be documented.
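
To make that trade-off concrete (the repository name, bucket name, and value below are hypothetical, not taken from this PR): with a deliberately small `chunk_size`, every file larger than that value is stored as several ~100 MB blobs, so a multi-GB segment file turns into dozens of objects.

```
PUT _snapshot/my_gcs_repo
{
  "type": "gcs",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "chunk_size": "100mb"
  }
}
```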

Member Author


Still, it's a valid option and should be documented.

Is it though? Setting this option has no advantages whatsoever, does it? And we had a few cases, both internal and external, where people mistook this option for some sort of buffer size.

Contributor


Setting this option has no advantages whatsoever, does it?

In the searchable snapshot case, it allows parallel downloads of file chunks. Not saying that that's a good use case, just that there might be some that we don't know about.

And we had a few cases, both internal and external, where people mistook this option for some sort of buffer size.

This could be fixed by improving the documentation :)

I prefer full transparency on this setting (saying what its default is and why), actively recommending against tuning it, but not hiding it from the docs (as folks might otherwise set it based on old docs / S3 docs and have completely wrong expectations).

Member Author


Makes sense, I pushed f45a4f7 :)

@@ -226,12 +226,6 @@ The following settings are supported:
Specifies the path within bucket to repository data. Defaults to
the root of the bucket.

`chunk_size`::
Contributor


same here

@original-brownbear
Member Author

(tests are broken because they are broken in master currently ...)

Specify the chunk size as a value and unit, for example:
`10MB`, `5KB`, `500B`. Defaults to `64MB` (64MB max).
`10MB`, `5KB`, `500B`. Defaults to the maximum size of a blob in the Azure blob store.
Contributor


let's spell out what the value is (same for GCS)

@original-brownbear
Member Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 60e0b46 into elastic:master Jul 14, 2020
@original-brownbear original-brownbear deleted the 56018 branch July 14, 2020 19:34
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 14, 2020
Remove Artificially Low Chunk Size Limits from GCS + Azure Blob Stores (elastic#59279)

original-brownbear added a commit that referenced this pull request Jul 14, 2020
Remove Artificially Low Chunk Size Limits from GCS + Azure Blob Stores (#59279) (#59564)

Labels
:Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>non-issue
Team:Distributed (Obsolete) (Meta label for distributed team (obsolete); replaced by Distributed Indexing/Coordination)
v7.9.0
v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Azure and GCS Repositories Needlessly Limit Chunk Size?
4 participants