
Azure and GCS Repositories Needlessly Limit Chunk Size? #56018

Closed
original-brownbear opened this issue Apr 30, 2020 · 4 comments · Fixed by #59279
Labels
:Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs)
Team:Distributed (Obsolete) (meta label for the distributed team; replaced by Distributed Indexing/Coordination)

Comments

@original-brownbear
Member

Currently, the Azure repository plugin only allows a maximum chunk size of 256MB, and the GCS repository is limited even further, to 100MB.
For the S3 repository, on the other hand, we allow a maximum of 5TB for the chunk size and default to 1GB.
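
For illustration, here is a minimal sketch of how such a cap looks as a bounded byte-size setting (the class name and exact bounds are simplified assumptions, not the plugin's actual code):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical, simplified declaration: a repository-scoped "chunk_size"
// setting whose default and upper bound are both pinned to 256MB, so users
// cannot opt into larger blobs even though the service itself supports them.
public class AzureChunkSizeSketch {
    static final ByteSizeValue MAX_CHUNK_SIZE = new ByteSizeValue(256, ByteSizeUnit.MB);

    static final Setting<ByteSizeValue> CHUNK_SIZE_SETTING = Setting.byteSizeSetting(
        "chunk_size",
        MAX_CHUNK_SIZE,                           // default
        new ByteSizeValue(1, ByteSizeUnit.BYTES), // minimum
        MAX_CHUNK_SIZE,                           // maximum: the artificial cap
        Setting.Property.NodeScope);
}
```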

Chunking large files this heavily incurs a lot of overhead when writing and restoring them from the repositories. We use multi-part uploads and their equivalents on all three implementations (S3, GCS, Azure), so there is really no need to artificially chunk files at all.
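
To make the overhead concrete, a quick back-of-the-envelope calculation (the 1TB file size is an arbitrary example):

```java
// Blob count for a single large file under the current chunk-size caps.
public class ChunkCount {
    public static void main(String[] args) {
        long fileSize = 1L << 40; // 1TB file, arbitrary example
        long[] chunkSizes = {100L << 20, 256L << 20, 1L << 30}; // 100MB (GCS), 256MB (Azure), 1GB (S3 default)
        for (long chunk : chunkSizes) {
            long blobs = (fileSize + chunk - 1) / chunk; // ceiling division
            System.out.printf("chunk=%4dMB -> %5d blobs%n", chunk >> 20, blobs);
        }
    }
}
```

One file becomes roughly 10486 blobs on GCS and 4096 on Azure, each of which has to be written, listed, and eventually deleted individually.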

For S3, one could argue for chunking because third-party implementations of S3 might not handle very large blobs well.

For GCS and Azure, I think we should simply default the chunk size to the maximum blob size allowed by the services (5TB and 4TB respectively). This will speed up restores of large files and keep the blob count in the repository lower (wasting fewer resources on, for example, iterating through shard directories on deletes).

=> Is there any reason to keep the manual chunking around? If not, we should go ahead and optimize this. The longer we keep this kind of chunking, the more slow snapshots and repositories we create.

@original-brownbear original-brownbear added the :Distributed Coordination/Snapshot/Restore and team-discuss labels Apr 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Apr 30, 2020
@original-brownbear
Member Author

original-brownbear commented May 6, 2020

We discussed this during the snapshot resiliency sync and the consensus was that there isn't a good reason to default to chunk sizes smaller than the maximum supported by the underlying blob store.

=> We should adjust the default chunk size setting to the maximum blob size for each underlying blob store (see the sketch below).
=> Keep the setting around to support edge cases where the blob store may not be able to handle large files the same way the data store can.
=> Document the change (no need for this)
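
A sketch of what that direction could look like for, say, the GCS plugin (the constant and names here are illustrative assumptions; the actual change is in the linked PR):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical sketch: default the chunk size to the service's maximum
// object size (5TB for GCS) instead of an artificially low cap, while
// keeping the setting user-overridable for edge cases.
public class GcsChunkSizeSketch {
    static final ByteSizeValue MAX_CHUNK_SIZE = new ByteSizeValue(5, ByteSizeUnit.TB);

    static final Setting<ByteSizeValue> CHUNK_SIZE_SETTING = Setting.byteSizeSetting(
        "chunk_size",
        MAX_CHUNK_SIZE,                           // default == service maximum
        new ByteSizeValue(1, ByteSizeUnit.BYTES), // minimum
        MAX_CHUNK_SIZE,                           // maximum
        Setting.Property.NodeScope);
}
```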

@DaveCTurner
Contributor

Someone mentioned in the sync that we should document this as a breaking change, but the conversation moved on before I had a chance to disagree, so I'm raising that here. I don't think we need to call the change to the default a breaking change. The default is already Long.MAX_VALUE on FsRepository, and we only need to care about the official services for the other repositories (plus MinIO, which claims to support objects up to 5TB just like S3 proper).

I think there's value in keeping the setting around indefinitely, however: there are situations where the maximum blob size supported by the repository may be under the user's control (e.g. CephFS) or may not correspond to the maximum file size of the data store.
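
For example (a hypothetical snippet against the high-level REST client; the repository name, location, and 100mb cap are made up for illustration), such a user could still pin the chunk size explicitly when registering the repository:

```java
import org.elasticsearch.action.admin.cluster.repositories.put.PutRepositoryRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class RegisterCappedRepo {
    static void register(RestHighLevelClient client) throws Exception {
        // Keep chunking for a backend that cannot handle very large blobs,
        // even once the default becomes effectively unbounded.
        PutRepositoryRequest request = new PutRepositoryRequest("my_backup")
            .type("fs")
            .settings(Settings.builder()
                .put("location", "/mnt/ceph/backups") // assumption: a CephFS mount
                .put("chunk_size", "100mb")           // explicit cap overriding the default
                .build());
        client.snapshot().createRepository(request, RequestOptions.DEFAULT);
    }
}
```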

@original-brownbear
Member Author

++ David, updated my comment accordingly.

@original-brownbear original-brownbear self-assigned this Jul 9, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 9, 2020
Removing these limits as they cause unnecessarily many objects in the blob stores.
We do not have to worry about BwC for this change since we do not support any third-party
implementations of Azure or GCS.
Also, since there is no valid reason to set a maximum chunk size other than the default at
this point, this also removes the documentation for the setting (which was incorrect in the
case of Azure to begin with) from the docs.

Closes elastic#56018
original-brownbear added a commit that referenced this issue Jul 14, 2020
#59279)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 14, 2020
elastic#59279)

original-brownbear added a commit that referenced this issue Jul 14, 2020
#59279) (#59564)
