Azure and GCS Repositories Needlessly Limit Chunk Size? #56018
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
We discussed this during the snapshot resiliency sync and the consensus was that there isn't a good reason for the default chunk size to be smaller than the maximum supported by the underlying blob store. => We should adjust the default chunk size setting to the maximum blob size for each underlying blob store.
Someone mentioned in the sync that we should document this as a breaking change, but the conversation moved on before I had a chance to disagree, so I'm raising that here. I don't think we need to call the change to the default a breaking change. I do think there's value in keeping the setting around indefinitely, however, as there are situations where the maximum blob size supported by the repository may be under the user's control (e.g. CephFS) or may not correspond to the maximum file size of the data store.
++ David, updated my comment accordingly.
Removing these limits as they cause unnecessarily many objects in the blob stores. We do not have to worry about BwC for this change since we do not support any third-party implementations of Azure or GCS. Also, since there is no valid reason to set a chunk size other than the default maximum at this point, the documentation for the setting (which was incorrect in the case of Azure to begin with) has been removed from the docs. Closes #56018 (#59279, backported in #59564)
Currently, the Azure repository plugin seems to only allow a maximum chunk size of 256MB, and the GCS repository is limited to just 100MB.
For the S3 repository, on the other hand, we allow a maximum chunk size of 5TB and default to 1GB.
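For context, the chunk size in question is the per-repository `chunk_size` setting. Below is a minimal sketch of registering a repository with an explicit `chunk_size`, assuming a local cluster on localhost:9200 with the GCS repository plugin installed; the repository name and bucket are made-up examples, not anything from this issue.

```python
import requests

# Hypothetical example: register a GCS snapshot repository with an explicit
# chunk_size via the snapshot repository API. Endpoint, repository name and
# bucket are assumptions for illustration only.
resp = requests.put(
    "http://localhost:9200/_snapshot/my_gcs_repo",
    json={
        "type": "gcs",
        "settings": {
            "bucket": "my-snapshot-bucket",
            # The plugin currently caps this at 100MB for GCS (256MB for
            # Azure); files larger than the chunk size are split across
            # multiple blobs in the repository.
            "chunk_size": "100mb",
        },
    },
)
resp.raise_for_status()
print(resp.json())
```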
Chunking large files this heavily incurs a lot of overhead when writing and restoring large files from the repositories. We are using multi-part uploads and their equivalents on all three implementations (S3, GCS, Azure), so there really is no need to artificially chunk files at all.
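To illustrate the overhead, here is a rough back-of-the-envelope sketch (plain Python, not Elasticsearch code; the 50GB file size is just an assumed example) of how many blobs a single large file becomes under each cap:

```python
# Illustrative arithmetic only: number of blobs one file turns into when it
# is split into fixed-size chunks.
import math

def blob_count(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of blobs needed to store one file split into chunks."""
    return math.ceil(file_size_bytes / chunk_size_bytes)

fifty_gb = 50 * 1024**3
print(blob_count(fifty_gb, 100 * 1024**2))  # 100MB cap (GCS)   -> 512 blobs
print(blob_count(fifty_gb, 256 * 1024**2))  # 256MB cap (Azure) -> 200 blobs
print(blob_count(fifty_gb, 1024**3))        # 1GB default (S3)  -> 50 blobs
```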
For S3 you could make an argument for chunking files because third-party implementations of S3 might not like very large blobs.
For GCS and Azure I think we should simply set the default chunk size to the maximum size allowed by the services (5TB and 4TB respectively). This will speed up restores of large files and keep the blob count in the repository lower (wasting fewer resources on iterating through shard directories on deletes, for example).
=> Is there any reason to keep the manual chunking around? If not, we should go ahead and optimize this. The longer we keep this kind of chunking, the more slow snapshots and repositories we create.