
md5 hash with upload_fileobj. #845

Closed
sidheshdivekar29 opened this issue Oct 12, 2016 · 4 comments

@sidheshdivekar29

sidheshdivekar29 commented Oct 12, 2016

Hi,
Does upload_fileobj take care of verifying that the md5 of the file being uploaded matches the md5 of the uploaded object in S3 once the upload is over?

I see that since upload_fileobj uses a multipart upload, the md5 hash in the ETag is of the form hash-2 and differs from the original md5 of the file.

How can I efficiently make sure the md5 of the file being uploaded matches the md5 of the object in S3?
If upload_fileobj takes care of the integrity checking, then the application can safely assume the object made it to S3 and doesn't have to implement anything to compare the md5s.
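
For illustration, a minimal sketch of the mismatch described above; the bucket and file names are hypothetical:

```python
import hashlib

import boto3

s3 = boto3.client("s3")

# MD5 of the local file, computed in chunks to avoid holding it all in memory.
md5 = hashlib.md5()
with open("big-file.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        md5.update(chunk)
print("local md5:", md5.hexdigest())

with open("big-file.bin", "rb") as f:
    s3.upload_fileobj(f, "my-bucket", "big-file.bin")

# For a multipart upload the ETag is not the object's MD5; it looks like
# "<hash>-<part count>", e.g. ending in "-2".
etag = s3.head_object(Bucket="my-bucket", Key="big-file.bin")["ETag"]
print("s3 etag:", etag)
```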

@kyleknap
Contributor

Yeah, so the md5 is only equal to the ETag under certain circumstances (i.e. a non-multipart upload).

Boto3 does not integrity check the md5 of the entire file for multipart uploads, but it does send an md5 header for each part that is uploaded, so if there is an md5 mismatch in any of the parts, it will retry that request until it is correct.

That is probably the best we can do while still doing the transfer efficiently, because determining the MD5 of the whole file and doing integrity checking on it would require streaming the entire file upfront into memory.
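
For illustration, this is roughly what the per-part check looks like when a multipart upload is driven by hand with a Content-MD5 header on each part (the comment above says upload_fileobj does the equivalent internally); the bucket, key, file name, and part size are hypothetical:

```python
import base64
import hashlib

import boto3

BUCKET = "my-bucket"
KEY = "large-file.bin"
PART_SIZE = 8 * 1024 * 1024  # 8 MB parts (example value)

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []

with open("large-file.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(PART_SIZE)
        if not chunk:
            break
        # Content-MD5 is the base64-encoded MD5 digest of the part body;
        # S3 rejects the part if the bytes it received do not match it.
        content_md5 = base64.b64encode(hashlib.md5(chunk).digest()).decode("ascii")
        resp = s3.upload_part(
            Bucket=BUCKET,
            Key=KEY,
            PartNumber=part_number,
            UploadId=upload["UploadId"],
            Body=chunk,
            ContentMD5=content_md5,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```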

How large are the files you are uploading? If they are small enough, you may be able to increase the multipart threshold so that multipart uploads are not used, in case md5 checking of each individual part is not sufficient.
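
As a sketch of that suggestion, the threshold can be raised through a TransferConfig passed to upload_fileobj; the 128 MB value, bucket, and file names here are arbitrary examples:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Raise the multipart threshold so files below it go up as a single
# PutObject request, for which S3's ETag is the object's MD5.
config = TransferConfig(multipart_threshold=128 * 1024 * 1024)

s3 = boto3.client("s3")
with open("big-file.bin", "rb") as f:
    s3.upload_fileobj(f, "my-bucket", "big-file.bin", Config=config)
```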

@kyleknap added the question and closing-soon labels on Oct 14, 2016
@sidheshdivekar29
Author

Thanks for replying. I think we should be good if the library is matching the md5s of the individual chunks uploaded and retrying in case of failures.
Is there a command or a way to enable debug/trace logging to check the number of retries that happened/are happening and whether they were successful?

@joguSD
Contributor

joguSD commented Jul 20, 2017

You can turn on debug logs by adding: boto3.set_stream_logger('').
The number of retries is also available in the response metadata, as ResponseMetadata.RetryAttempts.
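
A short sketch putting both together; note that upload_fileobj itself returns None, so the retry count has to be read from the metadata of a plain client call (the bucket and file names are made up):

```python
import boto3

# Emit botocore/boto3 debug logs (including retried requests) to stderr.
boto3.set_stream_logger('')

s3 = boto3.client("s3")
with open("big-file.bin", "rb") as f:
    s3.upload_fileobj(f, "my-bucket", "big-file.bin")

# Inspect the retry count on a regular client response, e.g. a HEAD
# request used to verify the uploaded object.
resp = s3.head_object(Bucket="my-bucket", Key="big-file.bin")
print("retries:", resp["ResponseMetadata"]["RetryAttempts"])
```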

@sidheshdivekar29
Author

Thanks for adding the logging support.
