
botocore no longer populates the Content-MD5 header leading to MissingContentMD5 error #931

Open · bollard opened this issue Jan 19, 2025 · 23 comments

@bollard commented Jan 19, 2025

Hello,

As of version 1.36.0, botocore no longer populates the Content-MD5 header (see changelog entry here). This change was subsequently merged into aiobotocore as of version 2.18 (see commit here).

Practically, this now seems to mean that when I try to perform a delete operation on an S3FS file system I receive the following error:

  File "/usr/local/lib/python3.12/site-packages/s3fs/core.py", line 114, in _error_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (MissingContentMD5) when calling the DeleteObjects operation: Missing required header for this request: Content-Md5.

So far my only workaround is to pin aiobotocore < 2.18. I am using the latest s3fs (2024.12.0).

Thanks

@martindurant (Member)

Thanks for bringing this to my attention.

Is this against AWS, or another implementation of S3? If this is real AWS, how are you expected to delete files now?

@bollard (Author) commented Jan 19, 2025

No, I'm using s3fs to interact with an internal MinIO instance (and to be honest I don't know enough about AWS/S3 to answer the follow-up - it just appears to me to be a potentially very impactful change in behaviour).

Just to follow up: I have tried to look through what I believe to be the offending commit (here), and perhaps request_checksum_calculation now needs to be set?

@martindurant (Member)

OK, so I gather AWS must have switched to CRC and MinIO (maybe depending on deployment version) has not.

The doc suggests that changing the value of client_config.request_checksum_calculation to "when_supported" in the config (or via the AWS_REQUEST_CHECKSUM_CALCULATION env variable) will only affect whether the CRC is calculated, never MD5; all the associated MD5 code is marked deprecated. Maybe still worth a try?
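For anyone who wants to experiment, a minimal sketch of the two knobs (botocore >= 1.36 assumed; the value names come from the linked doc, and this is untested against MinIO):

# Untested sketch: pass the setting through botocore's Config object.
from botocore.config import Config

cfg = Config(request_checksum_calculation="when_supported")  # or "when_required"

# Alternatively, set it via the environment before any client is created:
#   export AWS_REQUEST_CHECKSUM_CALCULATION=when_required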

@pitrou commented Jan 20, 2025

Upstream Minio issue: minio/minio#20845

@malmans2 (Contributor)

We’re running into a similar issue, though it’s slightly different:

OSError: [Errno 22] An error occurred (MissingContentLength) when calling the PutObject operation: Unknown

It looks like this is caused by the same breaking change in boto3.

Would this be something that can be fixed in s3fs, or does it need to be handled in one of the dependencies?

@hutch3232 (Contributor)

botocore 1.36.0 also broke s3fs for my S3-compatible on-prem deployment. This is reproducible for me:

# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "pandas",
#     "s3fs",
#     "botocore==1.36",
# ]
# ///

import s3fs
import pandas as pd
 

s3 = s3fs.S3FileSystem(profile="my-profile")
df = pd.DataFrame({"my_col":[1, 2, 3]})
df.to_csv("/tmp/test_df.csv")
s3.put("/tmp/test_df.csv", "s3://my-bucket/my-prefix/test_df.csv")

# when botocore<1.36:

# ,my_col
# 0,1
# 1,2
# 2,3

# when botocore==1.36.0

# 14
# ,my_col
# 0,1
# 1,2

Essentially the CSV gets corrupted: a spurious value is written at the top of the file (14 in this case) and the last row is lost. (A guess: 14 looks like an aws-chunked chunk-size header - 0x14 is 20 bytes, exactly the length of the original file - written out literally by a server that doesn't understand the new chunked upload framing.)

(ran the above as a PEP 723 script using uv)

@martindurant (Member)

As far as I know, the only solution currently is to downgrade botocore. I don't know if there's any scope for s3fs to add extra headers, since the values are calculated on the finished HTTP request after control has passed to botocore.

Unfortunately, it doesn't seem like botocore is interested in maintaining compatibility, since they explicitly target AWS.

Having said that, I'm surprised to see PutObject implicated too - whether with the client error (which seems to be the same issue) or the data corruption (which may well be something else). In the case of PutObject, we always know the length of the body beforehand, so we could pass it explicitly if we knew the required header key.

@martindurant (Member)

Perhaps someone can do a trace to see how the calls differ between the new and old botocore?

I have another emergency I need to deal with today...

@boringbyte commented Jan 30, 2025

@martindurant, I think the changes made to checksum handling in PR boto/botocore#3271 are likely causing this issue. Setting the environment variable AWS_REQUEST_CHECKSUM_CALCULATION to WHEN_REQUIRED might address it.

@martindurant (Member)

@boringbyte, I don't think so. In fact, "when_required" is the default; setting it to the more general "when_supported" doesn't help either, though, since it still produces a CRC rather than the previous behaviour with MD5.

@bollard (Author) commented Jan 30, 2025

Just to follow up: updating MinIO to the latest version (RELEASE.2025-01-20T14-49-07Z) resolved the issue for me. I therefore think this can be closed, as it is an upstream boto/MinIO issue. Thank you.

@martindurant (Member)

I'll leave it open for now as the ecosystem catches up - and maybe someone comes up with a way to inject those headers for older deployments.

@ryanovas commented Feb 3, 2025

It seems to me there is a way to disable this behaviour, according to this comment on the botocore issue: boto/boto3#4398 (comment)

Is it not possible for us to pass in some kind of extra config to enable this?

@martindurant (Member)

That config can be changed via an environment variable (#931 (comment)), so please do try it!

@ryanovas commented Feb 3, 2025

Environment variables are fine and dandy, but it seems like a limited solution to have to know about and set an env var everywhere this might run. Plus, not all of us are here because we use s3fs directly - in my case it's because pyiceberg relies on s3fs. It would be much more effective, imo, for us and other libraries built on s3fs to be able to set a flag directly in our code that carries across to all environments.

@martindurant (Member)

> Environment variables are fine and dandy,

The question is: does this workaround solve the problem? If yes, we can work out how to expose it programmatically.

@ryanovas commented Feb 3, 2025

@martindurant I can confirm adding the environment variable fixes the problem.

export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'
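For anyone who can't control the shell environment, a sketch of the same workaround applied in-process; this assumes it runs before any botocore client has been created:

# Untested sketch: botocore reads the variable when a client is created,
# so setting it at the top of the entry-point script should also work.
import os

os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"

import s3fs  # imported afterwards, out of caution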

@martindurant (Member)

Thanks for testing.

request_checksum_calculation appears in the botocore config (https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html), so I would try passing it using client_kwargs or config_kwargs to s3fs.

Assuming one or both of those works, then I suppose we are done: we have a workaround. However, we might still make this more prominent, provide extra documentation, or try to catch that exact exception and provide remediation instructions.
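As a sketch of the "catch that exact exception and advise" idea (hypothetical: the error code comes from the traceback at the top of this issue, the s3 filesystem object is assumed to already exist, and the remediation text is made up):

import botocore.exceptions

try:
    s3.rm("my-bucket/my-prefix/test_df.csv")
except botocore.exceptions.ClientError as exc:
    # "MissingContentMD5" is the code from the original report.
    if exc.response.get("Error", {}).get("Code") == "MissingContentMD5":
        raise RuntimeError(
            "This endpoint still requires Content-MD5, which botocore >= 1.36 "
            "no longer sends; try config_kwargs="
            "{'request_checksum_calculation': 'when_required'} or upgrade "
            "the server."
        ) from exc
    raise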

@malmans2 (Contributor) commented Feb 5, 2025

It works with config_kwargs only.

Side note: Unfortunately, using request_checksum_calculation gives an error when boto3<1.36: TypeError: Got unexpected keyword argument 'request_checksum_calculation'
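Concretely, the working call looks something like this (bucket and paths are placeholders):

import s3fs

# config_kwargs are forwarded to botocore.config.Config, which is where
# request_checksum_calculation lives; passing it via client_kwargs did not work.
s3 = s3fs.S3FileSystem(
    config_kwargs={"request_checksum_calculation": "when_required"},
)
s3.put("/tmp/test_df.csv", "s3://my-bucket/my-prefix/test_df.csv")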

@martindurant (Member)

> Unfortunately, using request_checksum_calculation gives an error when boto3<1.36

OK, so we certainly can't make this default.

What's the opinion here: is this thread enough to get people working again? Do we need a documentation note somewhere?

@hutch3232 (Contributor)

> export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'

Wanted to update that this fixes the data corruption issue I posted about above: #931 (comment)

I think it would be very beneficial if s3fs were able to do something to fix this automatically. I realize this issue is not at all s3fs's fault. I just think that many power users will now have to remember to set this env var in all of their environments and scripts. Non-power users - say, those benefiting from pandas wrapping s3fs behind the scenes - will be very confused about why they're suddenly getting data corruption. No error is actually thrown in my example above, which makes it even harder to diagnose.

@tcrasset

Couldn't we programmatically add that to the config kwargs if we see that botocore >= 1.36 is installed?

from importlib.metadata import version

# Compare (major, minor) as a tuple, so that e.g. botocore 2.0 also matches.
major, minor, *_ = version("botocore").split(".")
if (int(major), int(minor)) >= (1, 36):
    config_kwargs["request_checksum_calculation"] = "when_required"

@martindurant (Member)

> Couldn't we programmatically add that to the config kwargs if we see that botocore >= 1.36 is installed?

Is it not the case that this config should be left unset when the endpoint is real AWS?
