velero backup create fails to upload backup to s3 using aws plugin #7543
It looks like you may have the wrong CRDs installed. BackupVolumeInfos was a new value added in Velero 1.13.
Thanks for the quick response. I found in the docs that I can run:
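The command itself didn't survive in this excerpt. Based on the Velero docs for inspecting which CRDs a given CLI version would install, it was likely something along these lines (a sketch, not a verbatim capture):

```sh
# Print the CRDs bundled with this velero CLI version, without installing anything
velero install --crds-only --dry-run -o yaml
```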
So I ran the above using velero 1.12.2 and 1.13.1, and as you said, I found BackupVolumeInfos in the output produced by 1.13.1:
My next question is: how do I update the CRDs from 1.12.2 to 1.13.1?
Here is one doc that you could reference.
Thanks for the link to the doc. I have run the following:
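The exact commands are also missing here; the Velero upgrade docs describe applying the new CRDs from the newer CLI, roughly like this (assuming the 1.13.1 CLI is on the PATH):

```sh
# Apply the CRDs bundled with the 1.13.1 CLI to the cluster
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
```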
But the backup still fails:
Maybe it is still adding the "@aws" suffix to the key id?
Not sure about the "@aws" suffix. Could you post the error information of the failed backup?
Thanks for looking into this problem. Here is the error I found for the failed backup:
The s3 support recommended: We usually see a suffix of @aws in the access_key_id of HMAC access when the s3 signature/presigned URL is not correct. We suggest engaging Velero support to investigate if they have a behavior change on the s3 signature/presigned in their new version.
What is the backup repository's backend?
The S3 backend is IBM Cloud Object Storage, which behaves like AWS S3.
Hmm. Looks like you may have the wrong bucket permissions for your s3 bucket. See the bucket policies section at https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/README.md and compare with what you have.
Thanks for the link to the documentation. As I mentioned earlier, the velero 1.12.2 and aws plugin 1.8.2 backup used to work for us, so we're not sure why it stopped working when we upgraded to 1.13.0 and 1.9.0, or to 1.13.1 and 1.9.1. Here is the velero install command we have used for many versions of velero, including 1.11 and earlier:
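The command body is missing from this excerpt. A typical velero install invocation against an S3-compatible endpoint such as IBM COS looks roughly like this; the bucket, region, endpoint, and snapshot flags are placeholders and assumptions, not the reporter's actual values:

```sh
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.1 \
  --bucket <bucket-name> \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=false \
  --backup-location-config region=<region>,s3ForcePathStyle="true",s3Url=https://<cos-endpoint>
```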
I can't think of any changes we've made to the way we handle uploads that would trigger new permission requirements between 1.12 and 1.13, although maybe there's something I'm not aware of. It may be worth creating a new bucket and making sure it has the recommended bucket policy in place to see whether this works, which will eliminate the possibility that something changed in the bucket itself.
We tried the following combinations of Velero vs aws plugin:
- Velero 1.12.2 + plugin 1.8.2: backup works
- Velero 1.13.0 + plugin 1.9.0: backup fails
- Velero 1.13.1 + plugin 1.9.1: backup fails
So we suspect aws plugin 1.9.x is adding "@aws" to the end of the key id, so velero fails to upload the backup to IBM Cloud Object Storage?
The issue may relate to the AWS SDK version bump in the Velero AWS plugin v1.9. Did you see that in the secret, the pod, or the Velero log?
I contacted IBM Cloud Object Storage and they said they found the following in their log (note the suffix "@aws" at the end of remote_user):
We have the same issue as described here and we are using official Amazon S3. Let me know if you need any logs.
IMO, this "@aws" may not be an issue. The 403 error code implies permission denied. |
As I mentioned previously, we have tried the newest version of velero, 1.13.1, with the newest version of the plugin, 1.9.1, and it failed. But if we switch to the older plugin version 1.8.2, then it works. In both cases, we have the same permissions.
Since aws-plugin v1.9.x, we've switched to aws-sdk-go-v2, so there might be a compatibility issue. Some change in sdk-v2 makes IBM Object Storage think the signature is invalid. I may look into the code, but I can't commit to a fix because currently the plugin works with AWS S3 and S3-compatible storage (minio) in our pipeline.
@reasonerjt Yes, I will report to IBM Cloud Object Storage with your findings. But please also be informed that @Alwinius said he also has a problem with Amazon S3.
I also experienced the problem in IBM Cloud with aws plugin v1.9.1.
@reasonerjt The IBM Cloud Object Storage team replied: The expected remote user should be the access key ID of the HMAC without the trailing @aws. For example, in "3f3dad27c65d41b4835b8a3be6d91cb0@aws", "3f3dad27c65d41b4835b8a3be6d91cb0" is the expected access key ID.
@Wayne-H-Ha |
@reasonerjt I just got the reply from IBM Cloud Object Storage (COS). I hope you understand the reply, as I don't have enough knowledge to digest the information. COS internal managed to capture debug-logged requests for HTTP 403 on PUT. Specifically, the AWS signature does not match what they are expecting, so they stop processing the request any further.
Request_id 1) 0ed2fc0b-acf8-4d05-b003-dd5a1bf1b072: 2024-04-02 03:30:32.330 DEBUG [etp466364426-20827]
and the other:
Request_id 2) 5982df29-85a9-4492-9573-54aaba4b484e: 2024-04-02 03:30:32.319 DEBUG [etp579017959-19571]
Checking COS logs, they can see all HTTP 403 for PUT were for user_agent
@Wayne-H-Ha are you seeing other debug logs? If not, it might be better to replace the last two lines with a combined one.
(Oh, I'm just noticing that the docs suggest it the way you had it -- so the main question is whether you're seeing other "level=debug" logs. There should be many of them if the log level is debug. If there aren't, then we'll need to figure out why the setting isn't working. If there are, then we may need to look into what exactly should be logged here, and which of those messages you're seeing and which you aren't.)
I found out that sdk-v2 by default does not produce logs. I'm PRing to the aws plugin in a bit.
You have to set ClientLogMode: https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/logging/#clientlogmode
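For reference, wiring that up in aws-sdk-go-v2 looks roughly like this (a minimal sketch, not the actual plugin PR):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// sdk-v2 logs nothing by default; ClientLogMode opts into request,
	// response, and signing logs so signature mismatches become visible.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithClientLogMode(aws.LogSigning|aws.LogRequest|aws.LogResponse),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Any client built from this config inherits the logging mode.
	client := s3.NewFromConfig(cfg)
	_ = client
}
```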
I see more than 40K entries for level=debug, more than 10K entries for level=info, and only 2 entries for level=error:
@Wayne-H-Ha try this image with debug logging enabled, from vmware-tanzu/velero-plugin-for-aws#207, then relay that info to IBM COS.
@kaovilai Thanks for providing the image with debug logging enabled. I have reproduced the problem and sent the new logs to IBM COS for them to investigate. |
Let us know of any updates. |
IBM COS said that since our bucket has a retention policy set, setting checksumAlgorithm to "" will not work for us. They need to implement sdk-v2 support in IBM COS.
@Wayne-H-Ha So does this mean a new version of IBM COS will be needed? Is this on the roadmap? |
IBM COS said they are working on implementing sdk-v2 support in their product.
As per vmware-tanzu#7543, setting checksumAlgorithm to avoid 403 errors. Added plugins line as velero install failed without this option in version 1.14.0. Removed the volumesnapshotlocation as it does not exist in 1.14.0. Signed-off-by: Gareth Anderson <gareth.anderson03@gmail.com>
Just adding my voice to this: we ran into it using the Replicated backup tools (which are Velero under the covers) to DigitalOcean's S3-compatible "Spaces". Setting checksumAlgorithm: "" on the BackupStorageLocation resource fixed it for us too, but I'm not able to twiddle that for the restore.
@RangerRick "Setting checksumAlgorithm: "" on the BackupStorageLocation resource fixed it for us too, but I'm not able to twiddle that for the restore." -- I'm not sure what you mean there. If it's set on the BSL, then that setting is in use for any operation that accesses the object store -- backup, restore, etc. |
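For anyone landing here, the setting lives under spec.config on the BSL; a minimal example (bucket, region, and endpoint are placeholders, and this assumes a plugin version that recognizes checksumAlgorithm):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <bucket-name>
  config:
    region: <region>
    s3Url: https://<s3-compatible-endpoint>
    # Empty string disables the request checksum that triggers the 403s
    checksumAlgorithm: ""
```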
@sseago I have checksumAlgorithm set to "" in BSL and
@sseago Sorry, specifically in Replicated's tools, which automate the entire disaster recovery process, so I have no way to hook into the space between the pulling of the metadata and when they start the restore. I'm sure they could work around it too, but it would be nice if the s3 plugin could negotiate these things clearly and transparently in the first place. |
The s3 plugin is configured via the BackupStorageLocation. This field is included there. If you are unable to configure the BSL completely (including this parameter) via Replicated tools, then you may need to open an issue against Replicated. Since this is part of the BackupStorageLocation configuration, Velero takes its configuration from there.
Thanks @sseago, since the original issue is to be resolved on the IBM side per this comment.
Discussed in #7542
Originally posted by Wayne-H-Ha March 19, 2024
We used to be able to create backups using velero 1.12.2 and aws plugin 1.8.2.
We tried velero 1.13.0 and plugin 1.9.0 and it failed, so we switched back to the older versions.
We tried again with velero 1.13.1 and plugin 1.9.1 and it still fails. Is there any configuration change we need to make in order to use the new version?
We tried to find the backup in s3 and it didn't get uploaded there.
When we describe the backup, it returns:
We believe the problem is that an "@aws" suffix is added to the key id: for example, aws_access_key_id = "3..0" but "3..0@aws" is passed to s3. Is there a configuration we can use to avoid having this suffix added?
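For context, the --secret-file handed to velero install uses the standard AWS credentials format; with IBM COS HMAC keys it would look like this (placeholder values):

```ini
[default]
aws_access_key_id = <HMAC access key id>
aws_secret_access_key = <HMAC secret access key>
```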