Using SOPS Sync to Apply takes over 600 seconds to run #3186
Comments
This looks to be a bug with SOPS and GPG specifically. The GPG golang library gets into a bad loop around asking for a passphrase but never actually prompts. |
We do not support passphrase-protected GPG keys, so I would say that is semi-expected. |
I'm seeing the same issue (I use SOPS together with Flux), and it's way too slow to be usable. I think this started happening after I added the first SOPS-encrypted secret to the Git repo; before that everything worked as expected. Can someone help with this, any advice? |
This is a bug in SOPS :( but it looks like it was just fixed 20 hours ago?! so maybe we can get this pulled in soon? getsops/sops@4f06780 |
Not sure if it's worth creating the PR. I tried downloading the latest sops binary (v3.6.0) and manually replaced it in the flux container. The first secret gets applied successfully, but at the second one Flux hangs, the same as before. I'll do some more tests to try to figure out where the problem is.
It doesn't use the sops binary. It uses the golang library. |
Ah, yes. I've just seen the binary in the container and assumed it was used. So I built a docker image from master, upgrading the SOPS version to 3.6.0 (the code is here), and pushed the docker image to dockerhub to give it a quick try, but I'm still seeing the same issue. Curious whether @PaulFarver, who submitted the initial functionality, is having the same issue. |
We don't see this issue. We have around 30 out of 105 manifests encrypted with a GPG key present in the container, and it typically takes a couple of seconds to sync. We haven't tried other ways of supplying the key. Are you both (@ursuad and @ekristen) using AWS KMS? How many manifests overall do you sync? |
Just 15 from what I can see:
For me Flux just times out; it doesn't complete. Inside the repo that flux pulls, I have an encrypted secret.
I create a secret with the GPG key in k8s, then pass the secret name to the flux chart, and the key gets mounted in the pod. Also, adding the encrypted secret manually on the pod and using the local sops binary to decrypt the file works well. |
@ursuad your setup is very similar to ours, but it looks like your issue is different from the one described in this issue, which is a long hang time at the decryption step.
Your problem should not be related to sops, since everything is already decrypted by that point. |
Ah, OK, great, thanks Paul. Then it might be related to something else, I'll have another look. |
@PaulFarver What I experience locally with SOPS on MacOS, for example, is that after a single decryption it can't decrypt anymore because it gets stuck in a loop and pegs the CPU. I tore Flux apart, put a bunch of debugging code in place, and found that it was indeed hanging on the SOPS decryption part. Eventually it is successful, but it generally takes 5+ minutes. The delay is between sync.go:73 ("trying to sync git changes") and sync.go:539; that is the window where it has issues. Now it is important for me to mention that I am using key groups with partial file encryption: one key group holds PGP keys, the other AWS KMS keys. I have a branch locally upgraded to the latest SOPS but I haven't had a chance to actually test it yet. |
@PaulFarver if you don't think it is SOPS, what do you think it could be? From my testing (locally) it definitely seems to be, but I'm happy to look into other possibilities. |
@PaulFarver I see it now. Doh. |
It looks like upgrading to SOPS 3.6 doesn't really fix the issue either; there's something more going on. The SOPS 3.6 fix does resolve the PGP issue when using the CLI tool, though. |
For some reason it hangs here: flux/pkg/cluster/kubernetes/resource/load.go, line 219 at commit 64092dd. |
@ekristen thanks for digging into this. 👍 I find it very odd that this only happens when using the library function. I have tried to look further into sops, and I'm pretty sure it is related to AWS KMS in sops, but it would be great to have some more context on this.
Can you share more of your sops configuration? I'd like to try to reproduce this. |
It could be AWS KMS. I am also doing partial file encryption. I have 3 key groups.
Only in the context of flux with sops do I see this behavior. In the deployed environment I'm using an instance profile with role assumption to decrypt the secrets. With flux, 7 resources take ~600 seconds; with the sops binary on the same host, 1.5 seconds per resource at most. |
My assumption that it was PGP-related was circumstantial, based on experience with the CLI tool getting stuck. How are you doing local development? All the docs are out of date, and it seems like recent changes have made it impossible to provide a kubeconfig directly to make dev easier. I'd be happy to debug further but got tired of trying to get a local dev env working :( |
I was having an issue with decryption so I went into the flux container to run sops and for a single run was able to see with verbose mode that after it failed the first key group (which is just PGP) it hung when moving to the second key group, with no output. All subsequent runs ran almost immediately. It appears it could be KMS getting stuck. |
@PaulFarver it seems that without PGP defined, KMS decryption is very fast and doesn't seem to get stuck, so this appears to be an issue when both are being used, and perhaps it only shows when PGP is unsuccessful. |
@PaulFarver have you gotten anywhere with this? |
@ekristen Sorry. I've been on vacation. I've just given it a shot, but I can't reproduce the problem. 😕 This is my .sops.yaml configuration:

```yaml
creation_rules:
  - shamir_threshold: 2
    encrypted_regex: "data"
    key_groups:
      - pgp:
          - [fingerprint]
      - kms:
          - arn: [arn1]
      - kms:
          - arn: [arn2]
```

I've tried both with the cli and using the go package. This is the code I use to test it in go:

```go
package main

import (
	"fmt"
	"io/ioutil"

	"go.mozilla.org/sops/v3/decrypt"
)

func main() {
	rawData, err := ioutil.ReadFile("./encrypted.yaml")
	if err != nil {
		panic(err)
	}
	decrypted, err := decrypt.Data(rawData, "yaml")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(decrypted))
}
```

As you say, it does seem like a problem with sops itself, so if you can provide steps to reproduce in a minimal setup like this, I'd suggest opening an issue over at mozilla/sops |
@PaulFarver one big difference off the get-go I see is that I'm using AssumeRole. Everything is fast locally, 1-2 seconds max for a decrypt, but once I'm in flux in AWS, 10+ minutes for 8 secrets. Ugh.

```yaml
creation_rules:
  - encrypted_regex: "^(data|stringData)$"
    shamir_threshold: 2
    key_groups:
      - kms:
          - arn: arn:aws:kms:us-east-1:000000000000:key/00000000-697c-47c5-b03b-000000000000
            role: arn:aws:iam::000000000000:role/role-to-use-key
      - kms:
          - arn: arn:aws:kms:us-west-2:111111111111:key/00000000-9987-4061-9d62-000000000000
            role: arn:aws:iam::111111111111:role/role-to-use-key
```

If I could figure out how to dev flux locally I would debug this, but the dev docs are out of date, and it seems like they ripped out all the bits that make developing locally possible, and I haven't taken the time to figure out how to make it work. |
Actually, it looks like the dev docs work now? IDK, maybe I was working off old ones. I'm going to try to get to the bottom of this. |
After testing, this is what I've found.
Since the CLI works ... I have to think this is something to do with the implementation of SOPS in Flux. |
@ekristen could you please test my PR? |
@stefanprodan I will ... however the only files being sync'd are sops-encrypted, and it's only 7 files. It really does appear to be something with how SOPS is implemented. From what I can tell so far, the SOPS binary takes a slightly different decryption path than the one implemented in Flux. I can't reproduce locally; it seems to only happen in AWS in a container, so I'm trying to figure out how to debug. I'm working on setting up a dev server in AWS to see if I can reproduce. |
Ok, I finally reproduced it! The bug: when using SOPS encryption with EC2 Instance Profiles and Assume Role, the system takes forever, but the SOPS binary using the same auth method is quick. @stefanprodan I'm building the docker image now to push to my registry to test your change for good measure. |
I've reproduced this as well with a .sops.yaml file very similar to @ekristen's. As soon as I add a newly encrypted file, it suddenly takes ~40 seconds longer to sync the files on every cycle (it typically takes only a couple of seconds). I also tried it in a clean golang pod in the same cluster with the simple main.go from #3186 (comment), but there it only takes a couple of seconds to decrypt. It's really odd to me, since it's the same implementation as in flux. |
@ekristen could this be related to the aws sdk version? Flux is on v1.27.1 and sops is on v1.23.13: https://github.com/mozilla/sops/blob/v3.5.0/go.mod#L7 Flux is using the sops library, so it should behave the same as the sops CLI. Great job finding a way to reproduce this! 🎉 |
It could be, or it could also be the way that Flux calls the decrypt; it seems to be different from how the SOPS binary works, but I'm still reversing the code to understand how it's different. The SOPS cli uses https://github.com/mozilla/sops/blob/master/cmd/sops/decrypt.go#L22 to decrypt the file, which does not go through https://github.com/mozilla/sops/blob/master/decrypt/decrypt.go, which is what Flux uses. In testing it appears that things get hung here: https://github.com/mozilla/sops/blob/master/decrypt/decrypt.go#L42, but I haven't figured out why yet. If it's the AWS library, I should be able to reproduce by upgrading SOPS to 1.27.1, compiling, and testing; let me do that. |
It doesn't appear to be the AWS library. I upgraded the SOPS source to aws-sdk-go 1.27.1 and built it, then copied it into my flux container that shows the slow decrypt and ran it against one of the files, and it was quick, so I'm back to how flux implements it. I'm spinning up an AWS instance for developing on Flux right now. |
Interesting. I cannot reproduce this developing on ubuntu with go 1.14, so maybe it's related to something weird with alpine. |
The only place I can reproduce this is running flux inside a kubernetes cluster on AWS using Instance Profiles for authentication, and only with the flux daemon running in a container. I am using 23 yaml files now; it takes roughly ~1 second per file when using the SOPS CLI to decrypt them, but still 10+ minutes for flux directly.
I'm up to 16 minutes between when it syncs and when it tries to apply. |
Things just keep getting weirder. I decided to dive into the internals of the cluster and the node and started monitoring the flux pod's network interface, and I'm seeing a TON of spurious TCP retransmissions when flux is trying to talk to the amazon ec2 metadata API, whereas the traffic when the SOPS cli is used looks much different. Additionally, I am seeing an entirely different call flow when the SOPS cli is used vs fluxd: different ec2 metadata endpoints are being used. |
Interesting. I've narrowed it down to the following: when flux uses AWS KMS via SOPS to decrypt a file, it makes a call to a new ec2 metadata endpoint; when the SOPS CLI is called directly, no such call to that endpoint is made. |
@stefanprodan I do think this could be related to the AWS SDK now; I only tested upgrading SOPS, not downgrading Flux. It seems that in version v1.25.38 (https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md#sdk-enhancements-21) they changed the default ec2 behavior, which brings in the new api endpoint I'm seeing above. |
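If downgrading is the route, a rough sketch of what the pin might look like in go.mod — the module name is hypothetical, and the pinned version is an assumption (the release just before v1.25.38, where the changelog above says the default behavior changed):

```
// go.mod sketch — module name is hypothetical; v1.25.37 is assumed to be
// the last release before the IMDSv2 default change described above.
module example.com/flux-sdk-pin

go 1.14

require github.com/aws/aws-sdk-go v1.25.37
```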
Wow, just wow. So basically AWS broke backwards compatibility in a number of ways. The crux of the issue: the newer AWS SDK turns on IMDSv2 with no way to turn it off, and AWS sets the default hop limit to 1 on all instances, so IMDSv2 responses never make it back to a container. Ultimately they are trying to make things more secure by screwing a lot of people over. See aws/aws-sdk-go#2980 and aws/aws-sdk-go#2972
It takes about 12 seconds now! Wow, what a rabbit hole! TL;DR: AWS introduced a BREAKING change to their SDK in newer versions. @stefanprodan how do you feel about downgrading the AWS SDK? |
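To make the failure mode concrete, here is a minimal Go sketch (not flux code; the endpoint and headers are the documented IMDSv2 ones) of the token handshake the newer SDK performs. With a hop limit of 1, the PUT response is dropped before it reaches a container, so the SDK retries until it times out — which is where the minutes go:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}

	// Step 1: IMDSv2 requests a session token with a PUT. With a hop
	// limit of 1, the response is dropped before it reaches a container,
	// so this call stalls and the SDK keeps retrying.
	req, err := http.NewRequest(http.MethodPut,
		"http://169.254.169.254/latest/api/token", strings.NewReader(""))
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("token request failed:", err)
		return
	}
	defer resp.Body.Close()
	token, _ := ioutil.ReadAll(resp.Body)

	// Step 2: ordinary metadata reads carry the token as a header.
	req2, _ := http.NewRequest(http.MethodGet,
		"http://169.254.169.254/latest/meta-data/iam/security-credentials/", nil)
	req2.Header.Set("X-aws-ec2-metadata-token", string(token))
	resp2, err := client.Do(req2)
	if err != nil {
		fmt.Println("metadata read failed:", err)
		return
	}
	defer resp2.Body.Close()
	role, _ := ioutil.ReadAll(resp2.Body)
	fmt.Println("instance role:", string(role))
}
```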
@ekristen we've fixed this in eksctl here by setting:

```go
launchTemplateData := &gfnec2.LaunchTemplate_LaunchTemplateData{
	IamInstanceProfile: &gfnec2.LaunchTemplate_IamInstanceProfile{
		Arn: n.instanceProfileARN,
	},
	ImageId:  gfnt.NewString(n.spec.AMI),
	UserData: n.userData,
	NetworkInterfaces: []gfnec2.LaunchTemplate_NetworkInterface{{
		// Explicitly un-setting this so that it doesn't get defaulted to true
		AssociatePublicIpAddress: nil,
		DeviceIndex:              gfnt.NewInteger(0),
		Groups:                   gfnt.NewSlice(n.securityGroups...),
	}},
	MetadataOptions: &gfnec2.LaunchTemplate_MetadataOptions{
		HttpPutResponseHopLimit: gfnt.NewInteger(2),
		HttpTokens:              gfnt.NewString(imdsv2TokensRequired),
	},
}
```

That explains why I haven't seen the delay on my test clusters.... |
That's a great tip, but I don't use EKS, and the current method I use to spin up ec2 instances doesn't provide an option to modify the metadata configuration, unfortunately. If a downgrade isn't an option, I'll have to look into writing a script to modify instance metadata options on the fly (see the sketch below). At least we know what was going on and why you couldn't reproduce. It's unfortunate that AWS still says IMDSv2 is optional, yet forces you to use it in their official SDKs. |
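For the record, such a script is straightforward with the same aws-sdk-go. A rough sketch, with assumptions called out: the instance ID comes from the command line, hop limit 2 mirrors what eksctl sets above, and keeping HttpTokens "optional" preserves the IMDSv1 fallback — adjust all three for your environment:

```go
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fix-imds <instance-id>")
		os.Exit(1)
	}
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// Raise the hop limit so IMDSv2 responses survive the extra hop
	// into a container; "optional" keeps IMDSv1 working as a fallback.
	out, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
		InstanceId:              aws.String(os.Args[1]),
		HttpPutResponseHopLimit: aws.Int64(2),
		HttpTokens:              aws.String("optional"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```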
There are fixes for the Flux v1 daemon that include SOPS upgrades (there is another one in #3514, which will likely be included in Flux v1.24.0, coming out soon). We have also done many upgrades that include a workaround for issues when the AWS Metadata API cannot be reached (#3015); that should be fixed in the upcoming release as well.

I have low expectations that anyone is hanging around waiting for this issue, as Flux v2 has much better support for SOPS, debugging, and inspection, as well as everything else these days, but I wanted to write a complete response before closing it. I am not sure how to fix this here. I have held off on some AWS SDK updates for now, because they did not apply cleanly, although they seem to apply cleanly in kingdonb#66 and might make it into the next upgrade of Flux v1 in maintenance mode.

At this time, we hope you have already begun migration away from Flux v1 / legacy over to the current version of Flux. Please note that Flux v1 remains in maintenance mode; bugs with Flux v1 can be addressed as long as it remains in maintenance, according to the plan laid out in the Flux Migration Timetable. 👍 Thanks for using Flux! |
Describe the bug
I am using SOPS with Flux, and it takes about 5 minutes from the time the daemon logs "trying to sync git changes to the cluster" to when the "sync" apply is logged.
To Reproduce
I am using AWS KMS with SOPS and I have 7 secrets.
Decrypting locally takes 1 second per secret, decrypting in the flux container takes about 1 second per secret, yet when flux runs and uses SOPS it is taking over 600 seconds.
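For anyone who wants to compare, a small timing harness (a sketch; the glob path is a placeholder) using the same decrypt.Data entry point flux uses — run it once locally and once inside the flux container:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"time"

	"go.mozilla.org/sops/v3/decrypt"
)

func main() {
	// Placeholder glob: point this at the encrypted manifests.
	files, err := filepath.Glob("secrets/*.yaml")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		raw, err := ioutil.ReadFile(f)
		if err != nil {
			panic(err)
		}
		start := time.Now()
		if _, err := decrypt.Data(raw, "yaml"); err != nil {
			panic(err)
		}
		fmt.Printf("%s decrypted in %s\n", f, time.Since(start))
	}
}
```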
Expected behavior
Syncing should take on the order of seconds for 7 secrets, consistent with how long sops decryption takes both locally and inside the flux container.
Logs
It takes 600+ seconds to sync 7 secrets?