Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Using SOPS Sync to Apply takes over 600 seconds to run #3186

Closed
ekristen opened this issue Jul 9, 2020 · 44 comments
Closed

Using SOPS Sync to Apply takes over 600 seconds to run #3186

ekristen opened this issue Jul 9, 2020 · 44 comments
Assignees
Labels
blocked-needs-validation Issue is waiting to be validated before we can proceed bug

Comments

@ekristen
Copy link

ekristen commented Jul 9, 2020

Describe the bug

I am using SOPS with Flux and it takes about 5 minutes from the time the daemon says it's "trying to sync git changes to the cluster" to when the "sync" apply is logged.

To Reproduce

I am using AWS KMS with SOPS, I have 7 secrets.

Decrypting locally takes 1 second per secret, decrypting in the flux container takes about 1 second per secret, yet when flux runs and uses SOPS it is taking over 600 seconds.

Expected behavior

A clear and concise description of what you expected to happen.

Logs

It takes 600+ seconds to sync 7 secrets?

ts=2020-07-09T22:57:59.978015617Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1f65bff30457753268cb055e97c0e97f2aa13084 new=1f65bff30457753268cb055e97c0e97f2aa13084
ts=2020-07-09T23:02:51.230198126Z caller=sync.go:539 method=Sync cmd=apply args= count=7
ts=2020-07-09T23:02:51.892497125Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=662.221356ms 

Additional context

  • Flux version: 1.19.0
  • Kubernetes version: 1.17.6
  • Git provider: GitLab
  • Container registry provider: Docker.io
@ekristen ekristen added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Jul 9, 2020
@ekristen
Copy link
Author

This looks to be a bug with SOPS and GPG specifically. The GPG golang gets into a bad loop around asking for passphrase but doesn't actually ask.

@hiddeco
Copy link
Member

hiddeco commented Jul 10, 2020

We do not support passphrased GPG keys so I would say that is semi-expected.

@ursuad
Copy link

ursuad commented Jul 16, 2020

I'm seeing the same issue (I use SOPS together with Flux), and it's way too slow to be usable.
The key doesn't have any passphrase, can use it locally perfectly fine, and I don't see any errors in the Flux logs.

I think this started happening after I added the first secret encrypted with SOPS in the Git repo, before that everything looked good and worked as expected.

Can someone help with this, any advice?

@ekristen
Copy link
Author

This is a bug in SOPS :( but it looks like it was just fixed 20 hours ago?! so maybe we can get this upstreamed soon? getsops/sops@4f06780

@ursuad
Copy link

ursuad commented Jul 16, 2020

Ah, yes, was just looking at that. Found something similar.
I think the last release has the merge here with the original issue being tracked here.

Latest SOPS release (3.6.0) was 20h ago, I suspect just upgrading to the new version here should fix it.
I'll create a PR

@ursuad
Copy link

ursuad commented Jul 16, 2020

Not sure if it's worth creating the PR. I tried downloading the latest sops binary (v3.6.0) and manually replaced it in the flux container.
The first SOPS secret gets applied successfully, but at the second one, Flux hangs, the same as before.
I'll try and do some more tests to try to figure out where the problem is.

@ekristen
Copy link
Author

ekristen commented Jul 16, 2020 via email

@ursuad
Copy link

ursuad commented Jul 17, 2020

Ah, yes. I've just seen the binary in the container and I assumed it used it.

So I built a docker image from master, upgrading the SOPS version to 3.6.0 (the code is here) and pushed the docker image to dockerhub to give it a quick try, but I'm still seeing the same issue.

Curious if maybe @PaulFarver who submitted the initial functionality is having the same issue.

@PaulFarver
Copy link
Contributor

Curious if maybe @PaulFarver who submitted the initial functionality is having the same issue.

We don't see this issue. We have around 30 out of 105 manifests encrypted with a GPG key present in the container, and it typically takes a couple of seconds to sync. We haven't tried other ways of supplying the key. Are you both (@ursuad and @ekristen) using AWS KMS? How many manifests overall do you sync?

@ursuad
Copy link

ursuad commented Jul 17, 2020

Just 15 from what I can see:

flux-698979d747-stvtj flux ts=2020-07-17T09:19:14.410282013Z caller=sync.go:60 component=daemon info="trying to sync git changes to the cluster" old=53f3b23fe3488091c190f75ce2b409b522152d9c new=a69a63e85802031d100d6c71b3cf4a016b8b4e9f
flux-698979d747-stvtj flux ts=2020-07-17T09:19:15.851499946Z caller=sync.go:540 method=Sync cmd=apply args= count=15

For me Flux just times out, it doesn't complete.

Inside the repo that flux pulls, I have a .sops.yaml to tell sops how to encrypt the files locally. Also, it might be worth mentioning, that I'm only encrypting part of the file, k8s secrets, using encrypted_regex.

  - path_regex: \.enc\.yaml
    encrypted_regex: '^(data|stringData)$'

I create a secret with the GPG key in k8s. Then I pass the secret name to the flux chart under gpgKeys.secretName ( These keys will be imported into GPG in the Flux container.)

The key gets mounted here: /root/gpg-import/private/ and is imported correctly, as it shows under gpg --list-keys.

Also, adding the encrypted secret manually on the pod and using the local sops binary to do decrypt the file, works well.

@PaulFarver
Copy link
Contributor

@ursuad your setup is very similar to ours, but It looks like your issue is different from the one described in the issue.

This issue describes a long hangtime at trying to sync git changes to the cluster, as seen in the log times from the description:

ts=2020-07-09T22:57:59.978015617Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1f65bff30457753268cb055e97c0e97f2aa13084 new=1f65bff30457753268cb055e97c0e97f2aa13084
ts=2020-07-09T23:02:51.230198126Z caller=sync.go:539 method=Sync cmd=apply args= count=7

Your problem should not be related to sops, since everything is already decrypted, once you get to method=Sync cmd=apply args= count=15 to my knowledge.

@ursuad
Copy link

ursuad commented Jul 17, 2020

Ah, OK, great, thanks Paul. Then it might be related to something else, I'll have another look.

@ekristen
Copy link
Author

@PaulFarver What I experience locally with SOPS on MacOS for example is that after a single decryption it can't decrypt anymore because it get's stuck in a loop and pegs the CPU.

I tore Flux apart and put a bunch of debugging code in place and found that it was indeed hanging on the SOPS decryption part. Eventually it is successful but generally takes 5+ minutes.

From sync.go:73 -> trying to sync git changes to sync.go:539 is the delay because it's within this time it's having issues.


Now it is important for me to mention that I am using key groups with partial file encryption. In one set of key groups it's PGP keys in the other it's AWS KMS keys.

I have a branch locally upgraded to the latest SOPS but I haven't had a chance to actually test it yet.

@ekristen
Copy link
Author

@PaulFarver if you don't think it is SOPS what do you think it could be? From my testing (locally) it definitely seems to be, but happy to look into other possibilities?

@PaulFarver
Copy link
Contributor

@ekristen Just to clarify, I was referring to the issue @ursuad is having, when I said this was not related to sops, which seems to be different from yours

@ekristen
Copy link
Author

@PaulFarver I see it now. Doh.

@ekristen
Copy link
Author

ekristen commented Jul 17, 2020

It looks like upgrading to SOPS 3.6 doesn't really fix the issue either. There's something more going on.

It does look like the SOPS 3.6 fix fixes the PGP issue using the CLI tool though.

@ekristen
Copy link
Author

For some reason is hangs here

decryptedData, err := decrypt.Data(rawData, "yaml")
and diving deeper into the SOPS code it appears here more specifically https://github.com/mozilla/sops/blob/master/decrypt/decrypt.go#L42

@PaulFarver
Copy link
Contributor

@ekristen thanks for digging into this. 👍

I find it very odd, that this only happens when using the library function. I have tried to look further into sops, and I'm pretty sure it is related to AWS KMS in sops, but it would be great, to have some more context on this.

Now it is important for me to mention that I am using key groups with partial file encryption. In one set of key groups it's PGP keys in the other it's AWS KMS keys.

Can you share more on your sops configuration? I'd like to try to reproduce this.

@ekristen
Copy link
Author

ekristen commented Jul 22, 2020

It could be AWS KMS.

I am also doing partial file encryption.

I have 3 key groups.

  • PGP keys
  • KMS key
  • KMS key

Only in the context of flux with sops do I see this behavior.

In deployed environment I’m using instance profile with role assumption to decrypt the secrets.

With flux for 7 resources it take ~600 seconds. With sops binary on same host 1.5 seconds per resource at most.

@ekristen
Copy link
Author

My assumption that it was PGP related was circumstantial based on experience with CLI tool getting stuck.

How are you doing local development? All the docs are out of date and seems like recent changes has made it impossible to provide a kubeconfig directly to make dev easier. I’d be happy to debug further but got tired of trying to get a local dev env working :(

@ekristen
Copy link
Author

I was having an issue with decryption so I went into the flux container to run sops and for a single run was able to see with verbose mode that after it failed the first key group (which is just PGP) it hung when moving to the second key group, with no output.

All subsequent runs ran almost immediately.

It appears it could be KMS getting stuck.

@ekristen
Copy link
Author

@PaulFarver it seems that without PGP defined, KMS decryption is very fast and doesn;t seem to get stuck, so this appears to be an issue when both are being used and perhaps only shows when PGP is unsuccessful.

@ekristen
Copy link
Author

ekristen commented Aug 6, 2020

@PaulFarver have you gotten anywhere with this?

@PaulFarver
Copy link
Contributor

@ekristen Sorry. I've been on vacation.

I've just given it a shot, but I can't reproduce the problem. 😕

This is my .sops.yaml configuration:

creation_rules:
  - shamir_threshold: 2
    encrypted_regex: "data"
    key_groups:
    - pgp: 
      - [fingerprint]
    - kms:
      - arn: [arn1]
    - kms:
      - arn: [arn2]

I've tried both with the cli and using the go package. This is the code i use to test it in go:

package main

import (
	"fmt"
	"io/ioutil"

	"go.mozilla.org/sops/v3/decrypt"
)

func main() {
	rawData, err := ioutil.ReadFile("./encrypted.yaml")
	if err != nil {
		panic(err)
	}
	decrypted, err := decrypt.Data(rawData, "yaml")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(decrypted))
}

As you say it does seem like a problem with sops itself, so If you can provide steps to reproduce in a minimal setup like this, I'd suggest opening an issue over at mozilla/sops

@ekristen
Copy link
Author

@PaulFarver one big difference off the get go I see is that I'm using AssumeRole. Everything is fast locally 1-2 seconds max for a decrypt, but once am in flux in AWS, 10+ minutes for 8 secrets. Ugh.

creation_rules:
  - encrypted_regex: "^(data|stringData)$"
    shamir_threshold: 2
    key_groups:
      - kms:
          - arn: arn:aws:kms:us-east-1:000000000000:key/00000000-697c-47c5-b03b-000000000000
            role: arn:aws:iam::000000000000:role/role-to-use-key
      - kms:
          - arn: arn:aws:kms:us-west-2:111111111111:key/00000000-9987-4061-9d62-000000000000
            role: arn:aws:iam::111111111111:role/role-to-use-key

If I could figure out how to dev flux locally I would debug this, but the dev docs are out of date and it seems like they ripped out all the bits that make developing locally possible and I haven't taken the time to figure out how to make it work.

@ekristen
Copy link
Author

Actually looks like dev docs work now? IDK maybe I was working off old. I'm going to try and get to the bottom of this.

@ekristen
Copy link
Author

ekristen commented Aug 14, 2020

After testing this is what I've found.

  • Baremetal (PGP & KMS w/ Role w/ AWS ENV KEYS) - SOPS Binary from Flux Container - FAST
  • Baremetal (KMS w/ Role w/ AWS ENV KEYS) - Flux Daemon w/ SOPS - FAST
  • AWS Cloud (PGP & KMS w/ Role w/ IAM Profile) - SOPS Binary from Flux Container - FAST
  • AWS Cloud (PGP & KMS w/ Role w/ IAM Profile) - Flux Daemon w/ SOPS - SLOW

Since the CLI works ... I have to think this is something to do with the implementation of SOPS in Flux.

@stefanprodan
Copy link
Member

@ekristen could you please test my PR?

@ekristen
Copy link
Author

ekristen commented Aug 15, 2020

@stefanprodan I will ... however the only files being sync'd are sops encrypted and it's only 7 files. It really does appear to be something with how SOPS is implemented. From what I can tell so far it seems like the SOPS binary calls a slightly different decryption path than how it's implemented in Flux.

I can't reproduce locally, seems to only happen when in AWS in a container, so I'm trying to figure out how to debug. I'm working on setting up a dev server in AWS to see if I can reproduce.

@ekristen
Copy link
Author

Ok I finally reproduced! The bug is when using SOPS encryption with EC2 Instance Profiles and Assume Role the system takes forever, but the SOPS binary using the same auth method is quick.

@stefanprodan I'm building the docker image now to push to my registry to test your change for good measure.

@PaulFarver
Copy link
Contributor

PaulFarver commented Aug 15, 2020

I've reproduced this as well with a .sops.yaml file very similar to @ekristen's. As soon as I add a newly encrypted file it suddenly takes ~40 seconds longer to sync the files on every cycle. (it typically only takes a couple of seconds)

I also tried it in a clean golang pod in the same cluster with the simple main.go from #3186 (comment), but it only takes a couple of seconds to decrypt it there. It's really odd to me, since it's the same implementation as in flux.

@stefanprodan
Copy link
Member

@ekristen could this be related to the aws sdk version? Flux is on v1.27.1 and sops on v1.23.13 https://github.com/mozilla/sops/blob/v3.5.0/go.mod#L7 Flux is using the sops library, it should behave the same with the sops CLI.

Great job finding a way to reproduce this! 🎉

@ekristen
Copy link
Author

It could be, it could also be the way that Flux is calling the decrypt, it seems to be different from how the SOPS binary works, but I'm still reversing the code to understand how it's different.

The SOPS cli uses https://github.com/mozilla/sops/blob/master/cmd/sops/decrypt.go#L22 to decrypt the file which is not using https://github.com/mozilla/sops/blob/master/decrypt/decrypt.go which is what Flux uses.

In testing it appears that things get hung here https://github.com/mozilla/sops/blob/master/decrypt/decrypt.go#L42 but I haven't figured out why yet.

If its the AWS library, I should be able to reproduce by upgrading SOPS to 1.27.1 compiling and testing, let me do that.

@ekristen
Copy link
Author

It doesn't appear to be the AWS library, I upgraded the SOPS source to 1.27.1 and built it, then copied into my flux container that shows the slow decrypt and ran it against one of the files and it was so I'm back to how flux implements. I'm spinning up an AWS instances for developing on Flux right now.

@ekristen
Copy link
Author

Interesting, I cannot reproduce developing this on ubuntu with go 1.14, so maybe it's related to something weird with alpine.

@ekristen
Copy link
Author

ekristen commented Aug 15, 2020

The only place I can reproduce this is running flux inside a kubernetes cluster on AWS using Instance Profiles for authentication and only with the flux daemon running in a container. I am using 23 yaml files now, it takes roughly ~1 second per file when using the SOPS CLI to decrypt the files, but still 10+ minutes for flux directly.

ts=2020-08-15T18:49:51.531572388Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@localhost/secrets branch=master HEAD=50da81d6b2685aba2c6ab7c6f2353516beb0980e
ts=2020-08-15T18:50:22.019959431Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@localhost/secrets branch=master HEAD=50da81d6b2685aba2c6ab7c6f2353516beb0980e
ts=2020-08-15T18:50:53.195133036Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@localhost/secrets branch=master HEAD=50da81d6b2685aba2c6ab7c6f2353516beb0980e
ts=2020-08-15T18:51:24.412777618Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@localhost/secrets branch=master HEAD=50da81d6b2685aba2c6ab7c6f2353516beb0980e
ts=2020-08-15T18:51:51.533534834Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=50da81d6b2685aba2c6ab7c6f2353516beb0980e new=50da81d6b2685aba2c6ab7c6f2353516beb0980e
ts=2020-08-15T19:07:42.815749269Z caller=sync.go:540 method=Sync cmd=apply args= count=23
ts=2020-08-15T19:07:43.681487541Z caller=sync.go:606 method=Sync cmd="kubectl apply -f -" took=865.658131ms err=null output="secret/one unchanged\nsecret/two unchanged\nsecret/three unchanged\nsecret/four unchanged\nsecret/five unchanged\nsecret/six unchanged\nsecret/seven unchanged\nsecret/eight unchanged\nsecret/nine unchanged\nsecret/ten unchanged\nsecret/eleven unchanged\nsecret/twelve unchanged\nsecret/thirteen unchanged\nsecret/fourteen unchanged\nsecret/fifteen unchanged\nsecret/sixteen unchanged\nsecret/seventeen unchanged\nsecret/eighteen unchanged\nsecret/nineteen unchanged\nsecret/twenty configured\nsecret/twenty-one unchanged\nsecret/twenty-two unchanged\nsecret/twenty-three unchanged"

I'm up to 16 minutes between when it syncs and tries to apply

@ekristen
Copy link
Author

Things just keep getting weirder. I decided to dive into the internals of the cluster and the node and started monitoring the FLUX pod network interface and I'm seeing a TON of TCP spurious re-transmissions when FLUX is trying to talk to amazon ec2 metadata API. Where as monitoring the network when the SOPS cli is used is much different.

Additionally I am seeing an entirely different call flow when SOPS cli is used vs FLUXD. Different ec2 metadata endpoints are being used.

@ekristen
Copy link
Author

Interesting, I've narrowed it down to the following.

When flux uses AWS KMS via SOPS to decrypt the ec2-metadat/latest/api/token endpoint gets called, for whatever reason, this takes FOREVER to return a response.

When SOPS CLI is called directly no such call to ec2-metadata/latest/api/token is made.

@ekristen
Copy link
Author

@stefanprodan I do think this could be related to the AWS SDK now. I only tested upgrading SOPS not downgrading Flux.

It seems that in version v1.25.38 https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md#sdk-enhancements-21 they changed the default ec2 behavior which brings in this new api endpoint that I'm seeing above.

@ekristen
Copy link
Author

Wow just wow.

So basically AWS broke backwards compatibility in a number of ways. The crux of the issue is the newer AWS SDK that turns on IMDSv2 with no way to turn it off and then they set the default hop limit to 1 on all instances by default. Ultimately they are trying to make things more secure by screwing a lot of people over.

See aws/aws-sdk-go#2980 and aws/aws-sdk-go#2972

aws ec2 modify-instance-metadata-options --instance-id INSTANCE_ID --http-endpoint enabled --http-put-response-hop-limit 2

It takes about 12 seconds now! Wow, what a rabbit hole!

TL;DR ---

AWS introduced a BREAKING change to their SDK on newer versions.


@stefanprodan how do you feel about downgrading the AWS SDK?

@stefanprodan
Copy link
Member

stefanprodan commented Aug 16, 2020

@ekristen we've fixed this in eksctl here by setting HttpPutResponseHopLimit: 2 for all Kubernetes nodes https://github.com/weaveworks/eksctl/blob/master/pkg/cfn/builder/nodegroup.go#L269-L272

	launchTemplateData := &gfnec2.LaunchTemplate_LaunchTemplateData{
		IamInstanceProfile: &gfnec2.LaunchTemplate_IamInstanceProfile{
			Arn: n.instanceProfileARN,
		},
		ImageId:  gfnt.NewString(n.spec.AMI),
		UserData: n.userData,
		NetworkInterfaces: []gfnec2.LaunchTemplate_NetworkInterface{{
			// Explicitly un-setting this so that it doesn't get defaulted to true
			AssociatePublicIpAddress: nil,
			DeviceIndex:              gfnt.NewInteger(0),
			Groups:                   gfnt.NewSlice(n.securityGroups...),
		}},
		MetadataOptions: &gfnec2.LaunchTemplate_MetadataOptions{
			HttpPutResponseHopLimit: gfnt.NewInteger(2),
			HttpTokens:              gfnt.NewString(imdsv2TokensRequired),
		},
	}

That explains why I haven't seen the delay on my test clusters....

@ekristen
Copy link
Author

ekristen commented Aug 16, 2020

That’s a great tip but I don’t use EKS and the current method used to spin up ec2 instances doesn’t provide an option to modify the metadata configuration unfortunately.

If downgrade isn’t an option, I’ll have to look into writing a script to modify instance metadata options on the fly.

At least we know what was going on and why you couldn’t reproduce.

It’s unfortunate that AWS says IMDSv2 is optional still yet they force you to use it in their official sdks.

@kingdonb kingdonb self-assigned this Feb 20, 2021
@kingdonb
Copy link
Member

There are fixes for the Flux v1 daemon that include SOPS upgrades (there is another one in #3514 which will likely get included in Flux v1.24.0 coming out soon.)

We have also done many upgrades that include a workaround for issues when the AWS Metadata API cannot be reached. #3015 - that should be fixed in the upcoming release as well, or you can check one of the flux-prerelease images that were built this morning to try and see if it could help. If you are still struggling with this issue, please try the last release of Flux v1.

I have low expectations that anyone is hanging around waiting for this issue, as Flux v2 has much better support for SOPS and debugging and inspection, as well as everything else these days, but I wanted to write a complete response before closing it.

I am not sure how to fix this. I have held off on some AWS SDK updates for now, because they did not apply cleanly, although they seem to be applied cleanly in kingdonb#66 and might make it into the next upgrade of Flux v1 in maintenance mode.

At this time, we hope you have already begun migration away from Flux v1 / legacy over to the current version of Flux.

Please note that Flux v1 remains in maintenance mode. Bugs with Flux v1 can be addressed as long as it remains in maintenance, according to the plan laid out in the Flux Migration Timetable. 👍

Thanks for using Flux!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked-needs-validation Issue is waiting to be validated before we can proceed bug
Projects
None yet
Development

No branches or pull requests

6 participants