
AWS_MSK_IAM: kept getting GROUP_AUTHORIZATION_FAILED after several hours of running #147

Closed
gunturaf opened this issue Mar 17, 2022 · 8 comments

gunturaf commented Mar 17, 2022

I use AWS_MSK_IAM for the SASL method, and I also cache the AWS credentials using NewCredentialsCache. The SASL option looks like so:

clientName := "my-app"

// read AWS from local profile:
awsConf, err := config.LoadDefaultConfig(ctx)
if err != nil {
	panic(err)
}

// AWS cached credentials like so:
credProvider := aws.NewCredentialsCache(
				awsConf.Credentials,
				func(options *aws.CredentialsCacheOptions) {
					options.ExpiryWindow = 20 * time.Second
					options.ExpiryWindowJitterFrac = 0.5
				},
			)

// the SASL option looks like this:
kgo.SASL(
	aws.ManagedStreamingIAM(func(ctx context.Context) (aws.Auth, error) {
		val, err2 := credProvider.Retrieve(ctx)
		if err2 != nil {
			return aws.Auth{}, err2
		}
		return aws.Auth{
			AccessKey:    val.AccessKeyID,
			SecretKey:    val.SecretAccessKey,
			SessionToken: val.SessionToken,
			UserAgent:    clientName,
		}, nil
	}),
)
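
For completeness, a rough sketch of how an option like this gets passed to kgo.NewClient; the seed broker, group, and topic names, and the authFn variable (standing in for the ManagedStreamingIAM callback above) are illustrative placeholders:

// sketch only: authFn stands in for the callback shown above; broker, group,
// and topic names are placeholders; the TLS config needs "crypto/tls".
cl, err := kgo.NewClient(
	kgo.SeedBrokers("b-1.example.kafka.ap-southeast-1.amazonaws.com:9098"),
	kgo.DialTLSConfig(new(tls.Config)), // MSK IAM listeners require TLS
	kgo.SASL(kaws.ManagedStreamingIAM(authFn)),
	kgo.ConsumerGroup("my-group"),
	kgo.ConsumeTopics("my-topic"),
)
if err != nil {
	panic(err)
}
defer cl.Close()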

My code works, but randomly after 10+ hours my service gets caught in an infinite loop with errors that look like this:

[screenshot: client logs repeatedly showing GROUP_AUTHORIZATION_FAILED]

and when I look at the broker logs in AWS MSK around the time the error first happened, this error appears (details redacted):

[2022-03-17T03:05:13.000+07:00] INFO [SocketServer brokerId=2] Failed re-authentication with ip-10-X-XXX-XXX.ap-southeast-1.compute.internal/INTERNAL_IP (Cannot change principals during re-authentication from IAM.arn:aws:sts::XXXX0842XXXX:assumed-role/my-app-role/1647457506320093398: IAM.arn:aws:sts::XXXX0842XXXX:assumed-role/my-app-role/1647461106753581893) (org.apache.kafka.common.network.Selector)

Is this caused by an invalid AWS IAM role when fetching offsets/heartbeating?
Should I implement an OnGroupManageError hook to handle this "non-retriable" error? The simplest thing I can think of is to restart my app from the OnGroupManageError hook... but is there any way to recover from this error without restarting the app or calling panic when it happens?

Thank you for your time with this great library.

twmb (Owner) commented Mar 17, 2022

I'll try to help, but I haven't seen this before and I think a lot of this will be having you look into things. If you're ok with that:

That broker error comes directly from the Kafka source code, where it is ensuring the principal is unchanged. AFAICT, this check is about the principal staying the same on the same connection. The franz-go source supports re-authenticating.

I don't know how exactly the AccessKey vs SessionToken map to the principal, but my guess is that during this reauthentication, one of these is changing and is mapping to a different principal behind the scenes. What's weird to me is that this error should be received during sasl reauthentication, so I'd actually expect connections to fail to be opened entirely...

Do you know which of the AccessKey or SessionToken is mapping to the principal? And, if possible, can the principal be made static?
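
One way to check is to wrap the auth callback so it logs whenever the credentials handed to SASL change between (re)authentications; a rough sketch, reusing the credProvider, clientName, and kaws names from the snippet above (imports: context, log, sync):

// sketch: log credential rotation at each SASL (re)authentication
var (
	credMu    sync.Mutex
	lastKey   string
	lastToken string
)

authFn := func(ctx context.Context) (kaws.Auth, error) {
	val, err := credProvider.Retrieve(ctx)
	if err != nil {
		return kaws.Auth{}, err
	}
	credMu.Lock()
	if lastKey != "" && (lastKey != val.AccessKeyID || lastToken != val.SessionToken) {
		log.Printf("credentials rotated: access key changed=%v, session token changed=%v",
			lastKey != val.AccessKeyID, lastToken != val.SessionToken)
	}
	lastKey, lastToken = val.AccessKeyID, val.SessionToken
	credMu.Unlock()
	return kaws.Auth{
		AccessKey:    val.AccessKeyID,
		SecretKey:    val.SecretAccessKey,
		SessionToken: val.SessionToken,
		UserAgent:    clientName,
	}, nil
}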

twmb (Owner) commented Mar 18, 2022

Any luck?

gunturaf (Author) commented

I am sorry that I currently have no spare time to investigate this further, so last Friday I went with a temporary fix: implementing OnGroupManageError to gracefully restart my container. This is the code I use, for anyone who wants it:

import (
	"errors"
	"syscall"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
)

type onGroupManageError struct{}

var _ kgo.HookGroupManageError = (*onGroupManageError)(nil)

func (ogme onGroupManageError) OnGroupManageError(err error) {
	if errors.Is(err, kerr.GroupAuthorizationFailed) {
		// send SIGINT to ourselves to trigger a graceful shutdown
		err := syscall.Kill(syscall.Getpid(), syscall.SIGINT) // this code will not work on Windows
		if err != nil {
			panic(err) // should trigger panic on Windows
		}
	}
}
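
For this hook to fire it also has to be registered on the client; a minimal sketch, with the other client options elided:

// sketch: register the hook when constructing the client
cl, err := kgo.NewClient(
	kgo.WithHooks(onGroupManageError{}),
	// ... seed brokers, SASL, consumer group, topics ...
)
if err != nil {
	panic(err)
}
defer cl.Close()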

For extra context:
I have a topic with 10 partitions. I run this consumer group in a Kubernetes setup (AWS EKS) with HPA turned on and minReplicas set to 2, so at most each replica is assigned 5 partitions until a rebalance happens due to a scale-up or scale-down.
Each container gets the same IAM role, assigned by AWS EKS via a Service Account, and each container in this Deployment is assigned the same AWS_ROLE_ARN value.
My uninformed, no-brainer assumption is that when more than one replica is running at the same time, at random times (or right before the temporary credential gets invalidated), somewhere in the middle of fetching offsets the renewal of one container's temporary credential gets affected by another container with the same IAM role. I will investigate this issue again, perhaps later this week, by turning off the HPA for a day so that only one container exists at a time; if the issue still persists, I should definitely sit down with my infra team for an in-depth investigation.

Thank you for your support. I think we can close this issue for now while I gather more info.

twmb (Owner) commented Mar 21, 2022

Rather than killing itself, would os.Exit(..) work?
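
A sketch of that variant, reusing the hook type above (note that os.Exit skips deferred functions, so flush or commit anything important first):

func (ogme onGroupManageError) OnGroupManageError(err error) {
	if errors.Is(err, kerr.GroupAuthorizationFailed) {
		os.Exit(1) // portable (including Windows), but bypasses deferred cleanup
	}
}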

I'm open to leaving this issue open or closing it, but if you do happen to look into it and figure out what's up, that would be great to document.

twmb (Owner) commented Mar 24, 2022

Actually, going to close for now, but if you can think of a good place to put this documented workaround, lmk and we can drop it in a document somewhere (otherwise this issue will be good for history when people search). And, if you figure out what is up, that'd be awesome. Thank you!

twmb closed this as completed Mar 24, 2022
twmb (Owner) commented Oct 23, 2022

Related issue: #205

twmb (Owner) commented Oct 31, 2022

This may be related to #205, and may be fixed now with eb6e3b5 which is released in v1.9.1.

twmb added a commit that referenced this issue Mar 11, 2023
Multiple users have been hit by this and have not had visibility.

In fact, the original HookGroupManageError can be attributed to having
no visibility.

For #147, #321, #379.
twmb (Owner) commented Mar 11, 2023

I'll start injecting group errors into polling just once in the next release with this PR: #387
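
Once group errors are surfaced through polling, they can be handled in the poll loop itself; a rough sketch of what that might look like (the shutdown policy here is only illustrative):

// sketch: check fetch errors on each poll; a non-retriable group error such as
// GROUP_AUTHORIZATION_FAILED can then be handled in one place
for {
	fetches := cl.PollFetches(ctx)
	fetches.EachError(func(topic string, partition int32, err error) {
		if errors.Is(err, kerr.GroupAuthorizationFailed) {
			log.Printf("group authorization failed on %s[%d]: %v; shutting down", topic, partition, err)
			os.Exit(1)
		}
	})
	fetches.EachRecord(func(r *kgo.Record) {
		// process r ...
	})
}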
