
AWS_MSK_IAM: kept getting GROUP_AUTHORIZATION_FAILED after several hours of running #147

Closed
gunturaf opened this issue Mar 17, 2022 · 8 comments

gunturaf commented Mar 17, 2022

I use AWS_MSK_IAM for the SASL method, and I also cache the AWS credentials using NewCredentialsCache. The SASL option looks like so:

clientName := "my-app"

// read AWS from local profile:
awsConf, err := config.LoadDefaultConfig(ctx)
if err != nil {
	panic(err)
}

// AWS cached credentials like so:
credProvider := aws.NewCredentialsCache(
				awsConf.Credentials,
				func(options *aws.CredentialsCacheOptions) {
					options.ExpiryWindow = 20 * time.Second
					options.ExpiryWindowJitterFrac = 0.5
				},
			)

// the SASL option looks like this:
kgo.SASL(
	aws.ManagedStreamingIAM(func(ctx context.Context) (aws.Auth, error) {
		val, err2 := credProvider.Retrieve(ctx)
		if err2 != nil {
			return aws.Auth{}, err2
		}
		return aws.Auth{
			AccessKey:    val.AccessKeyID,
			SecretKey:    val.SecretAccessKey,
			SessionToken: val.SessionToken,
			UserAgent:    clientName,
		}, nil
	}),
)
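
For completeness, a rough sketch of how an option like this gets passed to kgo.NewClient; the seed broker, group, and topic names, and the authFn variable (standing in for the ManagedStreamingIAM callback above) are illustrative placeholders:

// sketch only: authFn stands in for the callback shown above; broker, group,
// and topic names are placeholders; the TLS config needs "crypto/tls".
cl, err := kgo.NewClient(
	kgo.SeedBrokers("b-1.example.kafka.ap-southeast-1.amazonaws.com:9098"),
	kgo.DialTLSConfig(new(tls.Config)), // MSK IAM listeners require TLS
	kgo.SASL(kaws.ManagedStreamingIAM(authFn)),
	kgo.ConsumerGroup("my-group"),
	kgo.ConsumeTopics("my-topic"),
)
if err != nil {
	panic(err)
}
defer cl.Close()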

My code works, but randomly after 10+ hours my service gets caught in an infinite loop with errors that look like this:

[screenshot: client logs repeatedly showing GROUP_AUTHORIZATION_FAILED]

and when I look at the broker logs in AWS MSK around the time the error first happened, this error appears (details redacted):

[2022-03-17T03:05:13.000+07:00] INFO [SocketServer brokerId=2] Failed re-authentication with ip-10-X-XXX-XXX.ap-southeast-1.compute.internal/INTERNAL_IP (Cannot change principals during re-authentication from IAM.arn:aws:sts::XXXX0842XXXX:assumed-role/my-app-role/1647457506320093398: IAM.arn:aws:sts::XXXX0842XXXX:assumed-role/my-app-role/1647461106753581893) (org.apache.kafka.common.network.Selector)

Is this caused by an invalid AWS IAM role when fetching offsets/heartbeating?
Should I implement an OnGroupManageError hook to handle this "non-retriable" error? The simplest thing I can think of is to restart my app from the OnGroupManageError hook... but is there any way to recover from this error without restarting the app or calling panic when it happens?

Thank you for your time with this great library.

twmb (Owner) commented Mar 17, 2022

I'll try to help, but I haven't seen this before and I think a lot of this will be having you look into things. If you're ok with that:

That broker error comes directly from the Kafka source code, where it is ensuring the principal is unchanged. AFAICT, this check is about the principal staying the same on the same connection. The franz-go source supports re-authenticating.

I don't know how exactly the AccessKey vs SessionToken map to the principal, but my guess is that during this reauthentication, one of these is changing and is mapping to a different principal behind the scenes. What's weird to me is that this error should be received during sasl reauthentication, so I'd actually expect connections to fail to be opened entirely...

Do you know which of the AccessKey or SessionToken is mapping to the principal? And, if possible, can the principal be made static?
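
One way to check is to wrap the auth callback so it logs whenever the credentials handed to SASL change between (re)authentications; a rough sketch, reusing the credProvider, clientName, and kaws names from the snippet above (imports: context, log, sync):

// sketch: log credential rotation at each SASL (re)authentication
var (
	credMu    sync.Mutex
	lastKey   string
	lastToken string
)

authFn := func(ctx context.Context) (kaws.Auth, error) {
	val, err := credProvider.Retrieve(ctx)
	if err != nil {
		return kaws.Auth{}, err
	}
	credMu.Lock()
	if lastKey != "" && (lastKey != val.AccessKeyID || lastToken != val.SessionToken) {
		log.Printf("credentials rotated: access key changed=%v, session token changed=%v",
			lastKey != val.AccessKeyID, lastToken != val.SessionToken)
	}
	lastKey, lastToken = val.AccessKeyID, val.SessionToken
	credMu.Unlock()
	return kaws.Auth{
		AccessKey:    val.AccessKeyID,
		SecretKey:    val.SecretAccessKey,
		SessionToken: val.SessionToken,
		UserAgent:    clientName,
	}, nil
}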

twmb (Owner) commented Mar 18, 2022

Any luck?

gunturaf (Author) commented

I am sorry that I currently have no spare time to investigate this further, so last Friday I went with a temporary fix: implementing OnGroupManageError to gracefully restart my container. This is the code I use, for anyone who wants it:

import (
	"errors"
	"syscall"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
)

type onGroupManageError struct{}

var _ kgo.HookGroupManageError = (*onGroupManageError)(nil)

func (ogme onGroupManageError) OnGroupManageError(err error) {
	if errors.Is(err, kerr.GroupAuthorizationFailed) {
		// send SIGINT to ourselves to trigger a graceful shutdown
		err := syscall.Kill(syscall.Getpid(), syscall.SIGINT) // this code will not work on Windows
		if err != nil {
			panic(err) // should trigger panic on Windows
		}
	}
}
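
For this hook to fire it also has to be registered on the client; a minimal sketch, with the other client options elided:

// sketch: register the hook when constructing the client
cl, err := kgo.NewClient(
	kgo.WithHooks(onGroupManageError{}),
	// ... seed brokers, SASL, consumer group, topics ...
)
if err != nil {
	panic(err)
}
defer cl.Close()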

For extra context:
I have a topic with 10 partitions. I run this consumer group in a Kubernetes setup (AWS EKS) with HPA turned on and minReplicas set to 2, so at most each replica is assigned 5 partitions until a rebalance happens due to a scale-up or scale-down.
Each container gets the same IAM role, assigned by AWS EKS via a Service Account, and each container in this Deployment is assigned the same AWS_ROLE_ARN value.
My uninformed, no-brainer assumption is that when more than one replica is running at the same time, at random times (or right before the temporary credential gets invalidated), somewhere in the middle of fetching offsets the renewal of one container's temporary credential gets affected by another container with the same IAM role. I will investigate this issue again, perhaps later this week, by turning off the HPA for a day so that only one container exists at a time; if the issue still persists, I should definitely sit down with my infra team for an in-depth investigation.

Thank you for your support. I think we can close this issue for now while I gather more info.

twmb (Owner) commented Mar 21, 2022

Rather than killing itself, would os.Exit(..) work?
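
A sketch of that variant, reusing the hook type above (note that os.Exit skips deferred functions, so flush or commit anything important first):

func (ogme onGroupManageError) OnGroupManageError(err error) {
	if errors.Is(err, kerr.GroupAuthorizationFailed) {
		os.Exit(1) // portable (including Windows), but bypasses deferred cleanup
	}
}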

I'm open to leaving this issue open or closing it, but if you do happen to look into it and figure out what's up, that would be great to document.

twmb (Owner) commented Mar 24, 2022

Actually, going to close for now, but if you can think of a good place to put this documented workaround, lmk and we can drop it in a document somewhere (otherwise this issue will be good for history when people search). And, if you figure out what is up, that'd be awesome. Thank you!

twmb closed this as completed Mar 24, 2022
twmb (Owner) commented Oct 23, 2022

Related issue: #205

twmb (Owner) commented Oct 31, 2022

This may be related to #205, and may be fixed now with eb6e3b5 which is released in v1.9.1.

twmb added a commit that referenced this issue Mar 11, 2023
Multiple users have been hit by this and have not had visibility.

In fact, the original HookGroupManageError can be attributed to having
no visibility.

For #147, #321, #379.
twmb (Owner) commented Mar 11, 2023

I'll start injecting group errors into polling just once in the next release with this PR: #387
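
Once group errors are surfaced through polling, they can be handled in the poll loop itself; a rough sketch of what that might look like (the shutdown policy here is only illustrative):

// sketch: check fetch errors on each poll; a non-retriable group error such as
// GROUP_AUTHORIZATION_FAILED can then be handled in one place
for {
	fetches := cl.PollFetches(ctx)
	fetches.EachError(func(topic string, partition int32, err error) {
		if errors.Is(err, kerr.GroupAuthorizationFailed) {
			log.Printf("group authorization failed on %s[%d]: %v; shutting down", topic, partition, err)
			os.Exit(1)
		}
	})
	fetches.EachRecord(func(r *kgo.Record) {
		// process r ...
	})
}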
