AWS_MSK_IAM: kept getting GROUP_AUTHORIZATION_FAILED after several hours of running #147
I'll try to help, but I haven't seen this before, and I think a lot of this will involve having you look into things. If you're ok with that:

That broker error looks like it comes directly from the Kafka source code, where it is ensuring the principal is unchanged. AFAICT, this is ensuring that the principal is unchanged on the same connection. The franz-go source supports re-authenticating. I don't know exactly how the AccessKey vs. SessionToken map to the principal, but my guess is that during this reauthentication, one of these is changing and is mapping to a different principal behind the scenes.

What's weird to me is that this error should be received during SASL reauthentication, so I'd actually expect connections to fail to be opened entirely...

Do you know which of the AccessKey or SessionToken is mapping to the principal? And, if possible, can the principal be made static?
Any luck?
I am sorry, I currently have no spare time to investigate this further, so last Friday I just went with a temporary fix by implementing the OnGroupManageError hook:

    import (
        "errors"
        "syscall"

        "github.com/twmb/franz-go/pkg/kerr"
        "github.com/twmb/franz-go/pkg/kgo"
    )

    type onGroupManageError struct{}

    // Compile-time check that the type satisfies the hook interface.
    var _ kgo.HookGroupManageError = (*onGroupManageError)(nil)

    func (ogme onGroupManageError) OnGroupManageError(err error) {
        if errors.Is(err, kerr.GroupAuthorizationFailed) {
            // Ask our own process to shut down; syscall.Kill is not available on Windows.
            if killErr := syscall.Kill(syscall.Getpid(), syscall.SIGINT); killErr != nil {
                panic(killErr) // the signal could not be sent
            }
        }
    }

For extra context:

Thank you for your support. I think we can close this issue for now while I gather more info.
Rather than killing itself, would
I'm open to leaving this issue open or closing it, but if you do happen to look into it and figure out what's up, this would be great to document.
Actually, going to close for now, but if you can think of a good place to put this documented workaround, lmk and we can drop it in a document somewhere (otherwise this issue will be good for history when people search). And, if you figure out what is up, that'd be awesome. Thank you!
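For anyone finding this through search: here is a minimal sketch of a gentler variant of the workaround above, where the hook signals the code that owns the client instead of sending SIGINT to the process. It uses only the standard library so it runs on its own. The names `restartOnAuthFailure`, `newRestartHook`, `restartRequested`, and the stand-in error `errGroupAuthorizationFailed` are hypothetical. In a real program you would compare against `kerr.GroupAuthorizationFailed` and register the hook with `kgo.WithHooks`. Whether closing and recreating the client actually clears the condition is not confirmed anywhere in this thread, so treat it as an assumption to test.

```go
package main

import (
	"errors"
	"fmt"
)

// errGroupAuthorizationFailed is a hypothetical stand-in for
// kerr.GroupAuthorizationFailed so the sketch compiles without franz-go.
var errGroupAuthorizationFailed = errors.New("GROUP_AUTHORIZATION_FAILED")

// restartOnAuthFailure has the same method shape as kgo.HookGroupManageError.
// Rather than killing the process, it notifies whoever owns the client.
type restartOnAuthFailure struct {
	restart chan struct{} // capacity 1, see newRestartHook
}

func newRestartHook() *restartOnAuthFailure {
	return &restartOnAuthFailure{restart: make(chan struct{}, 1)}
}

// OnGroupManageError records a restart request for authorization failures
// and ignores every other error.
func (h *restartOnAuthFailure) OnGroupManageError(err error) {
	if !errors.Is(err, errGroupAuthorizationFailed) {
		return
	}
	// Non-blocking send: repeated failures collapse into one pending
	// signal, and the hook never stalls its caller.
	select {
	case h.restart <- struct{}{}:
	default:
	}
}

// restartRequested reports whether a signal is pending and consumes it.
func (h *restartOnAuthFailure) restartRequested() bool {
	select {
	case <-h.restart:
		return true
	default:
		return false
	}
}

func main() {
	h := newRestartHook()

	h.OnGroupManageError(errors.New("some other error"))
	fmt.Println(h.restartRequested()) // false

	// Wrapped errors are matched too, because the hook uses errors.Is.
	h.OnGroupManageError(fmt.Errorf("heartbeat: %w", errGroupAuthorizationFailed))
	h.OnGroupManageError(errGroupAuthorizationFailed)
	fmt.Println(h.restartRequested()) // true
	fmt.Println(h.restartRequested()) // false
}
```

The poll loop that owns the client could check `restartRequested()` (or select on the channel) between polls, then close the client and build a fresh one, so the rest of the process keeps running.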
Related issue: #205
I'll start injecting group errors into polling just once in the next release with this PR: #387
I use AWS_MSK_IAM as the SASL method, and I cache the AWS credentials using NewCredentialsCache. The SASL option looks like so:
My code works, but randomly after 10+ hours my service gets caught in an infinite loop with an error that looks like this:
When I look at the broker logs in AWS MSK from around the time the error first happened, this error appears (details redacted):
Is this caused by an invalid AWS IAM role when fetching offsets or heartbeating?
Should I implement an OnGroupManageError hook to handle this "non-retriable" error? To keep it simple, I can only think of restarting my app from the OnGroupManageError hook... but is there any way to recover from this error without needing to restart the app or panic when it happens?
Thank you for your time with this great library.