Kafka read error with SASL auth: TOPIC_AUTHORIZATION_FAILED: Not authorized to access topics: [Topic authorization failed.] + GROUP_AUTHORIZATION_FAILED: Not authorized to access group: Group authorization failed #205
Is it possible to grab some logs before the auth failure? Specifically around the broker dialing and connecting. The code looks fine, so something else may be going on. Connections should be reauthenticating, and they should reauthenticate quickly at the right moment. Perhaps something odd is happening on the AWS side that this client "should" account for even more. |
So I have a small "bug" in how I am running the code; let me know if you think this matters (I didn't think it should). Here are the logs, and I will add a bit more code to my comment above for context. To begin with, everything is just getting logged on host A, and host B has no logs for hours.
Initially, host A continues working just fine:
Host A continues reading until 21:45:13, when it hits the
As mentioned before, host B is printing the same logs in a loop
And this pattern continues until 21:46:12, when it just stops
But host A continues to print the fetch errors:
and continues to do so until 21:49:46
|
Could you add debug logging and then look for messages that say "sasl"? It's not clear what's up with the logs above. AWS_MSK_IAM should indicate the credential session expiry when the connection is opened. The client does not close connections when the response indicates an auth failure -- perhaps it should. That'd be a bit to wire in. The auth failure shouldn't really be happening, so I'm wondering if Amazon is unexpectedly expiring the session early for some reason. |
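For reference, turning on debug logging in franz-go looks roughly like this (a minimal sketch; the broker address is a placeholder):

```go
package main

import (
	"os"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder; use your MSK bootstrap brokers
		// Debug-level logging prints the sasl session lifetime and each
		// reauthentication attempt, which is what to look for here.
		kgo.WithLogger(kgo.BasicLogger(os.Stderr, kgo.LogLevelDebug, nil)),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```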
Redeployed with the new log level. I will post them when they occur. It seemed like they happened every 12 hours like clockwork. This is what the general read loop looks like for the host that isn't getting the new data:
And this is what the non-reading host's boot looks like:
And the general read loop for the host that is actively consuming from the one partition in the consumer group:
While the boot sequence for the host that consumes from the consumer group looks like this:
And every once in a while they both do something like this:
As you can see, the host that does read "synced" first, and actually has a partition in the map in the "entering OnPartitionsAssigned" log. |
Okay, so here are the logs of the error with debug mode enabled within franz-go. One thing that is different, though, is that it never fell into the infinite error loop, just the revoking of old partitions and assigning of new ones, so there were far fewer error logs and it resolved after 20 seconds: As expected, the errors happened when it was time to re-authenticate the sasl connection.
And here are the logs of the container that is actively reading:
Maybe every time there needs to be a sasl re-authentication, a rebalance happens? Any ideas on how to avoid this bug? |
So far I continue to not get TOPIC_AUTHORIZATION_FAILED as an error log, but I can now see it as a debug log.
I wanted to reduce clutter since there were already a lot of logs to grok. Did you want to make sure each broker was re-authenticating? The logs are always like
I didn't post the logs before since there are thousands of small read loops.
I think you are onto something though.
In staging it is much more random (
Both environments get the GROUP_AUTHORIZATION_FAILED btw. Let me post more logs from the staging environment since that is simpler to follow, as it has 3 brokers as opposed to 6.
Over the course of 2 days the authentication sequence happens 395 times for brokers 1 and 3, but only 39 times for broker 2. Still enough to re-authenticate every 12 hours, though. I will post logs of what this looks like in a following comment, along with more explicit logs leading up to broker 2's authentication sequence.
I think I mentioned in another issue that I was debugging all the popular Go kafka clients together, having them write heartbeat messages and then read those messages back. They all use the same consumer group. There are no errors reading from a topic in kafka-go, but there are errors when writing to a topic in kafka-go (only when I am using sasl AWS_MSK_IAM): segmentio/kafka-go#992 Do you have a Discord? I don't mind sending a large attachment of logs in a DM over there |
A couple hours of logs around sasl re-authentication... marking the broker:
And more explicit logs of broker 2 sequence when it fails:
As you can see, it is kind of rudderless until |
Coming back to this -- IMO this looks like an AWS problem. It sounds like the clock on the affected broker may be a bit screwy or something and credentials are timing out sooner than they should. I'm not sure what can be done within the client to work around this. The client already works around some other problems of AWS (re-authenticating within 1s if AWS gives a 1s expiry...). Per KIP-368, if a client continues to use a connection after SASL has expired, the broker should close the connection ("and for brokers to close connections that continue to use expired sessions"). The client doesn't close connections for failed auth because usually this implies ACL problems, and closing and reopening the connection doesn't improve anything (and a user can add ACLs on the fly and the client will begin authenticating successfully). Is it possible to get some support from Amazon on this issue? |
Yeah, let me reach out to them. edit: Still an ongoing discussion. They have said: "These errors generally occur because of insufficient permissions/actions given in authorization policy attached to IAM role which is used to authorize while interacting with MSK cluster." But I pointed out that I have tested with very broad permissions and still see the errors.
Waiting to hear back on that point. |
fwiw, I think that others have used AWS_MSK_IAM without running into this issue? |
Hey there, for some reason I'm not seeing your reply on the issue (only in my email), so I'll reply through email as well:
Are you saying that you did not change franz-go at all, did not bump the version, but something in the cluster itself changed such that now you are seeing the same issue in this thread?
On Wed, Oct 19, 2022 at 7:17 AM _ksco wrote:
No, I also encountered this problem, but no new issues were released. In fact, when I used this library in May of this year, I did not observe this error under normal production and consumption. But then the project architecture changed and left kafka idle, without any production or consumption of messages, and now I see [29] Topic Authorization Failed: the client is not authorized to access the requested topic and [3] Unknown Topic Or Partition: the request is for a topic or partition that does not exist on this broker.
|
Still working with AWS. They haven't been much help. Why do you think I am only seeing this on my kafka readers but not my kafka writers (which still use sasl msk_iam auth)? Also, it is weird that the error itself is an authorization error... but then it just gets resolved without me changing the IAM settings. How could this happen? Why is a topic not authorized for access one minute but authorized the next? I see the errors are defined by you: Line 106 in eb2e62d
Is the error just a best guess, or are you getting the error code map from the aws docs? I looked over the aws auth part of the code and I don't see anything that could be going wrong. Maybe there is an issue in determining how long we should wait before re-auth? Lines 785 to 912 in eb2e62d
Would it make sense, when getting an authorization error, to just try a sasl re-auth to be safe? The fact that the errors only ever appear for a few minutes, never something like 30 minutes, makes it seem like a clock issue, such as sasl expiring and then not re-authenticating, rather than an actual auth problem. |
Are the writers constantly writing? Or, are the writes infrequent? Group consumers by default heartbeat every 3s, and fetches are long polling, so both of these will hit authorization errors relatively immediately. If your writes are constant, this is even more of a mystery because the auth is the same and there should be absolutely no difference (and imo points even more to an AWS problem).
From the logs above, it looks like the errors happen relatively close to when a token is supposed to expire. As you noted, the errors occur until the sasl expiry, at which point the client chooses to re-authenticate. This loads fresh credentials, re-authenticates, and then things start working again.
These are not defined by me, these are Kafka protocol errors: https://kafka.apache.org/protocol#protocol_error_codes
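For example, a minimal sketch of how those protocol error codes map to Go errors via franz-go's kerr package:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/twmb/franz-go/pkg/kerr"
)

func main() {
	// kerr maps Kafka protocol error codes to Go errors; franz-go does not
	// invent the codes itself. In the Kafka protocol, 29 is
	// TOPIC_AUTHORIZATION_FAILED and 30 is GROUP_AUTHORIZATION_FAILED.
	err := kerr.ErrorForCode(29)
	fmt.Println(errors.Is(err, kerr.TopicAuthorizationFailed)) // true
	fmt.Println(errors.Is(err, kerr.GroupAuthorizationFailed)) // false
}
```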
The client is already pessimistic in calculating time. Before #136 was addressed, if the broker said the sasl lifetime was 5hr, the client would reauthenticate at 4h59m59s. After #136, we now require 1.1x the sasl latency lower bound, or 2.5s, whichever is larger. In your logs above, the sasl flow is nearly instantaneous, so I expect that reauthentication is always ~11h59m57.5s after the initial sasl flow (which we can also see above).
This is an option but it's ugly internally and would require a lot of wiring, and it's not really a good solution for the normal case of a person actually truly not having perms: closing and reopening connections only to continue rejecting requests is harder on brokers than just rejecting requests (and could lead to socket exhaustion)
Yes, the client could be even more pessimistic. Six hours is way pessimistic, I'd probably err on 15 to 20 minutes earlier, max. FWIW, the Java client itself is also very pessimistic; it reauths after a random percentage of the lifetime ms between 85% and 95%: https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L687-L692. This is probably why Java clients do not run into the problem. I'll add a more pessimistic reauth calculation (edited out pseudocode, see real code) |
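As a rough illustration only (the actual pseudocode was edited out above, and the released fix may differ), a Java-style reauthentication deadline at a random point between 85% and 95% of the granted session lifetime could be computed like this:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// reauthAt picks a Java-client-style reauthentication time: a uniformly
// random point between 85% and 95% of the broker-granted session lifetime.
func reauthAt(start time.Time, lifetime time.Duration) time.Time {
	frac := 0.85 + rand.Float64()*0.10 // uniform in [0.85, 0.95)
	return start.Add(time.Duration(float64(lifetime) * frac))
}

func main() {
	// Example: the ~12h session lifetime seen in the logs above.
	fmt.Println("reauthenticate at:", reauthAt(time.Now(), 12*time.Hour))
}
```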
The writers write every 5 seconds. There are 2 writers since I have 2 replicas running. So there is a write on average every 2.5 seconds.
Okay, yeah, the fact that msk gives me an unauthorized error seems to indicate the brokers prematurely expire the session.
I don't think I have seen it error for more than 10 minutes. Generally there are errors for 3-5 minutes before the sasl re-auth happens.
wow okay
Yeah, I think it makes sense to match what the java client is doing. |
@ekeric13 are you able to try the updates branch to see if that fixes your issue? |
@twmb yeah I deployed it the other day (soon after I responded to you) and I haven't seen any errors yet! Was waiting a bit longer to call it "done" though since in the past it would randomly go a while without errors as well. But so far so good. |
Awesome, good to hear. I'll wait another day or two and assume that no response (or good response) means it's good, and a reply that it's bad indicates things are still bugged |
Yeah, I think it is fair to say the change is good |
This is released in v1.9.1 |
I have a service that continuously writes to (and then reads from) the same topic every 5 seconds.
I have two different clients using franz-go, one with TLS auth and one with SASL via msk_iam auth.
My writer seems to be fine, but my reader errors every few hours. The errors are always the same.
I will get 200k of these errors over the course of 4 minutes and then everything resumes fine again.
My reader code was based heavily on this example:
https://github.com/twmb/franz-go/blob/master/examples/goroutine_per_partition_consuming/autocommit_marks/main.go
This is a snippet of the code:
And this is how the client is instantiated:
partitions assigned callback:
partitions lost callback:
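Since the snippets referenced above did not survive the copy, here is a minimal, simplified sketch of this consuming setup with AWS_MSK_IAM sasl. The broker addresses, topic, and group are placeholders, and the per-partition goroutines of the linked example are collapsed into a single poll loop, so this is not the original code:

```go
package main

import (
	"context"
	"crypto/tls"
	"log"

	awscfg "github.com/aws/aws-sdk-go-v2/config"
	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/sasl/aws"
)

func main() {
	ctx := context.Background()

	// Placeholders: the real values come from the MSK cluster.
	seeds := []string{"b-1.example.kafka.us-east-1.amazonaws.com:9098"}
	topic, group := "example-topic", "example-group"

	cfg, err := awscfg.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	cl, err := kgo.NewClient(
		kgo.SeedBrokers(seeds...),
		kgo.ConsumeTopics(topic),
		kgo.ConsumerGroup(group),
		// Autocommit only records that have been explicitly marked.
		kgo.AutoCommitMarks(),
		kgo.OnPartitionsAssigned(func(_ context.Context, _ *kgo.Client, assigned map[string][]int32) {
			log.Printf("assigned: %v", assigned)
		}),
		kgo.OnPartitionsLost(func(_ context.Context, _ *kgo.Client, lost map[string][]int32) {
			log.Printf("lost: %v", lost)
		}),
		// MSK IAM needs TLS plus the AWS_MSK_IAM sasl mechanism. The
		// credential callback runs on every (re)authentication, so rotated
		// session tokens are picked up.
		kgo.DialTLSConfig(new(tls.Config)),
		kgo.SASL(aws.ManagedStreamingIAM(func(ctx context.Context) (aws.Auth, error) {
			creds, err := cfg.Credentials.Retrieve(ctx)
			if err != nil {
				return aws.Auth{}, err
			}
			return aws.Auth{
				AccessKey:    creds.AccessKeyID,
				SecretKey:    creds.SecretAccessKey,
				SessionToken: creds.SessionToken,
			}, nil
		})),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(ctx)
		if fetches.IsClientClosed() {
			return
		}
		fetches.EachError(func(t string, p int32, err error) {
			// This is where TOPIC_AUTHORIZATION_FAILED surfaces as a fetch error.
			log.Printf("fetch error on %s/%d: %v", t, p, err)
		})
		fetches.EachPartition(func(p kgo.FetchTopicPartition) {
			for _, rec := range p.Records {
				_ = rec // process the record here
			}
			// Mark processed records so the marked autocommit can commit them.
			cl.MarkCommitRecords(p.Records...)
		})
	}
}
```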