
Getting "invalid short sasl lifetime millis" periodically with AWS MSK using SASL authentication #136

Closed
Tracked by #135
sharonx opened this issue Feb 22, 2022 · 21 comments

sharonx commented Feb 22, 2022

I'm getting "invalid short sasl lifetime millis" periodically (after a few hours) with AWS MSK using SASL authentication.

Looking at the code, it errors out when lifetimeMillis is less than 5 sec but re-authenticates if the time is greater than 5 sec.

The periodic re-authentication works most of the time because the AWS auth token expires every 15 min, and I only see this issue after a few hours. When it happens, lifetimeMillis becomes less than 5s. Could this be a clock sync issue? Any idea how I can prevent this error from happening?
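To make the behavior I'm describing concrete, here's a minimal Go sketch (my own illustrative names, not the actual franz-go source; exactly when it re-authenticates before expiry is my guess):

```go
package main

import (
	"fmt"
	"time"
)

// reauthenticate stands in for the client's SASL re-authentication step.
func reauthenticate() {}

// minLifetime mirrors the 5s cutoff described above.
const minLifetime = 5 * time.Second

// scheduleReauth sketches the behavior: error out on lifetimes under 5s,
// otherwise schedule a re-authentication before the session expires.
func scheduleReauth(lifetimeMillis int64) error {
	lifetime := time.Duration(lifetimeMillis) * time.Millisecond
	if lifetime < minLifetime {
		// This is the path I keep hitting after a few hours.
		return fmt.Errorf("invalid short sasl lifetime millis: %d", lifetimeMillis)
	}
	time.AfterFunc(lifetime-minLifetime, reauthenticate)
	return nil
}

func main() {
	fmt.Println(scheduleReauth(1114))    // a short lifetime: errors out
	fmt.Println(scheduleReauth(900_000)) // a normal ~15 min token: schedules a reauth
}
```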

twmb (Owner) commented Feb 22, 2022

@sharonx what milli value is being logged there? It surprises me that AWS would report that a token is valid for <5s.

Also, does this seem to happen every 15 hours?

sharonx (Author) commented Feb 22, 2022

Thanks @twmb for the quick reply.

The millis logged are anywhere between 1000 and 5000. This happens irregularly, anywhere from 2 to 12 hours apart.

twmb (Owner) commented Feb 22, 2022

Have you seen <1000? I can remove the restriction on tokens being at least 5s.

I just took a skim of the Java code; it looks like it applies no restriction on the token lifetime, but chooses somewhere between 85% and 95% of the lifetime to reauthenticate.

It's pretty odd that the broker returns such short-lived tokens IMO, but besides added latency, it's ok. My main worry would be spin looping a bit if I try to reauthenticate, get such a short lifetime that I reauthenticate again immediately, and then again.
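As a rough Go sketch of the Java behavior I'm describing (the 85%/95% window is from my skim; the function name is illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// reauthDelay sketches the Java client's approach: no lower bound on the
// lifetime, just a uniformly random reauth point between 85% and 95% of it.
func reauthDelay(lifetime time.Duration) time.Duration {
	frac := 0.85 + 0.10*rand.Float64()
	return time.Duration(float64(lifetime) * frac)
}

func main() {
	// A 15 minute token reauths somewhere around the 12m45s to 14m15s mark.
	fmt.Println(reauthDelay(15 * time.Minute))
}
```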

Are you open to trying the v1.4 branch? I'd rather put this in v1.4 than have another v1.3 patch release, and I'm planning to tag v1.4 either tonight or tomorrow (waiting for some feedback on one commit in it).

sharonx (Author) commented Feb 22, 2022

I haven't seen any log with < 1000. The smallest I've seen is 1114 out of 35 data points. Is it possible to change the code to re-authenticate as long as the token has less than a 5 sec (or configurable) lifetime? But like you said, I think removing the 5s restriction would work too.

I'm open to trying the v1.4 branch.

twmb (Owner) commented Feb 22, 2022

Cool, I'll push a commit to v1.4 in 40 minutes to an hour; this will be promoted to stable shortly.

sharonx (Author) commented Feb 22, 2022

Awesome! Thanks again @twmb

twmb (Owner) commented Feb 22, 2022

Mind trying dc51630?

sharonx (Author) commented Feb 23, 2022

I just pulled the commit and am running the test now. I'll let it run for a while and report back tomorrow.

twmb (Owner) commented Feb 23, 2022

@sharonx actually I just realized a small mistake with that commit and have squashed a fix into it. The commit as linked previously used the lifetime as given from the broker directly, rather than subtracting anything off of it (bad variable reuse; the new linters I use would mostly have noticed this). If you can, use 6e4c5b5, which now uses the new expiry variable properly.
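To illustrate the mistake (a rough sketch rather than the literal diff; `buffer` here is just a stand-in for whatever gets subtracted):

```go
package main

import (
	"fmt"
	"time"
)

// nextReauth illustrates the fix; buffer stands in for whatever is
// subtracted off the broker-reported lifetime.
func nextReauth(lifetimeMillis int64, buffer time.Duration) time.Time {
	lifetime := time.Duration(lifetimeMillis) * time.Millisecond

	// Before (the bad variable reuse): the raw lifetime was used directly,
	// so nothing was subtracted before the next re-authentication:
	//   return time.Now().Add(lifetime)

	// After: a separate expiry with the buffer taken off is what gets used.
	expiry := time.Now().Add(lifetime - buffer)
	return expiry
}

func main() {
	fmt.Println(nextReauth(900_000, 5*time.Second)) // ~15 min token, 5s buffer
}
```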

sharonx (Author) commented Feb 23, 2022

@twmb I tested commit 6e4c5b5 with two jobs that continuously write data to a Kafka topic; they ran for 13 hours. Each job runs inside a k8s pod. One pod failed after about 6 hours with an error:

Error: TOPIC_AUTHORIZATION_FAILED: Not authorized to access topics: [Topic authorization failed.]

twmb (Owner) commented Feb 23, 2022

Interesting. I wonder if reauthentication wasn't quick enough because it happens at 80% of the session lifetime, which may be too late when the session lifetime is <5s. Maybe I could change it to just reauthenticate immediately if the lifetime is <5s.

sharonx (Author) commented Feb 23, 2022

I think reauthenticating immediately if <5s is a better idea. It would be more flexible if I could configure the time, but that's not a big deal.

Another option would be to retry authentication when TOPIC_AUTHORIZATION_FAILED or a similar auth issue happens. Such issues sometimes happen due to MSK maintenance. This is not a big deal for us, though, since the maintenance doesn't happen very often and our pods will restart and try to authenticate again.

twmb (Owner) commented Feb 23, 2022

Per KIP-368, AWS should be killing the connection, not accepting requests and rejecting them as invalid auth.

I'll take a look at another form of patching this and squash it into place again (one benefit of an unmerged branch: cleaner commits...). Ideally the ETA is <1hr. I'm also working on a new feature or two that I want to squeeze into v1.4, so an official tag may not come until next week. I'll post here when I squash into place.

twmb (Owner) commented Feb 23, 2022

New commit to try: 3e6f0fd.

Now, we always take 2.5s off the lifetime (or 1.1x the reauth latency, whichever is larger, though I expect 2.5s to always be larger). If this causes an immediate reauth, we sleep 100ms to avoid hot looping. If we reauth 15x in a row (which is overkill), the client will kill the connection itself and retry on a new connection.

This should also avoid the TOPIC_AUTHORIZATION_FAILED error, which is indistinguishable from the case where you actually did remove permissions from this account at runtime (so we cannot react to that error reliably).
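In rough Go, the new timing looks something like this (a sketch of what I described above, not the exact source; names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sessionSleep sketches the new timing: always take 2.5s off the lifetime,
// or 1.1x the observed reauth latency, whichever is larger. If that leaves
// nothing, the reauth is "immediate" and we sleep 100ms to avoid hot looping.
func sessionSleep(lifetime, reauthLatency time.Duration) (sleep time.Duration, immediate bool) {
	buffer := 2500 * time.Millisecond
	if padded := reauthLatency + reauthLatency/10; padded > buffer {
		buffer = padded
	}
	if sleep = lifetime - buffer; sleep <= 0 {
		return 100 * time.Millisecond, true
	}
	return sleep, false
}

var errTooManyReauths = errors.New("too many consecutive immediate sasl reauthentications")

// trackImmediate sketches the 15x-in-a-row cutoff: past that, the client
// kills the connection itself and retries on a new connection.
func trackImmediate(immediate bool, consecutive *int) error {
	if !immediate {
		*consecutive = 0
		return nil
	}
	if *consecutive++; *consecutive >= 15 {
		return errTooManyReauths
	}
	return nil
}

func main() {
	sleep, immediate := sessionSleep(2*time.Second, 50*time.Millisecond)
	fmt.Println(sleep, immediate) // 100ms true: a 2s lifetime reauths immediately
}
```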

sharonx (Author) commented Feb 23, 2022

Sounds good. I'm running the test with the new commit. I'll report back by tomorrow.

sharonx (Author) commented Feb 24, 2022

@twmb I think this new commit is stable. I have two pods that have been running for over 20 hours without any restarts 🥳!

I think the next step is to make a new release with this commit. Is this something you can pick up soon?

twmb (Owner) commented Feb 24, 2022

I'm currently testing whether it makes sense to serialize calling all promises, as well as touching up a few things in how promises are currently called. I expect 1.4 to be tagged sometime next week (preferably the earlier half of the week).

twmb (Owner) commented Feb 27, 2022

1.4 is going to be a large release, so I'm going to split this fix out into v1.3.5 in case you'd rather avoid bumping to v1.4 (although I'd appreciate more testing of 1.4 😁).

twmb (Owner) commented Feb 28, 2022

This has been released in v1.3.5!

twmb closed this as completed Feb 28, 2022
twmb (Owner) commented Mar 1, 2022

(also v1.4.0 is tagged now)

sharonx (Author) commented Mar 4, 2022

Thank you again. I upgraded to 1.3.5 and everything works well so far!
