Lots of ACL errors due to SI tokens not being found #7441
@jorgemarey Could you provide a little more information about how the tokens are changing? I suspect this is related to the watch initiated here: consul/agent/proxycfg/state.go, lines 245 to 260 at 709932f.
The token used for this request is the token that was used to register the service with the agent, or the agent's default token at the time of registration. It doesn't look like this token will ever change after we start tracking a proxy service, though, so if you were registering a service with a new token, or simply deleting that token, that could be the cause here. Regardless, a little more insight into what you are doing with token management could help to track this down.
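To make the pattern concrete, here is a minimal sketch of the behaviour being described, assuming a background watch that holds on to the registration-time token; all names here are illustrative stand-ins, not the actual consul/agent/proxycfg code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// matchIntentions is a hypothetical stand-in for the Intention.Match
// RPC; here it always fails, simulating a token deleted server-side.
func matchIntentions(token string) error {
	return errors.New("rpc error: ACL not found")
}

// proxyState mirrors the pattern described above: the token is captured
// once at service registration and never refreshed afterwards.
type proxyState struct {
	token string
}

func (s *proxyState) watchIntentions(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // stand-in for the blocking-query wait
		}
		// Every iteration reuses the registration-time token, so once
		// the token is deleted each request fails, but the loop keeps going.
		if err := matchIntentions(s.token); err != nil {
			log.Printf("intention match failed: %v", err)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	(&proxyState{token: "deleted-si-token"}).watchIntentions(ctx)
}
```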
Hi @mkeeler, thanks for the reply. I'm not doing anything special: I just run Nomad with Consul and ACLs, using the simple job file from their example. In the Nomad code they request a token here. From what I can see, the tokens that reach Consul in the Intention.Match function are the ServiceIdentity tokens that Nomad creates for the Envoy containers. Every time it adds a new container, it creates a new ServiceIdentity token for that Envoy, and when it destroys the container, Nomad deletes the token.

The errors I'm seeing are with tokens that are already deleted, but the Consul client still seems to be making requests to the servers with those tokens. Maybe this issue is on their side? But since the error persisted after I rebooted Nomad, and stopped when I restarted Consul, I thought this was more related to Consul. If I can provide more info I'll be happy to do so, but I think this is more related to what Nomad does. Thanks!
Hi again @mkeeler, I was doing some debugging and found that this is failing in a loop here: the fetch function calls refresh, and that calls fetch again without ever stopping (lines 664 to 688 at 2cf0a3c).
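For illustration, this is roughly the shape of the cycle described above, as a self-contained sketch; the function and type names are hypothetical stand-ins for the real agent/cache code:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// cache is an illustrative stand-in; the real logic lives in
// agent/cache/cache.go.
type cache struct {
	retryWait time.Duration
}

func (c *cache) doRPC(key string) error {
	return errors.New("rpc error: ACL not found") // simulate a deleted token
}

// fetch performs the RPC and then unconditionally schedules another
// fetch via refresh, whether or not the RPC succeeded and whether or
// not any caller still wants the entry.
func (c *cache) fetch(key string) {
	if err := c.doRPC(key); err != nil {
		log.Printf("fetch %s: %v", key, err)
	}
	c.refresh(key)
}

// refresh waits and then calls fetch again: fetch -> refresh -> fetch -> ...
func (c *cache) refresh(key string) {
	time.AfterFunc(c.retryWait, func() { c.fetch(key) })
}

func main() {
	(&cache{retryWait: 500 * time.Millisecond}).fetch("intention-match:deleted-token")
	time.Sleep(2 * time.Second) // let a few refresh cycles run and log
}
```

Because nothing in the cycle checks whether the entry still has consumers, a deleted token keeps generating ACL errors until the cache TTL finally evicts the entry.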
I compiled the Consul code with some more debug lines and deployed it to our test environment, and I could verify this. Could a request to ConnectAuthorize trigger that behaviour?
Hi again, just to provide more info: requests with the same token are made for 3 days, and after those 3 days the errors for that ACL token stop. I guess that's related to the cache duration configured in Consul.
@jorgemarey Thanks for all the info. This definitely does seem to be related to the cache continuing its background refresh even when nothing is requesting it. |
Hi @mkeeler, any news on this? We just upgraded a prod cluster to 1.7.4 and keep seeing this error. As we also use Nomad, this is a highly dynamic environment with a lot of tokens created and deleted. It doesn't affect the correct behaviour of the environment, but it generates a lot of error logs. I don't know if this is the same problem as the one reported here (this one is a permission denied):
If the token making the requests is the agent one, we have:
I changed service_prefix to "write" but the error persists.
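For context, the agent-token policy being edited presumably looks something like the following; this is a hypothetical reconstruction in Consul's ACL policy language (the actual policy was not included in the report), reflecting the service_prefix change mentioned above:

```hcl
# Hypothetical agent-token policy; names and rules are placeholders.
node_prefix "" {
  policy = "write"
}

# Changed from "read" to "write" per the comment above, but the
# "ACL not found" errors come from deleted tokens, not missing rules,
# so widening the policy would not be expected to help.
service_prefix "" {
  policy = "write"
}
```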
It sounds like this may be partially related to #4968. It has to do with cache TTLs being 72h, which doesn't work well with tokens that change frequently. #8092 made some changes in this area to allow the cache to stop waiting on a fetch if a context is cancelled. It may be possible to pass the context further into the cache's fetch path.
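A minimal sketch of that idea, assuming a hypothetical fetchWithContext helper (this is not the #8092 diff, just an illustration of context-aware waiting):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowRPC stands in for a blocking cache fetch waiting on the servers.
func slowRPC(key string) error {
	time.Sleep(10 * time.Minute) // blocking query waiting on an index
	return errors.New("unreachable in this demo")
}

// fetchWithContext races the blocking fetch against the caller's
// context, so a caller that goes away cancels the wait instead of
// leaving a background refresh running for the full 72h TTL.
func fetchWithContext(ctx context.Context, key string) error {
	result := make(chan error, 1)
	go func() { result <- slowRPC(key) }()
	select {
	case err := <-result:
		return err
	case <-ctx.Done():
		return ctx.Err() // caller gave up; do not reschedule a refresh
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(fetchWithContext(ctx, "intention-match:deleted-token"))
}
```

With the caller's context wired through like this, a proxy task that is destroyed would cancel its watches, and the refresh cycle would stop with them instead of running until the TTL expires.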
I'm not absolutely sure that we have the same problem, but we also experience a flood of ACL errors after migrating to Workload Identities and some random deployments in Nomad. First, it fails on …
Overview of the Issue
Hi, we have a cluster with Consul 1.7.1 and Nomad 0.10.4, and we're testing Connect with ACLs in that environment. After some tests (everything worked perfectly), we're seeing the following:
Client logs:
These log messages are appearing on some nodes at a rate of several per minute. I added some extra logging, compiled, and swapped the binary on a server node, and I'm seeing Intention.Match requests made with tokens that were already deleted (from previous Envoy Nomad tasks):
Server logs:
After restarting Consul, these log messages stop appearing.
I don't know if this is related to the cache the agents maintain, but I don't think these messages should appear constantly.
Thanks!