Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd Find seems unreliable when watch is enabled #512

Closed
zolug opened this issue Jul 24, 2024 · 1 comment · Fixed by #523
Closed

etcd Find seems unreliable when watch is enabled #512

zolug opened this issue Jul 24, 2024 · 1 comment · Fixed by #523
Assignees
Labels
bug Something isn't working question Further information is requested

Comments

@zolug
Copy link

zolug commented Jul 24, 2024

I have some questions and possible concerns with etcdNSERegistryServer's Find feature when the query has watch enabled.

My goal would be to monitor creation and deletion of all NSEs in a particular network service and handle the NSM connections independently for each NSE.

For this, we rely on NetworkServiceEndpointRegistryClient.Find to create a stream to read events from.

1.,
If I got it right, this stream will get closed in 60 seconds latest due to how etcdNSERegistryServer's watch() function creates the Watch object. Meaning the user must call NetworkServiceEndpointRegistryClient.Find again periodically.

My concerns with this approach is, that it does not look feasible when the goal is to track NSE removals reliably. Because an NSE might unregister in that gap when the the previous stream is already closed but the new one is not established yet.

I guess there can be a similar gap when the user stream gets terminated because the local nsmgr disappears e.g. due to upgrade. (The NSE might have run on a different worker than the user watching it.)
Similarly, the registry POD serving a watch might disappear as well.

2.,
Find allows removal of expired NSEs by checking their expiration time when listing them from etcd.
If this Find call that takes care of such NSE removal has a request with watch enabled, the user won't be notified about the removal because the watch part is not running yet, and no Deleted response is sent on the stream when doing removal.

3.,
etcdNSERegistryServer's handleWatcher() function enforces a weird version check between the etcd NSE event and the locally stored NSE item. And skips events if the above two match.

This causes a really weird behaviour:
There will be a race between active Find watch calls and Registers (the latter updating the locally stored version of NSE). Thus in case of multiple active Find watch calls, some of the watchers might skip events that others might accept depending on when the local version will get updated in the sync Map.

I fail to see the purpose if this version check. Maybe the intention was to implement some kind of damper not to send too many events, but IMHO it might be a faulty design. There could be dozens of Find streams parallel with watch enabled, and they all must be informed anyways.
Luckily, due to the 1 minute timeout of the Watcher object the Find stream only lives for a minute tops, thus when the user calls Find the next time and the code fetches the NSEs using List action then they finally can be informed (although with some delay) about available NSE.

Btw., the number k8s-registry replicas can add an extra funkiness depending where NSE Registers and Finds are served.

Fortunately, removal does not seem to be affected by this version check, because Delete changes the resourceVersion.

Summary:

As a summary - if I haven't misinterpreted sg. - in case of multiple watcher some users watching NSEs might miss removal notifications due to 1. and 2.
And an issue of minor importance can occur due to 3. where create notification might be received with some delay.
Thus, for me it seems if I wanted to reliably keep track of NSE removals, my code should execute an additional Find call without watch to get hold of the current set of available NSEs whenever my watch stream returns so to say. (And compare that set against the NSEs my code is aware of.)

@zolug zolug added bug Something isn't working question Further information is requested labels Jul 24, 2024
@denis-tingaikin denis-tingaikin self-assigned this Jul 24, 2024
@Ex4amp1e Ex4amp1e moved this to In Progress in Release v1.14.0 Jul 25, 2024
@zolug
Copy link
Author

zolug commented Jul 25, 2024

Btw, I don't think my idea to simply fetch the list of available NSEs with a second Find call (without watch) could work without problems.

That's because IMHO the user can't really tell if it received all NSEs e.g. when the server had closed the user stream because the server lost its connection to the registry. The helper function registry.ReadNetworkServiceChannel simply transfers responses from the registry side stream but is not bothered by informing the user about non EOF error. (Also, no data count information is forwarded at the start.)
So, it's hard to tell if an NSE have been really unregistered.

NSMgr crash or NSMgr POD delete seems to be covered, as the user side stream at least will get a non EOF error in such cases.
(But maybe in case of graceful termination this is mere luck due to the current implementation details.)

@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Aug 19, 2024
denis-tingaikin added a commit that referenced this issue Sep 8, 2024
Signed-off-by: denis-tingaikin <denis.tingajkin@xored.com>
@denis-tingaikin denis-tingaikin moved this from Blocked to In Progress in Release v1.14.0 Sep 12, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Sep 13, 2024
@denis-tingaikin denis-tingaikin moved this from Blocked to In Progress in Release v1.14.0 Sep 18, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Under review in Release v1.14.0 Sep 24, 2024
@denis-tingaikin denis-tingaikin moved this from Under review to Moved to next release in Release v1.14.0 Sep 24, 2024
@github-project-automation github-project-automation bot moved this from Moved to next release to Done in Release v1.14.0 Sep 24, 2024
@NikitaSkrynnik NikitaSkrynnik moved this from Done to In Progress in Release v1.14.1 Oct 9, 2024
@NikitaSkrynnik NikitaSkrynnik moved this from In Progress to Done in Release v1.14.1 Oct 10, 2024
@NikitaSkrynnik NikitaSkrynnik closed this as completed by moving to Done in Release v1.14.1 Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants