SHOW VITESS_SHARDS intermittently returns an empty list of shards #5038
Comments
I think the main issue is that this hits Zookeeper and we do run We should ideally be caching this instead.
Actually, following the codepath even further, I can see now that this does not hit Zookeeper directly for every request. I think that's great, but it doesn't explain why sometimes
I can reproduce this locally easily against a standalone vtgate and the example DB:
Running this gives random failures, typically around 2 per 10,000 selects on my machine, though the rate varies, e.g.:
I will dig further.
This is definitely related to the ResilientServer cache refresh. If I bump the vtgate srv_topo_cache_ttl to something high (say 60s), I can run the testcase without any failures.
So there's a race in the default configuration, even when ZK isn't being slow, because the TTL and the refresh period are set to the same value (1 second). One solution is to make sure the refresh period is lower than the TTL (e.g. TTL 2s, leaving refresh at its default of 1s). We could consider changing the defaults to this. Another fix would be to eliminate the race by resetting entry.lastQueryTime in resilient_server.go to entry.insertionTime when the topology has been refreshed.
Having a fix for this would be really nice. We're using a workaround for this, but it's quite unsatisfactory and will likely cause issues around shard splits.
…uested, instead of zeroing it out. Signed-off-by: Jacques Grove <aquarapid@gmail.com>
SHOW VITESS_SHARDS intermittently returns an empty list of shards.
I think the issue can be traced down to these lines of code at go/vt/vtgate/executor.go:800:
As you can see, errors are ignored. So if there is, for example, an intermittent connectivity problem with the topology server, we would return an empty list rather than signaling an error.
I think the correct solution is to catch and ignore specific errors but in general pass the errors on.