Bug Report: Race condition can cause queries to fail when vtgate starts up #16656

Closed
GuptaManan100 opened this issue Aug 27, 2024 · 0 comments · Fixed by #16655

GuptaManan100 commented Aug 27, 2024

Overview of the Issue

A race condition was observed that prevents some queries from being buffered when vtgate starts up.
The order of operations is as follows:

  1. HealthCheck is created and starts receiving updates from the tablets.
  2. The Keyspace Event Watcher is initialized and subscribes to the updates from the healthcheck, but it processes these updates asynchronously.
  3. vtgate waits until the healthcheck has received updates showing that all the primary tablets are serving.
  4. vtgate then starts accepting traffic, and one primary becomes non-serving (because of a PRS).
  5. The keyspace event watcher hasn't finished processing the updates from the healthcheck, so PrimaryIsNotServing returns nil, false because it doesn't have the shard information stored.
  6. This causes the query to be dropped, and the user gets an error message stating that no healthy tablet is available.

The ideal behaviour is for the queries to be buffered. The problem is caused by a race between the keyspace event watcher processing the first healthcheck updates it receives and vtgate starting to accept queries.
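
To make the interleaving concrete, here is a minimal toy model of the sequence above. It is not Vitess code: the `update`, `watcher`, `route`, and `simulate` names are hypothetical, and `primaryIsNotServing` is simplified to return a single bool instead of the `nil, false` pair mentioned in step 5. The assumption being illustrated is that a shard the watcher has not processed yet looks the same as a shard whose primary is serving, so the query skips buffering.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical healthcheck message: is the shard's primary serving?
type update struct {
	shard   string
	serving bool
}

// watcher is a toy stand-in for the keyspace event watcher. It only knows
// about shards whose updates it has already processed.
type watcher struct {
	mu     sync.Mutex
	shards map[string]bool // shard -> primary is serving
}

func (w *watcher) process(u update) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.shards[u.shard] = u.serving
}

// primaryIsNotServing returns false for a shard it has no information about,
// which is indistinguishable from "the primary is serving".
func (w *watcher) primaryIsNotServing(shard string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	serving, known := w.shards[shard]
	return known && !serving
}

// route buffers only when the watcher says the primary is not serving.
// primaryServing is the real state of the primary at query time.
func route(w *watcher, primaryServing bool, shard string) string {
	if w.primaryIsNotServing(shard) {
		return "buffered"
	}
	if primaryServing {
		return "executed"
	}
	return "error: no healthy tablet available"
}

// simulate replays the steps from the report. When waitForWatcher is false,
// the watcher goroutine is held back until after the query, which forces the
// unlucky interleaving deterministically.
func simulate(waitForWatcher bool) string {
	const shard = "-80"
	w := &watcher{shards: map[string]bool{}}

	// Updates queued for the watcher: primary serving, then non-serving (PRS).
	updates := make(chan update, 2)
	updates <- update{shard, true}
	updates <- update{shard, false}
	close(updates)

	start := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-start // the watcher processes updates asynchronously
		for u := range updates {
			w.process(u)
		}
	}()

	if waitForWatcher {
		close(start)
		<-done
	}
	// The query arrives while the primary is actually non-serving.
	result := route(w, false, shard)
	if !waitForWatcher {
		close(start)
		<-done
	}
	return result
}

func main() {
	fmt.Println("watcher lagging:  ", simulate(false))
	fmt.Println("watcher caught up:", simulate(true))
}
```

In this model the query fails with the "no healthy tablet available" error when the watcher lags, and is buffered once the watcher has caught up before traffic is accepted. Waiting for the watcher is shown only as one way to remove the race in the toy model; it is not a description of the actual change in #16655.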

Reproduction Steps

Run a cluster with vitess-operator that has 2 vtgates and at least 3 tablets. While running continuous query traffic, trigger a rolling update of the entire cluster (the easiest way to do this is to change the vitess image version). Repeat until the error is seen.

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response
