Overview of the Issue
A race condition prevents some queries from being buffered when vtgate starts up.
The order of operations is as follows:
1. The HealthCheck is created and starts receiving updates from the tablets.
2. The Keyspace Event Watcher is initialized and subscribes to updates from the healthcheck, but it processes these updates asynchronously.
3. vtgate waits until the healthcheck has received updates showing that all primary tablets are serving.
4. vtgate then starts accepting traffic, and shortly afterwards one primary becomes non-serving because of a PlannedReparentShard (PRS).
5. The Keyspace Event Watcher hasn't finished processing the updates from the healthcheck, so PrimaryIsNotServing returns nil, false because it doesn't have the shard information stored yet.
6. The query is therefore dropped instead of buffered, and the user gets an error message stating no healthy tablet available.
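The sequence above can be reproduced in a small, deterministic model. This is only an illustrative sketch, not the vtgate code: the `watcher`, `update`, `processPending`, and `routeQuery` names are hypothetical, and the asynchronous processing is replaced by an explicit `processPending` call so that the ordering is easy to see.

```go
package main

import "fmt"

// shardState is a hypothetical stand-in for the per-shard record
// that the keyspace event watcher keeps.
type shardState struct {
	serving bool
}

// update is a hypothetical healthcheck notification for one shard primary.
type update struct {
	shard   string
	serving bool
}

// watcher is a simplified model of the keyspace event watcher:
// updates are queued first and only applied when processPending runs.
type watcher struct {
	shards  map[string]*shardState
	updates chan update
}

func newWatcher() *watcher {
	return &watcher{
		shards:  make(map[string]*shardState),
		updates: make(chan update, 16),
	}
}

// processPending applies every queued update. In the scenario from the
// issue this work happens asynchronously and may lag behind the healthcheck.
func (w *watcher) processPending() {
	for {
		select {
		case u := <-w.updates:
			w.shards[u.shard] = &shardState{serving: u.serving}
		default:
			return
		}
	}
}

// primaryIsNotServing mirrors the behaviour described in the issue:
// it returns nil, false when the shard is unknown to the watcher.
func (w *watcher) primaryIsNotServing(shard string) (*shardState, bool) {
	st, known := w.shards[shard]
	if !known || st.serving {
		return nil, false
	}
	return st, true
}

// routeQuery models a query that arrives while the primary is down for a PRS.
// It can only be buffered if the watcher already knows the primary is not serving.
func routeQuery(w *watcher, shard string) string {
	if _, notServing := w.primaryIsNotServing(shard); notServing {
		return "buffered"
	}
	return "error: no healthy tablet available"
}

func main() {
	w := newWatcher()
	// The healthcheck has already seen both events; the watcher has not.
	w.updates <- update{shard: "-80", serving: true}
	w.updates <- update{shard: "-80", serving: false}

	fmt.Println(routeQuery(w, "-80")) // watcher is behind: query is dropped
	w.processPending()
	fmt.Println(routeQuery(w, "-80")) // watcher caught up: query is buffered
}
```

In this model the only difference between the dropped query and the buffered one is whether the watcher has applied the queued updates before the query arrives.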
The ideal behaviour is for the queries to be buffered. The problem is a race between the keyspace event watcher processing the first healthcheck updates it receives and vtgate starting to accept queries.
Reproduction Steps
Run a cluster with vitess-operator that has 2 vtgates and at least 3 tablets. While running continuous query traffic, trigger a rolling update of the entire cluster (the easiest way to do this is to change the vitess image version). Repeat until the error is seen.
Binary Version
main
Operating System and Environment details
-
Log Fragments
No response