Overview of the Issue
Consider an unsharded keyspace that is used only to store a single sequence table.
If the semi-sync ACKs from a replica are lost for any reason while the primary is writing to this sequence table, the primary gets stuck indefinitely.
For any other keyspace, a single write losing its semi-sync ACKs is not an issue: the next write receives its ACKs, and because ACKs are cumulative, they unblock the previous write as well.
In this case, however, the sequence table has only one row, so once the first write is stuck, no other write can go through. Every subsequent write conflicts with the stuck write and is blocked until it times out.
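The difference between the two cases can be illustrated with a toy model. This is a minimal sketch and not Vitess or MySQL code: the `primary` type, its fields, and the `write` and `ack` helpers are all hypothetical, and it assumes a write holds its row lock while it waits for an ACK.

```go
package main

import "fmt"

// pending is a write that has been handed a binlog position and is still
// waiting for a semi-sync ACK. All names in this file are hypothetical.
type pending struct {
	pos int
	row string
}

// primary is a toy model of a semi-sync primary.
type primary struct {
	nextPos int             // position handed to the next write
	locked  map[string]bool // rows held by writes still waiting for an ACK
	waiting []pending       // writes that have not been acknowledged yet
}

func newPrimary() *primary {
	return &primary{nextPos: 1, locked: map[string]bool{}}
}

// ack models a cumulative ACK: acknowledging pos releases every waiting
// write at or below pos, not just the write that triggered it.
func (p *primary) ack(pos int) {
	var still []pending
	for _, w := range p.waiting {
		if w.pos <= pos {
			delete(p.locked, w.row)
		} else {
			still = append(still, w)
		}
	}
	p.waiting = still
}

// write attempts a single-row write. ackArrives says whether the replica's
// ACK for this write reaches the primary.
func (p *primary) write(row string, ackArrives bool) string {
	if p.locked[row] {
		// The write never gets a position, so no new ACK can be produced.
		return "blocked on row lock"
	}
	pos := p.nextPos
	p.nextPos++
	p.locked[row] = true
	p.waiting = append(p.waiting, pending{pos, row})
	if ackArrives {
		p.ack(pos)
		return "committed"
	}
	return "waiting for ACK"
}

func main() {
	// A keyspace with many rows: the second write's ACK frees the first.
	busy := newPrimary()
	fmt.Println(busy.write("row-a", false))
	fmt.Println(busy.write("row-b", true))
	fmt.Println("still waiting:", len(busy.waiting))

	// A single-row sequence table: the second write cannot even start.
	seq := newPrimary()
	fmt.Println(seq.write("seq", false))
	fmt.Println(seq.write("seq", true))
	fmt.Println("still waiting:", len(seq.waiting))
}
```

In the first scenario the later write to a different row produces an ACK that also releases the earlier one. In the second, the later write is refused at the row lock, so nothing ever advances the acknowledged position.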
Reproduction Steps
Set up a sharded keyspace that uses a single sequence table stored in an unsharded keyspace.
Manipulate iptables (Linux) or the pf configuration (macOS) so that the primary loses the semi-sync ACKs for a write to that table.
Observe that the keyspace stays stuck even after the network disruption is resolved.
Binary Version
main
Operating System and Environment details
-
Log Fragments
deepthi changed the title from "Bug Report: A dropped semi-sync ACK can lead to a primary tablet being indefenitely stuck" to "Bug Report: A dropped semi-sync ACK can lead to a primary tablet being indefinitely stuck" on Feb 13, 2025
The proposed solution is to introduce a new semi-sync monitor that checks the MySQL status variable Rpl_semi_sync_source_wait_sessions. If it finds that the primary is blocked waiting on semi-sync ACKs, it starts issuing writes to an internal Vitess table.
If this set of writes unblocks the primary, then everything is fine. If it does not within a configurable amount of time, the monitor signals VTOrc that the primary is no longer accepting writes, and VTOrc runs an ERS (EmergencyReparentShard).
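The decision flow of such a monitor could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual Vitess implementation: the `probe` interface, the `check` function, the `outcome` values, and the `fakePrimary` stand-in are all hypothetical, and the configurable time budget is modeled as a number of attempts multiplied by an interval. A real monitor would also need to put a timeout on each write, since the write itself can block.

```go
package main

import (
	"fmt"
	"time"
)

// probe abstracts the two things the monitor needs from the primary.
// Both methods are hypothetical stand-ins for real database calls.
type probe interface {
	// waitSessions returns the value of Rpl_semi_sync_source_wait_sessions.
	waitSessions() (int, error)
	// writeHeartbeat performs one write to an internal table.
	writeHeartbeat() error
}

type outcome string

const (
	healthy   outcome = "healthy"   // nothing was waiting on ACKs
	unblocked outcome = "unblocked" // the extra writes drained the waiters
	needsERS  outcome = "needs ERS" // still blocked after the budget ran out
)

// check runs one monitoring pass. It issues up to attempts writes, pausing
// interval between them, and reports whether the waiters went away.
func check(p probe, attempts int, interval time.Duration) (outcome, error) {
	n, err := p.waitSessions()
	if err != nil {
		return "", err
	}
	if n == 0 {
		return healthy, nil
	}
	for i := 0; i < attempts; i++ {
		if err := p.writeHeartbeat(); err != nil {
			return "", err
		}
		time.Sleep(interval)
		if n, err = p.waitSessions(); err != nil {
			return "", err
		}
		if n == 0 {
			return unblocked, nil
		}
	}
	return needsERS, nil
}

// fakePrimary simulates a primary for demonstration purposes only.
type fakePrimary struct {
	waiting     int // sessions currently waiting on an ACK
	clearsAfter int // writes needed to drain the waiters; negative means never
	writes      int // heartbeat writes issued so far
}

func (f *fakePrimary) waitSessions() (int, error) { return f.waiting, nil }

func (f *fakePrimary) writeHeartbeat() error {
	f.writes++
	if f.clearsAfter >= 0 && f.writes >= f.clearsAfter {
		f.waiting = 0
	}
	return nil
}

func main() {
	cases := []*fakePrimary{
		{waiting: 0, clearsAfter: -1}, // not blocked at all
		{waiting: 3, clearsAfter: 2},  // recovers after two writes
		{waiting: 3, clearsAfter: -1}, // never recovers
	}
	for _, f := range cases {
		got, err := check(f, 5, time.Millisecond)
		fmt.Println(got, err)
	}
}
```

Only the last outcome would be reported to VTOrc. The first two leave the primary in place.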