Bug Report: A dropped semi-sync ACK can lead to a primary tablet being indefinitely stuck #17749

Open
GuptaManan100 opened this issue Feb 12, 2025 · 1 comment · May be fixed by #17763

@GuptaManan100
Member

Overview of the Issue

Consider an unsharded keyspace that is used only to store a single sequence table.
If the semi-sync ACKs from a replica are lost when the primary writes to this sequence table, the primary gets stuck indefinitely.

For any other keyspace, a single write losing its semi-sync ACK is not an issue: the next write gets ACKed, and because ACKs are cumulative, that also unblocks the previous write.

In this case, however, once the first write gets stuck, no other write can go through, because the sequence table has only one row. All subsequent writes conflict with the stuck write and are blocked until they time out.
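
To make the difference between the two cases concrete, here is a toy model of the behaviour described above. It is only a sketch: the `Primary` class, its fields, and the status strings are hypothetical names made up for illustration, and it assumes that a write waiting for its ACK keeps holding its row lock. It is not Vitess or MySQL code.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy model of a semi-sync primary (hypothetical, for illustration only)."""

    next_pos: int = 1
    # row -> binlog position of the write that holds the row lock
    # while it waits for a semi-sync ACK (assumption of this model)
    waiting: dict = field(default_factory=dict)

    def write(self, row: str, ack_lost: bool = False) -> str:
        if row in self.waiting:
            # A conflicting write never reaches the binlog, so it cannot
            # produce a new event for the replica to ACK.
            return "blocked_on_row_lock"
        pos = self.next_pos
        self.next_pos += 1
        if ack_lost:
            self.waiting[row] = pos
            return "waiting_for_ack"
        # ACKs are cumulative: an ACK for `pos` covers every earlier position,
        # so all writes waiting at or below `pos` are released.
        self.waiting = {r: p for r, p in self.waiting.items() if p > pos}
        return "committed"


# Keyspace with several rows: a later write to another row unblocks the first.
multi = Primary()
print(multi.write("row_a", ack_lost=True))
print(multi.write("row_b"))
print(multi.waiting)

# Single-row sequence table: every later write hits the same locked row.
seq = Primary()
print(seq.write("seq", ack_lost=True))
print(seq.write("seq"))
```

In the multi-row case the second write releases the first; in the single-row case there is never a second binlog event, so nothing can release the stuck write.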

Reproduction Steps

  1. Set up a sharded keyspace that uses a single sequence table stored in an unsharded keyspace.
  2. Manipulate iptables (Linux) or the pf configuration (macOS) so that the primary loses the semi-sync ACKs for a write to that table.
  3. Observe that the keyspace stays stuck forever, even after the network disruption is resolved.

Binary Version

main

Operating System and Environment details

-

Log Fragments

@deepthi changed the title from "Bug Report: A dropped semi-sync ACK can lead to a primary tablet being indefenitely stuck" to "Bug Report: A dropped semi-sync ACK can lead to a primary tablet being indefinitely stuck" on Feb 13, 2025
@GuptaManan100
Member Author

The proposed solution is to introduce a new semi-sync monitor that checks the variable Rpl_semi_sync_source_wait_sessions. If it finds that the primary is blocked on semi-sync ACKs, it starts writing to an internal Vitess table.
If this set of writes unblocks the primary, everything is fine. If it does not within a configurable amount of time, we signal VTOrc that the primary is no longer accepting writes, and VTOrc runs an ERS (EmergencyReparentShard).

The work for this fix has started in #17763.
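
The control flow of the monitor proposed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the implementation in #17763: `monitor_semi_sync` and all of its parameters are hypothetical names, the three callables stand in for reading the status variable, writing to the internal table, and notifying VTOrc, and the recovery write is assumed to return immediately rather than block on its own ACK.

```python
import time
from typing import Callable


def monitor_semi_sync(
    waiting_sessions: Callable[[], int],
    write_heartbeat: Callable[[], None],
    report_stuck: Callable[[], None],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one monitoring round and return 'healthy', 'unblocked' or 'stuck'.

    waiting_sessions: hypothetical wrapper returning the current value of
        Rpl_semi_sync_source_wait_sessions.
    write_heartbeat: hypothetical non-blocking write to an internal table.
    report_stuck: hypothetical hook that tells VTOrc the primary is stuck.
    """
    if waiting_sessions() == 0:
        # Nothing is waiting on a semi-sync ACK, so there is nothing to do.
        return "healthy"

    deadline = clock() + timeout_s
    while clock() < deadline:
        # A fresh write produces a new binlog event; its ACK is cumulative
        # and should release the writes that are already waiting.
        write_heartbeat()
        sleep(interval_s)
        if waiting_sessions() == 0:
            return "unblocked"

    # The writes did not help within the configured time: escalate.
    report_stuck()
    return "stuck"
```

The clock and sleep functions are injectable only so that the loop can be exercised without real waiting. The timeout corresponds to the "configurable amount of time" mentioned above.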
