Tracking CheckMySQL and Query Engine history #11885
The story is going to start at the beginning of release-6.0 (as it stands now). In this release the
Requests handling.
It is worth noting here that the requests from CheckMySQL Key Takeaways
UPDATE
This is great. Regarding
This is again for 2PC
A major change that happened pertaining to this code was in #6396. Let's take a look at release-7.0 which includes these changes next. This PR intends to accomplish 2 things -
The newly created state-manager changed the way it manages the state. Instead of keeping just the current state of the tablet, it stores two of them, the correct state

The states themselves are restricted. The states tabletserver's

Instead of the previous architecture of having different actions which themselves also checked for the tablet type inside, we now just have 4 separate functions, one for each combination of the requested serving state and the tablet type -
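To make the two-state idea and the four-way dispatch concrete, here is a minimal sketch in Go. It is not the code from #6396: the type, field, and function names (`servingState`, `setState`, `servePrimary`, `disconnect`, and so on) are assumptions chosen for illustration, and each transition is reduced to recording its name and the state it reaches.

```go
package main

import "fmt"

// servingState is a simplified stand-in for the restricted set of states.
type servingState int

const (
	stateNotConnected servingState = iota
	stateNotServing
	stateServing
)

// stateManager tracks both the requested state and the state actually reached.
// All names here are hypothetical.
type stateManager struct {
	wantState servingState
	state     servingState
	calls     []string // records which transition ran, for illustration only
}

// setState stores the wanted state, then dispatches on the combination of
// requested serving state and tablet type.
func (sm *stateManager) setState(isPrimary bool, want servingState) {
	sm.wantState = want
	switch {
	case want == stateNotConnected:
		sm.disconnect()
	case want == stateServing && isPrimary:
		sm.servePrimary()
	case want == stateServing:
		sm.serveNonPrimary()
	case isPrimary:
		sm.unservePrimary()
	default:
		sm.unserveNonPrimary()
	}
}

// reach is a placeholder for the real work of a transition.
func (sm *stateManager) reach(name string, reached servingState) {
	sm.calls = append(sm.calls, name)
	sm.state = reached
}

func (sm *stateManager) servePrimary()      { sm.reach("servePrimary", stateServing) }
func (sm *stateManager) serveNonPrimary()   { sm.reach("serveNonPrimary", stateServing) }
func (sm *stateManager) unservePrimary()    { sm.reach("unservePrimary", stateNotServing) }
func (sm *stateManager) unserveNonPrimary() { sm.reach("unserveNonPrimary", stateNotServing) }
func (sm *stateManager) disconnect()        { sm.reach("disconnect", stateNotConnected) }

func main() {
	sm := &stateManager{}
	sm.setState(true, stateServing)
	sm.setState(false, stateNotServing)
	sm.setState(false, stateNotConnected)
	fmt.Println(sm.calls, sm.wantState == sm.state)
}
```

Keeping `wantState` separate from `state` is what lets a later retry compare where the tablet is with where it was asked to be.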
There is also one case for StateNotConnected. In this case we call

Let's look at some of the common functions called in more detail -
Requests Handling
CheckMySQL
Going back to serving from NotConnected
🚨🚨🚨 BUG ALERT
#7011 makes a few more changes to how query killing works and to transitioning in the transaction engine, and fixes a race in the Begin code path. I am not going into too much detail for this since it's not super relevant to our discussion.
cc - @harshit-gangal @deepthi All in a day's work ☝️. Now we know what the intention was and what part is the actual bug. We want to keep the query engine open. But we need to block the queries in the

flowchart LR
state1(WantState=Serving\nState=Serving)-->state2(Kill Queries\nClose Transaction Engine)
state2-->state3(Wait For Requests to be empty)
state3-->state4(WantState=Serving\nState=NotConnected)
state4-->state5(retryTransition)
This is obviously wrong because, during the wait for requests, we aren't preventing new requests from coming in.

flowchart LR
state1(WantState=Serving\nState=Serving)-->state1b(WantState=NotConnected\nState=Serving)
state1b-->state2(Kill Queries\nClose Transaction Engine)
state2-->state3(Wait For Requests to be empty)
state3-->state4(WantState=NotConnected\nState=NotConnected)
state4-->state4b{WantState=Serving\nState=NotConnected}
state4b-->state5(retryTransition)
I am not sure about the state marked in a diamond. The intent for the retry in the original PR is still unclear to me. Do we want to retry getting into the NotConnected state or do we actually want to spawn a goroutine that tries to take us back to the serving state?
@GuptaManan100 This is really good. If I have understood correctly, then the change is from
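The ordering in the second flowchart can be sketched in Go as follows. This is a simplified illustration, not the vttablet implementation: `closeGate`, `drain`, `startRequest`, and `endRequest` are hypothetical names, and killing queries and closing the transaction engine are reduced to a comment. The point it shows is that flipping the wanted state before waiting is what stops new requests from arriving during the wait. The retry step from the diamond state is left out because its intent is the open question above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type servingState int

const (
	stateNotConnected servingState = iota
	stateServing
)

var errNotServing = errors.New("tablet is not serving")

// stateManager is a hypothetical, reduced model of the request gate.
type stateManager struct {
	mu        sync.Mutex
	wantState servingState
	state     servingState
	requests  sync.WaitGroup
}

// startRequest admits a request only if the tablet is serving and wants to stay serving.
func (sm *stateManager) startRequest() error {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	if sm.state != stateServing || sm.wantState != stateServing {
		return errNotServing
	}
	sm.requests.Add(1)
	return nil
}

// endRequest marks one admitted request as finished.
func (sm *stateManager) endRequest() { sm.requests.Done() }

// closeGate flips the wanted state first, so no new request is admitted from here on.
func (sm *stateManager) closeGate() {
	sm.mu.Lock()
	sm.wantState = stateNotConnected
	sm.mu.Unlock()
}

// drain waits for in-flight requests and only then records the reached state.
func (sm *stateManager) drain() {
	// Killing queries and closing the transaction engine would happen here.
	sm.requests.Wait()
	sm.mu.Lock()
	sm.state = stateNotConnected
	sm.mu.Unlock()
}

func main() {
	sm := &stateManager{wantState: stateServing, state: stateServing}
	fmt.Println("first request:", sm.startRequest())
	sm.closeGate()
	fmt.Println("request during shutdown:", sm.startRequest())
	sm.endRequest() // the in-flight request finishes
	sm.drain()
	fmt.Println("reached NotConnected:", sm.state == stateNotConnected)
}
```

Because every `Add` happens under the mutex and only while the wanted state is still serving, no request can be admitted once `closeGate` has returned, so the wait in `drain` is guaranteed to see a count that can only go down.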
Another thing I would like to point out is that in the state_manager

    func (sm *stateManager) closeAll() {
        defer close(sm.setTimeBomb())
        ....
    }

That ensures that close should happen within this certain period of time; otherwise it will crash vttablet, which will eventually lead to a restart of the vttablet pod or VTOrc kicking in.

Currently, it is disabled as the default time is set to
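For readers unfamiliar with the `defer close(sm.setTimeBomb())` idiom, here is a self-contained sketch of the pattern in Go. It is an illustration under assumptions, not the vttablet code: the grace period and the `onExpire` callback are parameters invented for this sketch (the real code crashes the process instead of calling a callback), and treating a non-positive grace period as "disabled" is a choice made here. The key detail is that the argument to a deferred call is evaluated immediately, so the bomb is armed on entry and defused by `close` on return.

```go
package main

import (
	"fmt"
	"time"
)

// setTimeBomb arms a timer that calls onExpire unless the returned channel
// is closed first. In this sketch a non-positive grace period disables it.
func setTimeBomb(grace time.Duration, onExpire func()) chan struct{} {
	done := make(chan struct{})
	if grace <= 0 {
		return done
	}
	go func() {
		tmr := time.NewTimer(grace)
		defer tmr.Stop()
		select {
		case <-tmr.C:
			onExpire() // shutdown took too long
		case <-done:
			// defused: the work finished in time
		}
	}()
	return done
}

// closeAll arms the bomb on entry and defuses it when work returns.
func closeAll(grace time.Duration, work, onExpire func()) {
	defer close(setTimeBomb(grace, onExpire))
	work()
}

// bombFires reports whether the bomb went off for the given durations in milliseconds.
func bombFires(graceMs, workMs int) bool {
	grace := time.Duration(graceMs) * time.Millisecond
	fired := make(chan struct{}, 1)
	closeAll(
		grace,
		func() { time.Sleep(time.Duration(workMs) * time.Millisecond) },
		func() { fired <- struct{}{} },
	)
	// Wait one more grace period so a bomb that was not defused would show up.
	time.Sleep(grace)
	return len(fired) == 1
}

func main() {
	fmt.Println("fast close fires bomb:", bombFires(200, 10))
	fmt.Println("slow close fires bomb:", bombFires(20, 150))
}
```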
There have been multiple failures in CheckMySQL code that cause it to get stuck. This issue documents the history of all the changes that have been made to this codepath to better understand the intent and the evolution of the code over time.