Resharding: SwitchWrites leaves source masters in an unhealthy state #6822
Comments
I will be discussing it with sugu today and will update you asap.
The race you mentioned does exist. Here is how we are thinking of solving this:
Thoughts?
I agree this looks like it will fix the race, but I also want to be sure to address the original health check problem I was trying to fix when I noticed this race. The problem is, the vitess operator (or any other system monitoring vitess) needs a way to ask the tablet, "Are you ok?" in a simple way. Currently we use So if we decide to leave the query service disabled to fix the race, I propose we should find a way to make
@deepthi, we were wondering about this yesterday. It turns out the issue is that the operator uses /healthz, which reports unhealthy simply because the tablet is not serving. So the proposed solution will resolve the race but not the original issue we wanted to fix: spurious health check error reports from the source. I assume it will break other services if we change this logic based on DisableQueryService. One solution might be to send an additional X-header in the HTTP 500 error response and have the operator take that into account. But that sounds hacky: is there a cleaner solution? I guess this is a common issue that others using the new Reshard workflows will encounter.
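To make the X-header idea concrete, here is a rough sketch of what the monitoring side could look like. It is only an illustration under assumptions: the header name `X-Not-Serving-Reason`, its value, and the `probe` and `fakeTablet` helpers are all hypothetical and do not exist in Vitess or the operator. The fake server stands in for a tablet's /healthz endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HYPOTHETICAL header name and value, invented for this sketch only.
// Vitess does not define them.
const (
	reasonHeader   = "X-Not-Serving-Reason"
	expectedReason = "query-service-disabled"
)

// Status is the monitor's view of a tablet after one probe.
type Status int

const (
	Healthy            Status = iota // endpoint returned 200
	NotServingExpected               // 500, but flagged as intentionally not serving
	Unhealthy                        // anything else, including transport errors
)

func (s Status) String() string {
	switch s {
	case Healthy:
		return "healthy"
	case NotServingExpected:
		return "not serving (expected)"
	default:
		return "unhealthy"
	}
}

// probe issues one GET against url and classifies the response.
func probe(client *http.Client, url string) Status {
	resp, err := client.Get(url)
	if err != nil {
		return Unhealthy
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusOK:
		return Healthy
	case resp.StatusCode == http.StatusInternalServerError &&
		resp.Header.Get(reasonHeader) == expectedReason:
		return NotServingExpected
	default:
		return Unhealthy
	}
}

// fakeTablet starts a test server that answers every request with the
// given status code and, if reason is non-empty, the hypothetical header.
func fakeTablet(code int, reason string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if reason != "" {
			w.Header().Set(reasonHeader, reason)
		}
		w.WriteHeader(code)
	}))
}

func main() {
	srv := fakeTablet(http.StatusInternalServerError, expectedReason)
	defer srv.Close()
	fmt.Println(probe(srv.Client(), srv.URL+"/healthz"))
}
```

The downside is visible in the sketch: every monitor has to learn about the extra header, and any client that only looks at the status code still sees a failure.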
Documenting offline discussions so far. A decision is still pending, requires a bit more investigation. -- However, if we go back to pre-7.0 (before the tabletmanager rewrite), In summary, a So we can claim that changing Where do we go from here?
We should fix it to behave the same as it used to pre-7.0, i.e. report
This is under the naive assumption that if
For 1 and 2, we clearly want the tablet report The other option is to be conservative and make exceptions for the known conditions. Then we would do something like this:
This is unsatisfying and no more future proof. If we do introduce some other I would like us to figure out exactly what we want to do for 3,4,5 in the list above before proceeding with a fix. We can update this issue once we have answers to these questions. Note:
StateManager has
This looks like a clean and elegant solution. Thanks @sougou!
@rohit-nayak-ps I'm not making #7090 close this issue, assuming that you still want to fix the race condition.
I'm closing this as stale for now as I'm not sure there's anything left to do here today in v14/v15. Please let me know if I missed or misunderstood something and we can re-open it. Thanks!
After completing a resharding workflow with SwitchWrites, the master tablets of the original source shards are left in an unhealthy state with Reason: "TabletControl.DisableQueryService set", despite the fact that there are no tablet controls in the shard record or SrvKeyspace. Manually issuing a RefreshState command to these tablets makes them healthy again.
I thought at first that the fix might be to have SwitchWrites call RefreshState on the source shard masters after it calls MigrateServedType, which removes all tablet controls from the SrvKeyspace:
vitess/go/vt/topo/srv_keyspace.go
Lines 511 to 514 in bbd31e2
However, I'm wondering if that would exacerbate a potential race condition. If we allow a source master to start serving again, what if there's still a vtgate somewhere that sends writes to it because that vtgate hasn't seen the updated routing info in SrvKeyspace yet?
The scenario I'm thinking about is:
Is there anything that prevents this race? If so, then it should be safe to automate step 4 to leave the source shards in a healthy state after SwitchWrites. If not, then the potential for this race to occur already exists since you could replace step 4 with, "a source master crashes and restarts," but automating the RefreshState would make the race more likely.
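To illustrate the concern, here is a toy model of the race. It is not Vitess code: the `shard`, `gateway`, and `simulate` names are hypothetical, and the model only assumes that a vtgate-like router caches its routing target and that a source master resumes serving once it re-reads a topology with no tablet controls.

```go
package main

import "fmt"

// shard is a toy stand-in for a master tablet. HYPOTHETICAL type.
type shard struct {
	name    string
	serving bool
	writes  []string
}

// write accepts the query only while the shard is serving.
func (s *shard) write(q string) error {
	if !s.serving {
		return fmt.Errorf("shard %s: query service disabled", s.name)
	}
	s.writes = append(s.writes, q)
	return nil
}

// gateway is a toy stand-in for a vtgate holding cached routing info.
type gateway struct {
	cachedTarget *shard
}

func (g *gateway) refresh(current *shard) { g.cachedTarget = current }
func (g *gateway) write(q string) error  { return g.cachedTarget.write(q) }

// simulate runs a cutover and returns how many writes landed on the
// source shard afterwards. refreshSourceEarly models refreshing the
// source master's state before every gateway has seen the new routing.
func simulate(refreshSourceEarly bool) int {
	source := &shard{name: "source", serving: true}
	target := &shard{name: "target", serving: false}
	gw := &gateway{cachedTarget: source}

	// Cutover: the source stops serving and the target takes over.
	source.serving = false
	target.serving = true

	if refreshSourceEarly {
		// The source re-reads the topology, finds no tablet controls,
		// and starts serving again.
		source.serving = true
	}

	// The gateway still has stale routing and sends a write.
	_ = gw.write("insert into t values (1)")

	// The gateway finally picks up the new routing.
	gw.refresh(target)
	_ = gw.write("insert into t values (2)")

	return len(source.writes)
}

func main() {
	fmt.Println("stray writes with early refresh:", simulate(true))
	fmt.Println("stray writes without refresh:", simulate(false))
}
```

In this model the stale write is rejected as long as the source keeps its query service disabled, and it is silently accepted on the old shard if the source resumes serving first.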