
Resharding: SwitchWrites leaves source masters in an unhealthy state #6822

Closed
enisoc opened this issue Oct 5, 2020 · 10 comments

@enisoc
Member

enisoc commented Oct 5, 2020

After completing a resharding workflow with SwitchWrites, the master tablets of the original source shards are left in an unhealthy state with Reason: "TabletControl.DisableQueryService set", despite the fact that there are no tablet controls in the shard record or SrvKeyspace. Manually issuing a RefreshState command to these tablets makes them healthy again.

I thought at first that the fix might be to have SwitchWrites call RefreshState on the source shard masters after it calls MigrateServedType, which removes all tablet controls from the SrvKeyspace:

// We are finishing the migration, cleaning up tablet controls from the srvKeyspace
if tabletType == topodatapb.TabletType_MASTER {
partition.ShardTabletControls = nil
}
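
For illustration, here is a rough sketch of what that extra step could look like, assuming access to the topo server and a tabletmanager client (the helper name and variables are illustrative, not the actual SwitchWrites code):

	// Hypothetical helper, not actual wrangler code. Assumes the usual imports
	// (context, vitess.io/vitess/go/vt/topo, vitess.io/vitess/go/vt/vttablet/tmclient).
	func refreshSourceMasters(ctx context.Context, ts *topo.Server, sourceShards []*topo.ShardInfo) error {
		tmc := tmclient.NewTabletManagerClient()
		defer tmc.Close()
		for _, si := range sourceShards {
			ti, err := ts.GetTablet(ctx, si.MasterAlias)
			if err != nil {
				return err
			}
			// RefreshState makes the tablet re-read the shard record and SrvKeyspace,
			// notice the tablet controls are gone, and start serving again.
			if err := tmc.RefreshState(ctx, ti.Tablet); err != nil {
				return err
			}
		}
		return nil
	}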

However, I'm wondering if that would exacerbate a potential race condition. If we allow a source master to start serving again, what if there's still a vtgate somewhere that sends writes to it because that vtgate hasn't seen the updated routing info in SrvKeyspace yet?

The scenario I'm thinking about is:

  1. Add DisableQueryService tablet control for MASTER type to source shards. Call RefreshState on source masters to ensure they stop serving writes.
  2. Wait for vreplication to catch up.
  3. Change routing in SrvKeyspace to tell vtgates to use the target shards. Also clear out all tablet controls from the SrvKeyspace.
  4. Call RefreshState on source masters to make them healthy again.
  5. Rogue vtgate sends writes to the source master.

Is there anything that prevents this race? If so, then it should be safe to automate step 4 to leave the source shards in a healthy state after SwitchWrites. If not, then the potential for this race to occur already exists since you could replace step 4 with, "a source master crashes and restarts," but automating the RefreshState would make the race more likely.

@PrismaPhonic
Contributor

@rohit-nayak-ps ?

@rohit-nayak-ps
Contributor

I will be discussing it with sugu today and will update you asap.

@rohit-nayak-ps
Contributor

The race you mentioned does exist. Here is how we are thinking of solving this:

  • In step 3 we do not clear out the DisableQueryService tablet control on the source

  • So at the end of the workflow we leave the source tablets with the DisableQueryService tablet control.

  • Then if rogue vtgates try to access the source tablets, they won't succeed, because the source tabletservers won't be serving.

  • On a reverse workflow we will need to enable the query service on the source. We will be adding a step while switching reads/writes to remove this record on all targets.

  • Since vreplication resides in the tablet manager and runs on the target, DisableQueryService will not affect vreplication.

  • With this approach we do not need to call RefreshMasters on the source, if I understand correctly, since the later steps only change the tablet controls on the target.

Thoughts?

@enisoc
Member Author

enisoc commented Oct 22, 2020

I agree this looks like it will fix the race, but I also want to be sure to address the original health check problem I was trying to fix when I noticed this race.

The problem is, the vitess operator (or any other system monitoring vitess) needs a way to ask the tablet, "Are you ok?" in a simple way. Currently we use /healthz for this. If we leave the query service disabled on the original source tablets, then /healthz will continue to report as unhealthy indefinitely. To the vitess operator (and other monitoring systems) this looks like a problem, but it really isn't.

So if we decide to leave the query service disabled to fix the race, I propose we find a way to make /healthz report healthy when the query service is disabled for this particular reason (the shard is no longer in the serving partition). I'm not sure what's a clean way to do that.

@rohit-nayak-ps
Contributor

@deepthi, we were wondering about this yesterday: it turns out the issue is that the operator uses /healthz, which reports unhealthy based solely on the fact that the tablet is not serving. So the proposed solution will resolve the race but not the original issue we wanted to fix: spurious health check error reports from the source. I assume it would break other services if we change this logic based on DisableQueryService.

One solution might be to send an additional X-header with the HTTP 500 error and have the operator take that into account. But it sounds hacky: is there a cleaner solution? I guess this is a common issue that others using the new Reshard workflows will encounter.
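
To make the X-header idea concrete, here is a hypothetical sketch only (the header name is made up and not an existing Vitess convention; field access and locking are glossed over):

	// Hypothetical: attach the not-serving reason as a response header so that
	// a monitoring system could decide whether the 500 actually indicates a problem.
	if !tsv.sm.IsServing() {
		w.Header().Set("X-Vttablet-Not-Serving-Reason", tsv.sm.reason)
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))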

@deepthi
Member

deepthi commented Nov 2, 2020

Documenting offline discussions so far. A decision is still pending; it requires a bit more investigation.

--
Let us start with what /healthz and /debug/health report today.
The description for healthz simply says "Health Check", whereas for debug/health we have "Query service healthcheck".
The current implementation of healthz requires the tablet to be serving (with the exception of SPARE, see Note), whereas debug/health requires that a simple query be successful (i.e. tests the whole path all the way to mysql and back).

However, if we go back to pre-7.0 (before the tabletmanager rewrite), healthz worked somewhat differently. (Credit to @enisoc for figuring this out.)
It was checking tabletserver's isServing only if shouldBeServing was true. shouldBeServing was false when DisableQueryService was set. SPARE was handled by changeCallback setting DisableQueryService when it sees tablet_type of SPARE.
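
Roughly, the pre-7.0 check was equivalent to something like the following (reconstructed for illustration, not the exact historical code):

	// Only complain about not serving if the tablet was supposed to be serving.
	// shouldBeServing was false when DisableQueryService was set (and SPARE was
	// handled by changeCallback setting DisableQueryService).
	if shouldBeServing && !isServing {
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))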

In summary, a SPARE tablet or a tablet with DisableQueryService set (for instance, because it is the source for a resharding that has completed) would actually report ok on the healthz endpoint.

So we can claim that changing healthz to report not ok for SPARE tablets or for tablets with DisableQueryService set was technically a breaking change in 7.0.

Where do we go from here?
The simple implementation of healthz that we had after the tabletmanager rewrite looked like this:

		if !tsv.sm.IsServing() {
			http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)

We should fix it to behave the same as it did pre-7.0, i.e. report ok if DisableQueryService is set or the tablet is of type SPARE, regardless of IsServing().
One way to accomplish that is to replace the check with the following:

		if tsv.sm.reason == "" && !tsv.sm.IsServing() {

This is under the naive assumption that if reason is set to a non-empty value, then we expect that the tablet is not serving for a valid reason and so we don't care if it is in fact NOT_SERVING.
It is not clear that this assumption is actually correct.
As of today, reason can be one of:

  1. TabletControl.DisableQueryService set
  2. master tablet with filtered replication on
  3. not a serving tablet type(%v)
  4. demotion in progress
  5. service stopped

For 1 and 2, we clearly want the tablet to report ok. The rest are not so clear-cut.
On 3, BACKUP and RESTORE are also non-serving types. Should we report ok during a backup/restore?
What about 4 & 5? What are the conditions under which we would say service stopped?
Might we introduce other values of reason in future? What happens then?

The other option is to be conservative and make exceptions for the known conditions. Then we would do something like this:

		if tsv.sm.Target().TabletType != topodatapb.TabletType_SPARE &&
			tsv.sm.reason != "TabletControl.DisableQueryService set" &&
			tsv.sm.reason != "master tablet with filtered replication on" &&
			!tsv.sm.IsServing() {

This is unsatisfying and no more future-proof. If we do introduce some other reason, we will need to evaluate how it affects healthz.

I would like us to figure out exactly what we want to do for 3, 4, and 5 in the list above before proceeding with a fix. We can update this issue once we have answers to these questions.

Note:
#6908 was an attempt to fix this for SPARE, hence the special handling in the code as it stands today, which reads:

		if tsv.sm.Target().TabletType != topodatapb.TabletType_SPARE && !tsv.sm.IsServing() {

@sougou
Contributor

sougou commented Nov 2, 2020

StateManager has wantState and state. I think healthy could be:

  • If wantState is StateServing, then state must be StateServing.
  • If wantState is not StateServing, then we always declare ourselves healthy?
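
In healthz terms, a minimal sketch of that rule might look like this (assuming the handler can read the StateManager fields directly and ignoring locking for brevity):

	// Report unhealthy only if we want to be serving but are not. If wantState is
	// anything other than StateServing (e.g. because DisableQueryService is set),
	// report ok even though the query service is down.
	if tsv.sm.wantState == StateServing && tsv.sm.state != StateServing {
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))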

@deepthi
Member

deepthi commented Dec 1, 2020

> StateManager has wantState and state. I think healthy could be:
>
>   • If wantState is StateServing, then state must be StateServing.
>   • If wantState is not StateServing, then we always declare ourselves healthy?

This looks like a clean and elegant solution. Thanks @sougou!

@deepthi
Member

deepthi commented Dec 1, 2020

@rohit-nayak-ps I'm not making #7090 close this issue, assuming that you still want to fix the race condition.
We can close it once that is also addressed.

@mattlord
Contributor

I'm closing this as stale for now as I'm not sure there's anything left to do here today in v14/v15.

Please let me know if I missed or misunderstood something and we can re-open it. Thanks!
