
Resharding: SwitchWrites leaves source masters in an unhealthy state #6822

Closed
enisoc opened this issue Oct 5, 2020 · 10 comments

@enisoc
Member

enisoc commented Oct 5, 2020

After completing a resharding workflow with SwitchWrites, the master tablets of the original source shards are left in an unhealthy state with Reason: "TabletControl.DisableQueryService set", despite the fact that there are no tablet controls in the shard record or SrvKeyspace. Manually issuing a RefreshState command to these tablets makes them healthy again.

I thought at first that the fix might be to have SwitchWrites call RefreshState on the source shard masters after it calls MigrateServedType, which removes all tablet controls from the SrvKeyspace:

// We are finishing the migration, cleaning up tablet controls from the srvKeyspace
if tabletType == topodatapb.TabletType_MASTER {
partition.ShardTabletControls = nil
}
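
For illustration, here is a rough sketch of what that extra step could look like, assuming access to the topo server and a tabletmanager client (the helper name and variables are illustrative, not the actual SwitchWrites code):

	// Hypothetical helper, not actual wrangler code. Assumes the usual imports
	// (context, vitess.io/vitess/go/vt/topo, vitess.io/vitess/go/vt/vttablet/tmclient).
	func refreshSourceMasters(ctx context.Context, ts *topo.Server, sourceShards []*topo.ShardInfo) error {
		tmc := tmclient.NewTabletManagerClient()
		defer tmc.Close()
		for _, si := range sourceShards {
			ti, err := ts.GetTablet(ctx, si.MasterAlias)
			if err != nil {
				return err
			}
			// RefreshState makes the tablet re-read the shard record and SrvKeyspace,
			// notice the tablet controls are gone, and start serving again.
			if err := tmc.RefreshState(ctx, ti.Tablet); err != nil {
				return err
			}
		}
		return nil
	}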

However, I'm wondering if that would exacerbate a potential race condition. If we allow a source master to start serving again, what if there's still a vtgate somewhere that sends writes to it because that vtgate hasn't seen the updated routing info in SrvKeyspace yet?

The scenario I'm thinking about is:

  1. Add DisableQueryService tablet control for MASTER type to source shards. Call RefreshState on source masters to ensure they stop serving writes.
  2. Wait for vreplication to catch up.
  3. Change routing in SrvKeyspace to tell vtgates to use the target shards. Also clear out all tablet controls from the SrvKeyspace.
  4. Call RefreshState on source masters to make them healthy again.
  5. Rogue vtgate sends writes to the source master.

Is there anything that prevents this race? If so, then it should be safe to automate step 4 to leave the source shards in a healthy state after SwitchWrites. If not, then the potential for this race to occur already exists since you could replace step 4 with, "a source master crashes and restarts," but automating the RefreshState would make the race more likely.

@PrismaPhonic
Contributor

@rohit-nayak-ps ?

@rohit-nayak-ps
Contributor

I will be discussing it with sugu today and will update you asap.

@rohit-nayak-ps
Contributor

The race you mentioned does exist. Here is how we are thinking of solving this:

  • In step 3 we do not clear out the DisableQueryService tablet control on the source

  • So at the end of the workflow we leave the source tablets with the DisableQueryService tablet control.

  • Then if rogue vtgates try to access the source tablets, they won't succeed, because the source tabletservers won't be serving.

  • On a reverse workflow we will need to enable the query service on the source. We will be adding a step while switching reads/writes to remove this record on all targets.

  • Since vreplication resides in the tablet manager and runs on the target, DisableQueryService will not affect vreplication.

  • With this approach we do not need to call RefreshMasters on the source, if I understand correctly, since the later steps only change the tablet controls on the target.

Thoughts?

@enisoc
Member Author

enisoc commented Oct 22, 2020

I agree this looks like it will fix the race, but I also want to be sure to address the original health check problem I was trying to fix when I noticed this race.

The problem is, the vitess operator (or any other system monitoring vitess) needs a way to ask the tablet, "Are you ok?" in a simple way. Currently we use /healthz for this. If we leave the query service disabled on the original source tablets, then /healthz will continue to report as unhealthy indefinitely. To the vitess operator (and other monitoring systems) this looks like a problem, but it really isn't.

So if we decide to leave the query service disabled to fix the race, I propose we find a way to make /healthz report healthy when the query service is disabled for this particular reason (the shard is no longer in the serving partition). I'm not sure what's a clean way to do that.

@rohit-nayak-ps
Contributor

@deepthi, we were wondering about this yesterday: it turns out the issue is that the operator uses /healthz, which reports unhealthy based solely on the fact that the tablet is not serving. So the proposed solution will resolve the race but not the original issue we wanted to fix: spurious health check error reports from the source. I assume it would break other services if we change this logic based on DisableQueryService.

One solution might be to send an additional X-header with the HTTP 500 error and have the operator take that into account. But it sounds hacky: is there a cleaner solution? I guess this is a common issue that others using the new Reshard workflows will encounter.
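
To make the X-header idea concrete, here is a hypothetical sketch only (the header name is made up and not an existing Vitess convention; field access and locking are glossed over):

	// Hypothetical: attach the not-serving reason as a response header so that
	// a monitoring system could decide whether the 500 actually indicates a problem.
	if !tsv.sm.IsServing() {
		w.Header().Set("X-Vttablet-Not-Serving-Reason", tsv.sm.reason)
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))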

@deepthi
Member

deepthi commented Nov 2, 2020

Documenting offline discussions so far. A decision is still pending; it requires a bit more investigation.

--
Let us start with what /healthz and /debug/health report today.
The description for healthz simply says "Health Check", whereas for debug/health we have "Query service healthcheck".
The current implementation of healthz requires the tablet to be serving (with the exception of SPARE, see Note), whereas debug/health requires that a simple query be successful (i.e. tests the whole path all the way to mysql and back).

However, if we go back to pre-7.0 (before the tabletmanager rewrite), healthz worked somewhat differently. (Credit to @enisoc for figuring this out.)
It was checking tabletserver's isServing only if shouldBeServing was true. shouldBeServing was false when DisableQueryService was set. SPARE was handled by changeCallback setting DisableQueryService when it sees tablet_type of SPARE.
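
Roughly, the pre-7.0 check was equivalent to something like the following (reconstructed for illustration, not the exact historical code):

	// Only complain about not serving if the tablet was supposed to be serving.
	// shouldBeServing was false when DisableQueryService was set (and SPARE was
	// handled by changeCallback setting DisableQueryService).
	if shouldBeServing && !isServing {
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))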

In summary, a SPARE tablet or a tablet with DisableQueryService set (for instance, because it is the source for a resharding that has completed) would actually report ok on the healthz endpoint.

So we can claim that changing healthz to report not ok for SPARE tablets or for tablets with DisableQueryService set was technically a breaking change in 7.0.

Where do we go from here?
The simple implementation of healthz that we had after the tabletmanager rewrite looked like this:

		if !tsv.sm.IsServing() {
			http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)

We should fix it to behave the same as it did pre-7.0, i.e. report ok if DisableQueryService is set or the tablet is of type SPARE, regardless of IsServing().
One way to accomplish that is to replace the check with the following:

		if tsv.sm.reason == "" && !tsv.sm.IsServing() {

This is under the naive assumption that if reason is set to a non-empty value, then we expect that the tablet is not serving for a valid reason and so we don't care if it is in fact NOT_SERVING.
It is not clear that this assumption is actually correct.
As of today, reason can be one of:

  1. TabletControl.DisableQueryService set
  2. master tablet with filtered replication on
  3. not a serving tablet type(%v)
  4. demotion in progress
  5. service stopped

For 1 and 2, we clearly want the tablet to report ok. The rest are not so clear-cut.
On 3, BACKUP and RESTORE are also non-serving types. Should we report ok during a backup/restore?
What about 4 & 5? What are the conditions under which we would say service stopped?
Might we introduce other values of reason in future? What happens then?

The other option is to be conservative and make exceptions for the known conditions. Then we would do something like this:

		if tsv.sm.Target().TabletType != topodatapb.TabletType_SPARE &&
			tsv.sm.reason != "TabletControl.DisableQueryService set" &&
			tsv.sm.reason != "master tablet with filtered replication on" &&
			!tsv.sm.IsServing() {

This is unsatisfying and no more future-proof. If we do introduce some other reason, we will need to evaluate how it affects healthz.

I would like us to figure out exactly what we want to do for 3, 4, and 5 in the list above before proceeding with a fix. We can update this issue once we have answers to these questions.

Note:
#6908 was an attempt to fix this for SPARE, hence the special handling in the code as it stands today, which reads:

		if tsv.sm.Target().TabletType != topodatapb.TabletType_SPARE && !tsv.sm.IsServing() {

@sougou
Contributor

sougou commented Nov 2, 2020

StateManager has wantState and state. I think healthy could be:

  • If wantState is StateServing, then state must be StateServing.
  • If wantState is not StateServing, then we always declare ourselves healthy?
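
In healthz terms, a minimal sketch of that rule might look like this (assuming the handler can read the StateManager fields directly and ignoring locking for brevity):

	// Report unhealthy only if we want to be serving but are not. If wantState is
	// anything other than StateServing (e.g. because DisableQueryService is set),
	// report ok even though the query service is down.
	if tsv.sm.wantState == StateServing && tsv.sm.state != StateServing {
		http.Error(w, "500 internal server error: vttablet is not serving", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))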

@deepthi
Member

deepthi commented Dec 1, 2020

> StateManager has wantState and state. I think healthy could be:
>
>   • If wantState is StateServing, then state must be StateServing.
>   • If wantState is not StateServing, then we always declare ourselves healthy?

This looks like a clean and elegant solution. Thanks @sougou!

@deepthi
Member

deepthi commented Dec 1, 2020

@rohit-nayak-ps I'm not making #7090 close this issue, assuming that you still want to fix the race condition.
We can close it once that is also addressed.

@mattlord
Contributor

I'm closing this as stale for now as I'm not sure there's anything left to do here today in v14/v15.

Please let me know if I missed or misunderstood something and we can re-open it. Thanks!
