[reparentutil] ERS should not attempt to WaitForRelayLogsToApply on primary tablets that were not running replication #7523
Conversation
This was in the old implementation, and was overlooked in the port to an encapsulated struct. I've added tests as penance. Signed-off-by: Andrew Mason <amason@slack-corp.com>
@ajm188 before digging into this are the unit test failures concerning? They feel topical given it's testing ERS even though it's not obvious if all ERS tests are related to this change given it's a different impl |
It's definitely worth looking into those, but they may be red herrings:
|
Hmmm, @setassociative looking again at #7464, in particular here, it seems this block I've added actually didn't exist anymore, and was removed in 6449 (I was looking at Anyway, the reason that I noticed this, was when writing tests for adding ERS to the new I'm confident this is the "issue" I saw, but I'm not confident it's an actual issue vs me accidentally creating an impossible situation in my test setup (but we should add a guard against nil input so we don't segfault during ERS either way), or how to best handle this. I think what we actually want is to just skip the MASTERs, because they aren't applying relay logs since they are not replicating, so something like:
func (erp *EmergencyReparenter) waitForAllRelayLogsToApply(/* blah blah blah */) (/* blah blah */) {
	errCh := make(chan error)
	defer close(errCh)
	groupCtx, groupCancel := context.WithTimeout(ctx, opts.WaitReplicasTimeout)
	defer groupCancel()
	replicaCount := 0
	for candidate := range validCandidates {
		tablet, ok := tabletMap[candidate]
		// handle !ok and return error
		if tablet.Type == topodatapb.TabletType_MASTER {
			// log
			continue
		}
		go func(alias string) {
			var err error
			defer func() { errCh <- err }()
			err = WaitForRelayLogsToApply(groupCtx, erp.tmc, tabletMap[alias], statusMap[alias])
		}(candidate)
		replicaCount++
	}
	errgroup := concurrency.ErrorGroup{
		NumGoroutines:        replicaCount,
		NumRequiredSuccesses: replicaCount,
		NumAllowedErrors:     0,
	}
	rec := errgroup.Wait(groupCancel, errCh)
	// rest of the function unchanged
}
cc @deepthi @PrismaPhonic to double-check my read here
The design is VERY intentional about not assuming who the masters are. This allows us to handle a host of edge cases where a tablet thinks it's master when it's really not.
I'm blocking on this until we can discuss it further, because it pushes against the design by trying to know who the master is. That was a very important part of our re-design (to intentionally not try to assume who the master was)
I am well aware of that design consideration. In my most recent comment, I point out that the change currently in this PR is not the change we want:
The actual issue, as I attempted to describe in my most recent comment, is that a MASTER tablet will return I agree that not making assumptions about who the primary is remains an important aspect of the design, but this needs to be fixed, as crashing during an ERS is also quite dangerous to the health of the cluster the ERS is trying to salvage.
…eration" This reverts commit c5bbcc7. Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
…ng StopReplication phase Signed-off-by: Andrew Mason <amason@slack-corp.com>
I've pushed, in order:
Assuming we're all on board with this, I'll update the PR description and dig in to any other tests that broke as a result; and, again, assuming everything's good, when merging the final version of this, rebase to remove the original commit + revert of that commit, to clean up the git branch. |
LGTM |
This looks like a much better approach. Thank you!
OP vitessio#7523 This basically protects from trying to catch up on replication on hosts that are likely not replicating. Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Description
This was either in the previous previous implementation, or not at all; I haven't gone back far enough to check.
I've added tests as penance.
The root issue is that a MASTER tablet will return mysql.ErrNotReplica from tmc.StopReplicationAndGetStatus, which will cause us to add an entry for that tablet in the primaryStatusMap and not the statusMap, in StopReplicationAndBuildStatusMap. Then, when we attempt to wait for all valid candidates to apply their relay logs (which can include MASTER tablets), we end up passing a nil Status to WaitForRelayLogsToApply, causing the segfault.
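The failure mode described above can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the Vitess implementation: `errNotReplica`, `ReplStatus`, `PrimaryStatus`, `stopReplication`, `buildStatusMaps` and `waitForRelayLogs` are hypothetical simplifications of `mysql.ErrNotReplica`, the status types, `tmc.StopReplicationAndGetStatus`, `StopReplicationAndBuildStatusMap` and `WaitForRelayLogsToApply`. It shows how a tablet that lands only in the primary map yields a nil status on lookup, and how a nil guard turns the segfault into an ordinary error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReplica is a hypothetical stand-in for mysql.ErrNotReplica.
var errNotReplica = errors.New("not a replica")

// ReplStatus and PrimaryStatus are simplified stand-ins for the real
// replication and primary status types.
type ReplStatus struct{ RelayLogPosition string }
type PrimaryStatus struct{ Position string }

// stopReplication mimics a stop-replication RPC: tablets listed in
// primaries report errNotReplica instead of a replication status.
func stopReplication(alias string, primaries map[string]bool) (*ReplStatus, error) {
	if primaries[alias] {
		return nil, errNotReplica
	}
	return &ReplStatus{RelayLogPosition: "pos-" + alias}, nil
}

// buildStatusMaps records each tablet in exactly one of the two maps,
// depending on whether it reported errNotReplica.
func buildStatusMaps(aliases []string, primaries map[string]bool) (map[string]*ReplStatus, map[string]*PrimaryStatus) {
	statusMap := map[string]*ReplStatus{}
	primaryStatusMap := map[string]*PrimaryStatus{}
	for _, alias := range aliases {
		status, err := stopReplication(alias, primaries)
		if errors.Is(err, errNotReplica) {
			primaryStatusMap[alias] = &PrimaryStatus{Position: "pos-" + alias}
			continue
		}
		statusMap[alias] = status
	}
	return statusMap, primaryStatusMap
}

// waitForRelayLogs rejects a nil status instead of dereferencing it.
func waitForRelayLogs(alias string, status *ReplStatus) error {
	if status == nil {
		return fmt.Errorf("no replication status for tablet %s", alias)
	}
	_ = status.RelayLogPosition // real code would wait for this position
	return nil
}

func main() {
	aliases := []string{"zone1-100", "zone1-101"}
	statusMap, primaryStatusMap := buildStatusMaps(aliases, map[string]bool{"zone1-100": true})
	fmt.Println(len(statusMap), len(primaryStatusMap))
	// zone1-100 is absent from statusMap, so the lookup yields a nil pointer.
	for _, alias := range aliases {
		fmt.Println(alias, waitForRelayLogs(alias, statusMap[alias]))
	}
}
```

Filtering on membership in the status map, rather than on tablet type, keeps the check tied to what each tablet actually reported when replication was stopped.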
Related Issue(s)
EmergencyReparentShard logic to dedicated struct and add unit tests #7464
Checklist
EmergencyReparentShard logic to dedicated struct and add unit tests #7464 isn't released, so no need to backport)
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: