Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: don't emit unstable CommittedEntries #14413

Merged
merged 36 commits into from
Sep 22, 2022
Merged

Conversation

tbg
Copy link
Contributor

@tbg tbg commented Sep 1, 2022

Fixes #14370.

When run in a single-voter configuration, prior to this PR
raft would emit HardStates that would emit a proposed Entry
simultaneously in CommittedEntries and Entries.

To be correct, this requires users of the raft library to enforce an
ordering between appending to the log and notifying the client about
CommittedEntries also present in Entries. This was easy to miss.

Walk back this behavior to arrive at a simpler contract: what's
emitted in CommittedEntries is truly committed, i.e. present
in stable storage on a quorum of voters.

This in turn pessimizes the single-voter case: rather than fully
handling an Entry in just one Ready, now two are required,
and in particular one has to do extra work to save on allocations.

We accept this as a good tradeoff, since raft primarily serves
multi-voter configurations.


Suggested review plan:

  • Look at raft: don't emit unstable CommittedEntries first to get an idea of the gist of the change
  • Look at raft: directly update leader in advance for an alternative solution that "inlines" the relevant bits of r.Step instead.
  • The test suites pass with either solution, so the two solutions are functionally equivalent to the degree that we're exercising them.
  • Look at raft: add BenchmarkRawNode and at the benchmark results in raft: benchmark results
  • If there are substantial concerns and/or comments, stop here and sort them out with me.
  • Then slog through the remaining commits, which are all going to be squashed before the merge, and fix up the tests one-by-one.
  • Note: raft: always mark leader as RecentActive was included here since otherwise the outputs of the two possible solutions differ in sometimes failing to track the leader as recently active.

@codecov-commenter

This comment was marked as outdated.

@tbg tbg force-pushed the raft-single-voter branch 2 times, most recently from 028ac1d to acb7e1d Compare September 7, 2022 09:09
@serathius
Copy link
Member

Rerun the e2e flake. This PR is still failing DCO, please run git rebase origin/main --signoff and push head with force git push origin HEAD:raft-single-voter -f.

@tbg tbg force-pushed the raft-single-voter branch 3 times, most recently from 1e6d038 to ba04f6c Compare September 8, 2022 10:28
@tbg tbg requested a review from ptabor September 8, 2022 10:43
@tbg tbg assigned ahrtr Sep 8, 2022
@tbg tbg requested a review from ahrtr September 8, 2022 10:43
@tbg tbg unassigned ahrtr Sep 8, 2022
@tbg tbg marked this pull request as ready for review September 8, 2022 10:46
Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me. It feels like the "proper" solution because it simplifies the Ready contract and avoids the need for special-casing of single-peer deployments. After this change, entries in the CommittedEntries slice are always committed.

I actually made very similar changes in a prototype (nvanbenschoten@1d1fa32) intended to address #12257 / cockroachdb/cockroach#17500. These kinds of extensions will be easier to make if the Raft leader doesn't immediately consider an entry to be durable upon appending to its own unstable log (in raft.appendEntry). Things are cleaner if, instead, the durability status is maintained either implicitly in raft.advance or explicitly through a different (opt-in) API if clients want the flexibility to sync their log asynchronously.

From the perspective of CRDB's use of Raft, we weren't susceptible to the bug in #14370 because log application always came after log appending in the Ready processing state machine loop. However, we were aware of this idiosyncrasy and had to work around it when acking proposals ahead of application, see: https://github.com/cockroachdb/cockroach/blob/ce55e1b46604401f6085a6ece1355970da94600a/pkg/kv/kvserver/replica_raft.go#L834-L844

The comment there is interesting because it references a second case where Entries and CommittedEntries can overlap — that being when a follower in a multi-node replication group is catching up after falling behind. In those cases, the overlapping entries are already committed, so there's no durability risk from applying CommittedEntries before appending Entries (like etcdserver does). However, that still means that it's possible to have applied an entry that is not durably stored locally in the Raft log. I don't know if that could cause issues for etcd or whether entry application is idempotent.

raft/raft.go Outdated Show resolved Hide resolved
raft/raft.go Show resolved Hide resolved
raft/raft.go Outdated Show resolved Hide resolved
raft/node_util_test.go Outdated Show resolved Hide resolved
raft/rawnode_test.go Outdated Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented Sep 9, 2022

The production code change looks simple & good, but the test code change is too big. This PR almost follows the same logic as #14411, but this PR is much bigger, even excluding the new benchmark test case.

Another comment... Have you compared the performance as compared to the main branch using the benchmark tool? I suggest to compare both single-node cluster and multi-node cluster. FYI. #14394 (comment) and #14394 (comment)

@tbg
Copy link
Contributor Author

tbg commented Sep 9, 2022

Thanks for the review, @ahrtr. I'm not responding to the inline comments yet since you seem to have general concerns about the size of the change.

I think it comes down to the discussion in #14370 (comment).

You're fixing *node, I'm fixing raft. So I have to update more tests.
Also, you keep the old behavior for empty entries for reasons that aren't clear to me. I think this prevents you from having to adjust more tests.

Usually empty entries will only be generated in multi-node cluster. Since multi-node clusters do not have this issue, so I think it makes sense to keep the optimization (update the pr.Match immediately after appending it).

In my book, a special case needs a compelling reason. Special cases add complexity, and this complexity has to be made up for by some benefit. I cannot see that benefit here.

Some of the tests were pretty confused about whether they wanted to make statements about entry append or entry commit or entry apply. This is because the problem we're fixing made these things all the same in a single-node cluster in the same cycle.

I made sure to update the tests to be more robust (especially the TestNode.* tests are pretty annoying when they fail, prior to my improvements) and to not get too crufty. That is some extra work, but this PR should be judged by its changes to production code, not the corresponding updates in tests.

  1. If you remove the optimization (mentioned above) so as to keep the behavior consistent for both empty and non-empty entry, then likely there are more subtle edge cases you need to handle, accordingly it might further complicate the raft package; and they are exactly what I have been trying to avoid.

Could you elaborate on this? Two cases = more complexity. In this PR, the empty entry is just an entry; why make it more complex? There should be fewer subtle edge cases now - namely none at all - and that is what we are after. What am I missing?

  1. Actually [Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change) #14411 doesn't almost change anything on the raft protocol, which I think is better. But I agree it has some impact on the direct users on RawNode.

We can agree to disagree here - *raft is the thing that produces Ready and the semantics of Ready are what need to be fixed. We could fix it in *RawNode but RawNode is intentionally a thin interface wrapper around *raft (similar to *node is just a goroutine around *RawNode). Spreading implementation details between them is not going to make things clearer. We shouldn't start to bolt on logic outside of *raft because we're scared of changing *raft.

@ahrtr
Copy link
Member

ahrtr commented Sep 9, 2022

Thanks for the feedback.

Could you elaborate on this? Two cases = more complexity. In this PR, the empty entry is just an entry; why make it more complex? There should be fewer subtle edge cases now - namely none at all - and that is what we are after. What am I missing?

I just kept the legacy behavior on empty entry so as to keep the PR(14411) as simple&safe as possible, and it's the reason why the size of 14411 is much smaller than this one. I agree that regarding the empty entry as a special case isn't good, although it was consistent with the legacy behavior.

Some of the tests were pretty confused about whether they wanted to make statements about entry append or entry commit or entry apply. This is because the problem we're fixing made these things all the same in a single-node cluster in the same cycle.

Thanks for the clarification.

That is some extra work, but this PR should be judged by its changes to production code, not the corresponding updates in tests.

Partially agreed. We also need to make sure the quality of test code. The reason why I raised the concern on the size on the test code is just to understand the reason of the bigger size. Good to confirmed that one of the major reasons is due to the removal of the special case on empty entry.

Overall looks good to me. Please,

  1. Consider to resolve the inline minor comments;
  2. Consider to squash the commits into a couple of commits if possible.

@tbg
Copy link
Contributor Author

tbg commented Sep 12, 2022

Thanks @ahrtr. I'm travelling this week, so there won't be activity on this PR until next week but I plan to return to it the week after. I am also planning to validate the perf impact of this change with the etcd suite as you suggested, and also against CRDB.

@tbg tbg force-pushed the raft-single-voter branch from ba04f6c to df4e222 Compare September 19, 2022 08:34
@tbg tbg deleted the raft-single-voter branch September 23, 2022 07:31
tbg added a commit to tbg/cockroach that referenced this pull request Sep 29, 2022
This picks up etcd-io/etcd#14413.

Closes cockroachdb#87264.

Release note: None
@tbg tbg mentioned this pull request Oct 10, 2022
3 tasks
tbg added a commit to tbg/cockroach that referenced this pull request Oct 10, 2022
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 13, 2022
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 14, 2022
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 17, 2022
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/etcd that referenced this pull request Oct 27, 2022
This commit introduces an intermediate state that delays the
acknowledgement of a node's self-vote during an election until
that vote has been durably persisted (i.e. on the next call to
Advance). This change can be viewed as the election counterpart
to etcd-io#14413.

This is an intermediate state that limits code movement for the
rest of the async storage writes change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 2, 2022
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Nov 3, 2022
89632: go.mod: bump raft r=nvanbenschoten a=tbg

```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Compared single-node performance on gceworker via

```bash
#!/bin/bash
set -euxo pipefail
pkill -9 cockroach || true
rm -rf cockroach-data
cr=./cockroach-$1
$cr start-single-node --background --insecure
$cr workload init kv
$cr workload run kv --splits 100 --max-rate 2000 --duration 10m --read-percent 0 --min-block-bytes 10 --max-block-bytes 10 | tee $1.txt
```

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        1199604         1999.3      6.7      7.1     10.0     11.0     75.5  write #master
  600.0s        0        1199614         1999.4      6.8      7.1     10.0     11.0     79.7  write #PR
```

Closes #87264.

- [x] [make it build](#88985 (comment))
- [x] remove the maxIndex param and handling from Task.AckCommittedEntriesBeforeApplication
- [x] check that single node write latencies don't regress

Release note: None


91117: sql: reduce the overhead of EXPLAIN ANALYZE r=yuzefovich a=yuzefovich

In order to propagate the execution stats across the distributed query plan we use the tracing infrastructure, where each stats object is added as "structured metadata" to the trace. Thus, whenever we're collecting the exec stats for a statement, we must enable tracing. Previously, in many cases we would enable it at the highest verbosity level which has non-trivial overhead. In some cases this was an overkill (e.g. in `EXPLAIN ANALYZE` we don't really care about the trace containing all of the gory details - we won't expose it anyway), so this is now fixed by using the less verbose "structured" verbosity level. As a concrete example of the difference: for a stmt that without `EXPLAIN ANALYZE` takes around 190ms, with `EXPLAIN ANALYZE` it would previously run for about 1.8s and now it takes around 210ms.

This required some minor changes to the row-by-row outbox and router
setups to collect thats even if the recording is not verbose.

Addresses: #90739.

Epic: None

Release note (performance improvement): The overhead of running `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE (DISTSQL)` has been significantly reduced. The overhead of `EXPLAIN ANALYZE (DEBUG)` didn't change.

91119: roachprod: improve error in ParallelE r=smg260 a=tbg

Prior to this commit, the error's stack trace did not link back
to the caller of `ParallelE`. Now it does.

Epic: none
Release note: None


91126: dev: allow whitespace separated regexps for testlogic files r=ajwerner a=ajwerner

This was a feature of `make testlogic` and it was liked.

Fixes #91125

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/etcd that referenced this pull request Nov 10, 2022
This commit introduces an intermediate state that delays the
acknowledgement of a node's self-vote during an election until
that vote has been durably persisted (i.e. on the next call to
Advance). This change can be viewed as the election counterpart
to etcd-io#14413.

This is an intermediate state that limits code movement for the
rest of the async storage writes change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/etcd that referenced this pull request Dec 12, 2022
This commit introduces an intermediate state that delays the
acknowledgement of a node's self-vote during an election until
that vote has been durably persisted (i.e. on the next call to
Advance). This change can be viewed as the election counterpart
to etcd-io#14413.

This is an intermediate state that limits code movement for the
rest of the async storage writes change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Durability API guarantee broken in single node cluster
5 participants