raft: don't emit unstable CommittedEntries #14413
Rerun the e2e flake. This PR is still failing DCO, please run |
This change makes sense to me. It feels like the "proper" solution because it simplifies the `Ready` contract and avoids the need for special-casing of single-peer deployments. After this change, entries in the `CommittedEntries` slice are always committed.
I actually made very similar changes in a prototype (nvanbenschoten@1d1fa32) intended to address #12257 / cockroachdb/cockroach#17500. These kinds of extensions will be easier to make if the Raft leader doesn't immediately consider an entry to be durable upon appending to its own unstable log (in `raft.appendEntry`). Things are cleaner if, instead, the durability status is maintained either implicitly in `raft.advance` or explicitly through a different (opt-in) API if clients want the flexibility to sync their log asynchronously.
From the perspective of CRDB's use of Raft, we weren't susceptible to the bug in #14370 because log application always came after log appending in the Ready processing state machine loop. However, we were aware of this idiosyncrasy and had to work around it when acking proposals ahead of application, see: https://github.com/cockroachdb/cockroach/blob/ce55e1b46604401f6085a6ece1355970da94600a/pkg/kv/kvserver/replica_raft.go#L834-L844
The comment there is interesting because it references a second case where `Entries` and `CommittedEntries` can overlap, that being when a follower in a multi-node replication group is catching up after falling behind. In those cases, the overlapping entries are already committed, so there's no durability risk from applying `CommittedEntries` before appending `Entries` (like `etcdserver` does). However, that still means that it's possible to have applied an entry that is not durably stored locally in the Raft log. I don't know if that could cause issues for etcd or whether entry application is idempotent.
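To make the overlap case concrete, here is a minimal sketch of a `Ready` handler that appends before it applies. The `Entry` and `Ready` types, `overlap`, and `handleReady` are hypothetical stand-ins written for this illustration; they are not the raft library's API, and a real handler would sync the log to disk between the two steps.

```go
package main

import "fmt"

// Entry and Ready are simplified, hypothetical stand-ins for the raft
// library's types, trimmed to the fields this sketch needs.
type Entry struct {
	Index uint64
	Data  string
}

type Ready struct {
	Entries          []Entry // must be appended to stable storage
	CommittedEntries []Entry // may be applied to the state machine
}

// overlap returns the indexes present in both Entries and CommittedEntries.
func overlap(rd Ready) []uint64 {
	pending := make(map[uint64]bool, len(rd.Entries))
	for _, e := range rd.Entries {
		pending[e.Index] = true
	}
	var both []uint64
	for _, e := range rd.CommittedEntries {
		if pending[e.Index] {
			both = append(both, e.Index)
		}
	}
	return both
}

// handleReady appends first and applies second, so nothing is applied
// before it is in the local log. A real implementation would sync the
// log to disk between the two steps.
func handleReady(rd Ready, stored, applied *[]Entry) {
	*stored = append(*stored, rd.Entries...)
	*applied = append(*applied, rd.CommittedEntries...)
}

func main() {
	// A follower catching up: index 5 is both new to the local log and
	// already committed by the rest of the group.
	rd := Ready{
		Entries:          []Entry{{Index: 5, Data: "a"}, {Index: 6, Data: "b"}},
		CommittedEntries: []Entry{{Index: 4, Data: "z"}, {Index: 5, Data: "a"}},
	}
	var stored, applied []Entry
	handleReady(rd, &stored, &applied)
	fmt.Println("overlapping indexes:", overlap(rd))
	fmt.Println("appended:", len(stored), "applied:", len(applied))
}
```

With this ordering, the overlapping entry is in the local log by the time it is applied, regardless of whether the overlap comes from a single-voter leader or a follower that is catching up.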
The production code change looks simple & good, but the test code change is too big. This PR almost follows the same logic as #14411, but this PR is much bigger, even excluding the new benchmark test case. Another comment... Have you compared the performance as compared to the |
Thanks for the review, @ahrtr. I'm not responding to the inline comments yet since you seem to have general concerns about the size of the change. I think it comes down to the discussion in #14370 (comment). You're fixing
In my book, a special case needs a compelling reason. Special cases add complexity, and this complexity has to be made up for by some benefit. I cannot see that benefit here. Some of the tests were pretty confused about whether they wanted to make statements about entry append or entry commit or entry apply. This is because the problem we're fixing made these things all the same in a single-node cluster in the same cycle. I made sure to update the tests to be more robust (especially the
Could you elaborate on this? Two cases = more complexity. In this PR, the empty entry is just an entry; why make it more complex? There should be fewer subtle edge cases now - namely none at all - and that is what we are after. What am I missing?
We can agree to disagree here - *raft is the thing that produces |
Thanks for the feedback.
I kept the legacy behavior for the empty entry to keep the PR (#14411) as simple and safe as possible, which is why #14411 is much smaller than this one. I agree that treating the empty entry as a special case isn't good, although it was consistent with the legacy behavior.
Thanks for the clarification.
Partially agreed. We also need to ensure the quality of the test code. I raised the concern about the size of the test code change only to understand why it was bigger. Good to have confirmed that one of the major reasons is the removal of the special case for the empty entry. Overall looks good to me. Please,
Thanks @ahrtr. I'm travelling this week, so there won't be activity on this PR until next week but I plan to return to it the week after. I am also planning to validate the perf impact of this change with the etcd suite as you suggested, and also against CRDB.
This picks up etcd-io/etcd#14413. Closes cockroachdb#87264. Release note: None
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f
go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
This commit introduces an intermediate state that delays the acknowledgement of a node's self-vote during an election until that vote has been durably persisted (i.e. on the next call to Advance). This change can be viewed as the election counterpart to etcd-io#14413. This is an intermediate state that limits code movement for the rest of the async storage writes change. Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
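As a rough illustration of the idea in this commit message, the toy model below counts a candidate's self-vote only once `advance` has been called, standing in for "the vote has been durably persisted". The `candidate` type and its methods are hypothetical names for this sketch and do not mirror the raft package's internals.

```go
package main

import "fmt"

// candidate is a hypothetical toy model of an election tally. It is not
// the raft library's implementation.
type candidate struct {
	voters          int  // number of voters in the configuration
	granted         int  // votes counted so far
	selfVotePending bool // self-vote recorded but not yet persisted
}

// campaign starts an election. The self-vote is recorded as pending
// instead of being counted right away.
func (c *candidate) campaign() {
	c.granted = 0
	c.selfVotePending = true
}

// advance models the application signalling that pending state has been
// persisted; only now does the self-vote count.
func (c *candidate) advance() {
	if c.selfVotePending {
		c.selfVotePending = false
		c.granted++
	}
}

// recvVote counts a vote granted by a peer.
func (c *candidate) recvVote() { c.granted++ }

// won reports whether a majority of voters has granted its vote.
func (c *candidate) won() bool { return c.granted > c.voters/2 }

func main() {
	c := &candidate{voters: 1}
	c.campaign()
	fmt.Println("won before advance:", c.won())
	c.advance()
	fmt.Println("won after advance:", c.won())
}
```

In this model a single-voter candidate cannot win within the same cycle in which it campaigns, which parallels how #14413 keeps an entry out of `CommittedEntries` until it is stable.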
89632: go.mod: bump raft r=nvanbenschoten a=tbg

```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f
go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Compared single-node performance on gceworker via

```bash
#!/bin/bash
set -euxo pipefail
pkill -9 cockroach || true
rm -rf cockroach-data
cr=./cockroach-$1
$cr start-single-node --background --insecure
$cr workload init kv
$cr workload run kv --splits 100 --max-rate 2000 --duration 10m --read-percent 0 --min-block-bytes 10 --max-block-bytes 10 | tee $1.txt
```

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        1199604         1999.3      6.7      7.1     10.0     11.0     75.5  write #master
  600.0s        0        1199614         1999.4      6.8      7.1     10.0     11.0     79.7  write #PR
```

Closes #87264.

- [x] [make it build](#88985 (comment))
- [x] remove the maxIndex param and handling from Task.AckCommittedEntriesBeforeApplication
- [x] check that single node write latencies don't regress

Release note: None

91117: sql: reduce the overhead of EXPLAIN ANALYZE r=yuzefovich a=yuzefovich

In order to propagate the execution stats across the distributed query plan we use the tracing infrastructure, where each stats object is added as "structured metadata" to the trace. Thus, whenever we're collecting the exec stats for a statement, we must enable tracing. Previously, in many cases we would enable it at the highest verbosity level, which has non-trivial overhead. In some cases this was overkill (e.g. in `EXPLAIN ANALYZE` we don't really care about the trace containing all of the gory details - we won't expose it anyway), so this is now fixed by using the less verbose "structured" verbosity level. As a concrete example of the difference: for a stmt that without `EXPLAIN ANALYZE` takes around 190ms, with `EXPLAIN ANALYZE` it would previously run for about 1.8s and now it takes around 210ms.

This required some minor changes to the row-by-row outbox and router setups to collect stats even if the recording is not verbose.

Addresses: #90739.

Epic: None

Release note (performance improvement): The overhead of running `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE (DISTSQL)` has been significantly reduced. The overhead of `EXPLAIN ANALYZE (DEBUG)` didn't change.

91119: roachprod: improve error in ParallelE r=smg260 a=tbg

Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does.

Epic: none

Release note: None

91126: dev: allow whitespace separated regexps for testlogic files r=ajwerner a=ajwerner

This was a feature of `make testlogic` and it was liked.

Fixes #91125

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Fixes #14370.
When run in a single-voter configuration, prior to this PR raft would emit `HardState`s that would emit a proposed `Entry` simultaneously in `CommittedEntries` and `Entries`.

To be correct, this requires users of the raft library to enforce an ordering between appending to the log and notifying the client about `CommittedEntries` also present in `Entries`. This was easy to miss.

Walk back this behavior to arrive at a simpler contract: what's emitted in `CommittedEntries` is truly committed, i.e. present in stable storage on a quorum of voters.

This in turn pessimizes the single-voter case: rather than fully handling an `Entry` in just one `Ready`, now two are required, and in particular one has to do extra work to save on allocations.

We accept this as a good tradeoff, since raft primarily serves multi-voter configurations.
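The two-`Ready` cycle for a single voter can be sketched with a toy model. Everything here (`toyNode`, `ready`, `propose`, `advance`) is a hypothetical simplification that tracks only log indexes; it is not `RawNode` and says nothing about the allocation work mentioned above.

```go
package main

import "fmt"

// toyNode is a hypothetical single-voter model that tracks only indexes.
type toyNode struct {
	lastIndex   uint64 // last index in the in-memory (unstable) log
	stableIndex uint64 // last index reported durable by the application
	commitIndex uint64 // last index known to be committed
	applied     uint64 // last index handed out for application
}

// ready is a trimmed stand-in for a Ready struct, holding indexes only.
type ready struct {
	Entries          []uint64 // indexes to append to stable storage
	CommittedEntries []uint64 // indexes that are safe to apply
}

// propose appends one entry to the unstable log.
func (n *toyNode) propose() { n.lastIndex++ }

// ready emits unstable entries and committed-but-unapplied entries.
func (n *toyNode) ready() ready {
	var rd ready
	for i := n.stableIndex + 1; i <= n.lastIndex; i++ {
		rd.Entries = append(rd.Entries, i)
	}
	for i := n.applied + 1; i <= n.commitIndex; i++ {
		rd.CommittedEntries = append(rd.CommittedEntries, i)
	}
	return rd
}

// advance acknowledges that rd.Entries are durable and that
// rd.CommittedEntries were applied. With one voter, the quorum is the
// node itself, so the commit index follows the stable index.
func (n *toyNode) advance(rd ready) {
	if k := len(rd.Entries); k > 0 {
		n.stableIndex = rd.Entries[k-1]
	}
	if k := len(rd.CommittedEntries); k > 0 {
		n.applied = rd.CommittedEntries[k-1]
	}
	n.commitIndex = n.stableIndex
}

func main() {
	n := &toyNode{}
	n.propose()

	rd1 := n.ready() // first Ready: append only
	fmt.Println(rd1.Entries, rd1.CommittedEntries)
	n.advance(rd1)

	rd2 := n.ready() // second Ready: the entry is now committed
	fmt.Println(rd2.Entries, rd2.CommittedEntries)
	n.advance(rd2)
}
```

In this model the proposed entry shows up in `Entries` in the first cycle and in `CommittedEntries` only in the second, after `advance` has acknowledged that it is stable.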

Suggested review plan:

- `raft: don't emit unstable CommittedEntries` first to get an idea of the gist of the change
- `raft: directly update leader in advance` for an alternative solution that "inlines" the relevant bits of `r.Step` instead.
- `raft: add BenchmarkRawNode` and at the benchmark results in `raft: benchmark results`
- `raft: always mark leader as RecentActive` was included here since otherwise the outputs of the two possible solutions differ in sometimes failing to track the leader as recently active.