mvcc: fully concurrent read #10523
Conversation
Benchmark setup and results: write response time histograms and long read response time histograms, for the baseline and for this PR.
Please ignore the conflict; this PR needs a rebase onto #10506.
@jingyih Have you done a throughput benchmark? Copying the cache might affect throughput but not the latency.
We do plan to get more benchmark results. Do you have a suggestion on what kind of workload we should use? Copying the cache will affect the system most when the cache size is large, i.e., when the system is write heavy. We plan to design a workload to simulate the worst case. We are also thinking about adding metrics (only for debugging purposes) to measure the cache size and the time it takes to make the copy.
Edit (5/9/2019): this statement is not correct.
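For illustration only, a debug-level metric for the copy cost could look roughly like the sketch below, using the Prometheus client etcd already uses for its metrics. The metric name, buckets, and the unsafeCopy call are assumptions made for the example, not code from this PR.

```go
package backend

import "github.com/prometheus/client_golang/prometheus"

// Illustrative, debug-only histogram for read-buffer copy time (hypothetical name).
var readBufCopyDurationSec = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "etcd_debugging",
	Subsystem: "mvcc",
	Name:      "read_buffer_copy_duration_seconds",
	Help:      "Time taken to copy the backend read buffer for a concurrent read transaction.",
	Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 14), // 0.1ms .. ~0.8s
})

func init() { prometheus.MustRegister(readBufCopyDurationSec) }

// Usage around the (hypothetical) buffer copy:
//   start := time.Now()
//   buf := b.readTx.buf.unsafeCopy()
//   readBufCopyDurationSec.Observe(time.Since(start).Seconds())
```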
(force-pushed 6bff469 to 9da1f23)
(force-pushed 9da1f23 to a73fb85)
Rebased onto master.
Throughput comparison: short reads
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR: summary, response time histogram, and latency distribution.
Throughput comparison: writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR: summary, response time histogram, and latency distribution.
Throughput comparison: short reads + writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR, with reads and writes reported separately: summary, response time histogram, and latency distribution.
Throughput comparison: long reads + writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR, with long reads and writes reported separately: summary, response time histogram, and latency distribution.
@xiang90
Throughput comparison: long reads + (short) reads + writes
Results for Master and for Master + PR, with long reads, (short) reads, and writes reported separately: summary, response time histogram, and latency distribution.
GCE 5k node scalability test result
Sig-scalability lead @wojtek-t helped me run the test and provided a quick analysis.
Result compared to baseline
Using test run 1130383116527996934 as the baseline. It is right before 1130745634945503235.
The P99 of POST pod is still bad. Probably worth some effort to dig into.
My understanding is that there was a scalability regression back then; the scalability team spent a couple of weeks trying to root-cause it. (Not sure about the current status.) This test was run during their debugging effort. The P99 of POST pod is bad, but not due to this PR.
(force-pushed 119134b to 19ee3a9)
Add TestConcurrentReadTxAndWrite, which creates random reads and writes and ensures reads always see the latest writes.
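For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the invariant such a test checks: a read that starts after a write has fully committed must observe that write (or a later one). The toyStore type and all names below are illustrative stand-ins, not the actual TestConcurrentReadTxAndWrite added in this PR, which exercises the real mvcc store with random reads and writes.

```go
package toystore

import (
	"sync"
	"sync/atomic"
	"testing"
)

// toyStore is an illustrative stand-in for the mvcc store.
type toyStore struct {
	mu  sync.RWMutex
	val int64
}

func (s *toyStore) Put(v int64) { s.mu.Lock(); s.val = v; s.mu.Unlock() }
func (s *toyStore) Get() int64  { s.mu.RLock(); defer s.mu.RUnlock(); return s.val }

func TestConcurrentReadAndWrite(t *testing.T) {
	s := &toyStore{}
	var committed int64 // highest value the writer has fully committed

	done := make(chan struct{})
	go func() { // single writer
		defer close(done)
		for v := int64(1); v <= 10000; v++ {
			s.Put(v)
			atomic.StoreInt64(&committed, v)
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // concurrent readers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				default:
				}
				// Any value committed before this read started must be visible.
				min := atomic.LoadInt64(&committed)
				if got := s.Get(); got < min {
					t.Errorf("read %d, want at least %d", got, min)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```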
(force-pushed 19ee3a9 to 2a9320e)
Yes, that is what I would expect.
I think locally it takes between 1 and 2 seconds to finish. On Travis I did not see timing for each individual unit test. I compared the finishing time of the mvcc package with and without the new test; it is comparable.
LGTM. Maybe @jpbetz wants to give this a final review.
Thanks! @xiang90
mvcc/backend/backend.go
Outdated
@@ -493,6 +518,7 @@ func (b *backend) begin(write bool) *bolt.Tx {
	db := tx.DB()
	atomic.StoreInt64(&b.size, size)
	atomic.StoreInt64(&b.sizeInUse, size-(int64(db.Stats().FreePageN)*int64(db.Info().PageSize)))
nit: Since the db.Stats() call acquires a read lock, only call it once here (instead of once for FreePageN and once for OpenTxN)?
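For clarity, the suggestion amounts to something like the fragment below, taking a single Stats() snapshot and reusing it; the openReadTxN field is an assumption about where the OpenTxN value ends up, not a quote of the final diff.

```go
// Take one Stats() snapshot so the stats read lock is acquired only once.
stats := db.Stats()
atomic.StoreInt64(&b.size, size)
atomic.StoreInt64(&b.sizeInUse, size-(int64(stats.FreePageN)*int64(db.Info().PageSize)))
atomic.StoreInt64(&b.openReadTxN, int64(stats.OpenTxN)) // assumed field tracking open read Txs
```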
Sounds good.
mvcc/backend/batch_tx.go
Outdated
@@ -323,6 +319,17 @@ func (t *batchTxBuffered) unsafeCommit(stop bool) {
	}
}

func waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup, lg *zap.Logger) {
Should this be a receiver function of batchTxBuffered? The parameters don't all appear necessary.
tx and wg need to be passed into the function. batchTxBuffered's tx and wg will be reset / re-created immediately after this function call, whereas this function waits until all ongoing reads using the input tx are done.
Making this a receiver and moving the logger out of the params might make that even more clear.
Did you mean something like func (lg *zap.Logger) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup)? I have mixed feelings about this. By convention (in this repo), the zap logger is passed to functions as an input parameter.
I was thinking more like func (t *batchTxBuffered) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup), so that the only params that need to be passed are the ones that are semantically important to the call. But this is minor.
Or just inline the function in unsafeCommit? It doesn't seem to benefit from being separate.
Good idea.
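A rough sketch of what the inlined version might look like, within the existing mvcc/backend package (field and method names such as readTx.txWg, readTx.reset, and backend.lg are assumptions based on this thread, not the exact code that landed):

```go
func (t *batchTxBuffered) unsafeCommit(stop bool) {
	if t.backend.readTx.tx != nil {
		// Wait for all ongoing reads on the old read Tx to finish, then roll it
		// back, without blocking creation of the new read transaction below.
		go func(tx *bolt.Tx, wg *sync.WaitGroup) {
			wg.Wait()
			if err := tx.Rollback(); err != nil {
				t.backend.lg.Fatal("cannot rollback tx", zap.Error(err))
			}
		}(t.backend.readTx.tx, t.backend.readTx.txWg)
		t.backend.readTx.reset()
	}

	t.batchTx.commit(stop)

	if !stop {
		t.backend.readTx.tx = t.backend.begin(false)
	}
}
```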
mvcc/backend/read_tx.go
Outdated
func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock()  {}
I'm a bit surprised we don't do anything in RLock(). Should we try to adhere to the lock contract more closely (i.e. move logic from ConcurrentReadTx to here)? If not, we should clearly document what's going on.
Should Lock() and Unlock() be no-ops? Alternatives would be to delegate to RLock()/RUnlock() or to panic.
This one is tricky. ConcurrentReadTx() holds a lock while it is being created (and everything needs to happen while holding that lock, so we cannot really move part of the logic to a separate function). After creation, there is no need for, or concept of, locking and unlocking. So these functions end up being no-ops. They only exist so that *concurrentReadTx implements the ReadTx interface.
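Concretely, the documented no-op methods might look roughly like the sketch below; the comment wording, the RUnlock body, and the txWg field name are illustrative assumptions based on this discussion, not the final code.

```go
// Lock, Unlock, and RLock are no-ops for concurrentReadTx: all locking happens
// while the transaction is being created inside ConcurrentReadTx(), which runs
// with the backend's read-Tx lock held. After creation there is nothing left to
// lock; these methods exist only so that *concurrentReadTx satisfies ReadTx.
func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock()  {}

// RUnlock signals that this read is done, so the write path can roll back the
// underlying boltdb read Tx once all readers sharing it have finished.
func (rt *concurrentReadTx) RUnlock() { rt.txWg.Done() } // txWg is an assumed field name
```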
Adding some comments with what you just said would certainly make it easier to understand.
The violation of the ReadTx interface guarantees does make for some odd behavior, e.g.:
tx := ConcurrentReadTx()
tx.RLock()
// do something
tx.RUnlock()
tx.RLock()
// do something
tx.RUnlock()
This would certainly not do what a developer might expect.
I agree. I am not happy with the current state. The current implementation of ConcurrentReadTx is compromised in readability because I wanted to make minimal changes to the various other existing interfaces. (I tried to define a new interface for the concurrent read Tx; it caused ripple effects and the end result was kind of messy.) Fortunately the ReadTx interface is used internally in the mvcc backend; developers normally do not need to interact with this level. Instead, the mvcc store-level transaction interface [1] should be enough (and it is kept unchanged).
But I will add some comments to try to clarify the implementation.
[1]
Lines 55 to 57 in 0de9b8a
// TxnRead represents a read-only transaction with operations that will not
// block other read transactions.
type TxnRead interface {
Okay, let's loop back on this interface later. The comments should be sufficient for now.
(force-pushed 2ec1a25 to a823555)
lgtm
From Travis CI:
Some whitespace difference?
(force-pushed a823555 to 55066eb)
@jpbetz Updated. PTAL.
Great work! LGTM
@jingyih Can you check the CI failures?
Travis CI failed with: === RUN TestLeasingGetWithOpts
The stack trace looks basically the same as the known flaky "TestLeasingXXX" tests listed in #10700. I don't think it is related to this PR.
SemaphoreCI failed with a known flaky test, TestCtlV3AuthFromKeyPerm.
This is awesome to see landing before the 3.4 release - great work @jingyih, and thanks a lot to the reviewers for making it happen. Just to answer one thing (sorry, 3 weeks on vacation):
@xiang90 - note that those aren't etcd latencies, but apiserver latencies. The latencies are affected mostly by apiserver-related things. That experimental image obviously doesn't help with what the apiserver is doing, thus the numbers are not as small as we would like to see. :)
This PR solves the "large read blocking write" issue (#10525) in mvcc/backend, as an alternative approach to #9384. The idea is to make ALL reads concurrent, thereby avoiding the need to predict and postpone large reads.
For the full story, please refer to the "etcd: Fully Concurrent Reads Design Proposal".
TL;DR
In the presence of long-running reads, this PR increases the throughput of other requests (short reads and writes) by about 70% and reduces their P99 and P999 latency by about 90%. In other benchmark cases, where there are no long-running reads, performance with and without this PR is comparable. (See the later sections for detailed results.)
What does this PR solve
Concurrency bottleneck in the current backend implementation
Any ongoing kvstore read transaction holds an RWMutex read lock [1], which blocks any modification to the backend read buffer. There are two places where a backend write transaction needs to acquire the write lock and modify the read buffer ([2], [3]).
Therefore a long read transaction can potentially block a write and the reads that follow it. Note that this blocking is necessary to provide the linearizability guarantee; simply removing the locking will not work. The backend needs to make sure the combined view of read buffer + boltdb Tx is always consistent. A minimal illustration of this blocking follows the references below.
Ref:
[1]
etcd/mvcc/kvstore_txn.go
Lines 32 to 34 in 886d30d
etcd/mvcc/kvstore_txn.go
Lines 53 to 54 in 886d30d
[2]
etcd/mvcc/backend/batch_tx.go
Lines 278 to 280 in 886d30d
[3]
etcd/mvcc/backend/batch_tx.go
Lines 302 to 304 in 886d30d
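As referenced above, here is a minimal standalone Go program (not etcd code) illustrating the blocking pattern: a long-running holder of an RWMutex read lock delays a writer, and Go's writer-preferring RWMutex then also delays every reader that arrives behind the waiting writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Long read: stands in for an expensive range query holding the read lock.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		time.Sleep(2 * time.Second)
	}()
	time.Sleep(100 * time.Millisecond) // let the long read acquire the lock first

	// Write: must wait for the long read to release its read lock.
	start := time.Now()
	mu.Lock()
	fmt.Println("write acquired lock after", time.Since(start)) // roughly 1.9s
	mu.Unlock()
}
```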
How this PR addresses the bottleneck
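Based on the review discussion above, the core of the change can be sketched roughly as follows: each read gets its own copy of the read buffer and a reference to the shared boltdb read Tx, taken while briefly holding the backend's read-Tx lock; after that, the read proceeds without holding any shared lock, and a WaitGroup lets the write path roll back the old boltdb Tx only once all of its readers are done. Field and method names below (buf.unsafeCopy, txWg, and so on) are assumptions, not a quote of the PR's code.

```go
// Sketch only; the concurrentReadTx type in this PR lives in mvcc/backend/read_tx.go.
type concurrentReadTx struct {
	buf  txReadBuffer    // private copy of the backend's read buffer
	tx   *bolt.Tx        // shared boltdb read Tx at creation time
	txWg *sync.WaitGroup // tracks readers still using tx
}

// ConcurrentReadTx holds the backend read-Tx lock only long enough to copy the
// buffer and register on the WaitGroup; long reads then run without blocking
// writes or other reads.
func (b *backend) ConcurrentReadTx() ReadTx {
	b.readTx.RLock()
	defer b.readTx.RUnlock()
	b.readTx.txWg.Add(1)
	return &concurrentReadTx{
		buf:  b.readTx.buf.unsafeCopy(),
		tx:   b.readTx.tx,
		txWg: b.readTx.txWg,
	}
}
```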
Benchmark result (PR rebased on 5/8/2019)
Setup
Generated 100K keys
Throughput benchmark
1. Short reads
Result is comparable.
2. Writes
Result is comparable.
#10523 (comment)
3. Short reads + Writes
Result is comparable; this PR is slightly worse in the P99 and P999 read latency distribution.
This is likely the worst-case scenario for this PR: since the server is overloaded by writes, the backend read buffer is loaded with pending writes.
#10523 (comment)
4. Long reads + Writes
Result: this PR increased write throughput by 69.5%, reduced P99 write latency by 90.8%, and reduced P999 write latency by 85.3%.
This is the scenario this PR tries to improve.
#10523 (comment)
5. Long reads + (short) Reads + Writes
Result: in the presence of long, expensive reads, short read throughput is increased by 77.5%, P99 read latency is reduced by 93.0%, and P999 read latency is reduced by 82.9%; write throughput is increased by 85.3%, P99 write latency is reduced by 94.4%, and P999 write latency is reduced by 83.9%.
#10523 (comment)
Kubernetes scalability test result
Posted here: #10523 (comment)