mvcc: fully concurrent read #10523
Conversation
Benchmark setup and results: write response time histograms and long read response time histograms, for the baseline and for this PR.
Please ignore the conflict; this PR needs a rebase onto #10506.
@jingyih Have you done a throughput benchmark? Copying the cache might affect throughput but not the latency.
We do plan to get more benchmark results. Do you have a suggestion on what kind of workload we should use? Copying the cache will affect the system most when the cache size is large, i.e., when the system is write heavy. We plan to design a workload to simulate the worst case. We are also thinking about adding metrics (only for debugging purposes) to measure the cache size and the time it takes to make the copy.
Edit (5/9/2019): this statement is not correct.
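For illustration only, a debug-level metric for the copy cost could look roughly like the sketch below, using the Prometheus client etcd already uses for its metrics. The metric name, buckets, and the unsafeCopy call are assumptions made for the example, not code from this PR.

```go
package backend

import "github.com/prometheus/client_golang/prometheus"

// Illustrative, debug-only histogram for read-buffer copy time (hypothetical name).
var readBufCopyDurationSec = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "etcd_debugging",
	Subsystem: "mvcc",
	Name:      "read_buffer_copy_duration_seconds",
	Help:      "Time taken to copy the backend read buffer for a concurrent read transaction.",
	Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 14), // 0.1ms .. ~0.8s
})

func init() { prometheus.MustRegister(readBufCopyDurationSec) }

// Usage around the (hypothetical) buffer copy:
//   start := time.Now()
//   buf := b.readTx.buf.unsafeCopy()
//   readBufCopyDurationSec.Observe(time.Since(start).Seconds())
```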
(force-pushed 6bff469 to 9da1f23)
(force-pushed 9da1f23 to a73fb85)
Rebased onto master.
Throughput comparison: short reads
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR: summary, response time histogram, and latency distribution.
Throughput comparison: writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR: summary, response time histogram, and latency distribution.
Throughput comparison: short reads + writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR, with reads and writes reported separately: summary, response time histogram, and latency distribution.
Throughput comparison: long reads + writes
(Note that the requested ops per second is higher than the actual served ops per second.) Results for Master and for Master + PR, with long reads and writes reported separately: summary, response time histogram, and latency distribution.
@xiang90
Throughput comparison: long reads + (short) reads + writes
Results for Master and for Master + PR, with long reads, (short) reads, and writes reported separately: summary, response time histogram, and latency distribution.
GCE 5k node scalability test result
Sig-scalability lead @wojtek-t helped me run the test and provided a quick analysis.
Result compared to baseline
Using test run 1130383116527996934 as the baseline. It is right before 1130745634945503235.
The P99 of POST pod is still bad. Probably worth some effort to dig into.
My understanding is that there was a scalability regression back then; the scalability team spent a couple of weeks trying to root-cause it. (Not sure about the current status.) This test was run during their debugging effort. The P99 of POST pod is bad, but not due to this PR.
(force-pushed 119134b to 19ee3a9)
Add TestConcurrentReadTxAndWrite, which creates random reads and writes and ensures reads always see the latest writes.
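For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the invariant such a test checks: a read that starts after a write has fully committed must observe that write (or a later one). The toyStore type and all names below are illustrative stand-ins, not the actual TestConcurrentReadTxAndWrite added in this PR, which exercises the real mvcc store with random reads and writes.

```go
package toystore

import (
	"sync"
	"sync/atomic"
	"testing"
)

// toyStore is an illustrative stand-in for the mvcc store.
type toyStore struct {
	mu  sync.RWMutex
	val int64
}

func (s *toyStore) Put(v int64) { s.mu.Lock(); s.val = v; s.mu.Unlock() }
func (s *toyStore) Get() int64  { s.mu.RLock(); defer s.mu.RUnlock(); return s.val }

func TestConcurrentReadAndWrite(t *testing.T) {
	s := &toyStore{}
	var committed int64 // highest value the writer has fully committed

	done := make(chan struct{})
	go func() { // single writer
		defer close(done)
		for v := int64(1); v <= 10000; v++ {
			s.Put(v)
			atomic.StoreInt64(&committed, v)
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // concurrent readers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				default:
				}
				// Any value committed before this read started must be visible.
				min := atomic.LoadInt64(&committed)
				if got := s.Get(); got < min {
					t.Errorf("read %d, want at least %d", got, min)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```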
(force-pushed 19ee3a9 to 2a9320e)
Yes, that is what I would expect.
I think locally it takes between 1 and 2 seconds to finish. On Travis I did not see timing for each individual unit test. I compared the finishing time of the mvcc package with and without the new test; it is comparable.
LGTM. Maybe @jpbetz wants to give this a final review.
Thanks! @xiang90
mvcc/backend/backend.go
Outdated
@@ -493,6 +518,7 @@ func (b *backend) begin(write bool) *bolt.Tx {
	db := tx.DB()
	atomic.StoreInt64(&b.size, size)
	atomic.StoreInt64(&b.sizeInUse, size-(int64(db.Stats().FreePageN)*int64(db.Info().PageSize)))
nit: Since the db.Stats() call acquires a read lock, only call it once here (instead of once for FreePageN and once for OpenTxN)?
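For clarity, the suggestion amounts to something like the fragment below, taking a single Stats() snapshot and reusing it; the openReadTxN field is an assumption about where the OpenTxN value ends up, not a quote of the final diff.

```go
// Take one Stats() snapshot so the stats read lock is acquired only once.
stats := db.Stats()
atomic.StoreInt64(&b.size, size)
atomic.StoreInt64(&b.sizeInUse, size-(int64(stats.FreePageN)*int64(db.Info().PageSize)))
atomic.StoreInt64(&b.openReadTxN, int64(stats.OpenTxN)) // assumed field tracking open read Txs
```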
Sounds good.
mvcc/backend/batch_tx.go
Outdated
@@ -323,6 +319,17 @@ func (t *batchTxBuffered) unsafeCommit(stop bool) {
	}
}

func waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup, lg *zap.Logger) {
Should this be a receiver function of batchTxBuffered? The parameters don't all appear necessary.
tx and wg need to be passed into the function. batchTxBuffered's tx and wg will be reset / re-created immediately after this function call, whereas this function waits until all ongoing reads using the input tx are done.
Making this a receiver and moving the logger out of the params might make that even more clear.
Did you mean something like func (lg *zap.Logger) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup)? I have mixed feelings about this. By convention (in this repo), the zap logger is passed to functions as an input parameter.
I was thinking more like func (t *batchTxBuffered) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup), so that the only params that need to be passed are the ones that are semantically important to the call. But this is minor.
Or just inline the function in unsafeCommit? It doesn't seem to benefit from being separate.
Good idea.
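A rough sketch of what the inlined version might look like, within the existing mvcc/backend package (field and method names such as readTx.txWg, readTx.reset, and backend.lg are assumptions based on this thread, not the exact code that landed):

```go
func (t *batchTxBuffered) unsafeCommit(stop bool) {
	if t.backend.readTx.tx != nil {
		// Wait for all ongoing reads on the old read Tx to finish, then roll it
		// back, without blocking creation of the new read transaction below.
		go func(tx *bolt.Tx, wg *sync.WaitGroup) {
			wg.Wait()
			if err := tx.Rollback(); err != nil {
				t.backend.lg.Fatal("cannot rollback tx", zap.Error(err))
			}
		}(t.backend.readTx.tx, t.backend.readTx.txWg)
		t.backend.readTx.reset()
	}

	t.batchTx.commit(stop)

	if !stop {
		t.backend.readTx.tx = t.backend.begin(false)
	}
}
```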
mvcc/backend/read_tx.go
Outdated
func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock()  {}
I'm a bit surprised we don't do anything in RLock(). Should we try to adhere to the lock contract more closely (i.e. move logic from ConcurrentReadTx to here)? If not, we should clearly document what's going on.
Should Lock() and Unlock() be no-ops? Alternatives would be to delegate to RLock()/RUnlock() or to panic.
This one is tricky. ConcurrentReadTx() holds a lock while it is being created (and everything needs to happen while holding that lock, so we cannot really move part of the logic to a separate function). After creation, there is no need for, or concept of, locking and unlocking. So these functions end up being no-ops. They only exist so that *concurrentReadTx implements the ReadTx interface.
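Concretely, the documented no-op methods might look roughly like the sketch below; the comment wording, the RUnlock body, and the txWg field name are illustrative assumptions based on this discussion, not the final code.

```go
// Lock, Unlock, and RLock are no-ops for concurrentReadTx: all locking happens
// while the transaction is being created inside ConcurrentReadTx(), which runs
// with the backend's read-Tx lock held. After creation there is nothing left to
// lock; these methods exist only so that *concurrentReadTx satisfies ReadTx.
func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock()  {}

// RUnlock signals that this read is done, so the write path can roll back the
// underlying boltdb read Tx once all readers sharing it have finished.
func (rt *concurrentReadTx) RUnlock() { rt.txWg.Done() } // txWg is an assumed field name
```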
Adding some comments with what you just said would certainly make it easier to understand.
The violation of the ReadTx interface guarantees does make for some odd behavior, e.g.:
tx := ConcurrentReadTx()
tx.RLock()
// do something
tx.RUnlock()
tx.RLock()
// do something
tx.RUnlock()
This would certainly not do what a developer might expect.
I agree. I am not happy with the current state. The current implementation of ConcurrentReadTx is compromised in readability because I wanted to make minimal changes to the various other existing interfaces. (I tried to define a new interface for the concurrent read Tx; it caused ripple effects and the end result was kind of messy.) Fortunately the ReadTx interface is used internally in the mvcc backend; developers normally do not need to interact with this level. Instead, the mvcc store-level transaction interface [1] should be enough (and it is kept unchanged).
But I will add some comments to try to clarify the implementation.
[1]
Lines 55 to 57 in 0de9b8a
// TxnRead represents a read-only transaction with operations that will not
// block other read transactions.
type TxnRead interface {
Okay, let's loop back on this interface later. The comments should be sufficient for now.
(force-pushed 2ec1a25 to a823555)
lgtm
From Travis CI:
Some whitespace difference?
(force-pushed a823555 to 55066eb)
@jpbetz Updated. PTAL.
Great work! LGTM
@jingyih Can you check the CI failures?
Travis CI failed with: === RUN TestLeasingGetWithOpts
The stack trace looks basically the same as the known flaky "TestLeasingXXX" tests listed in #10700. I don't think it is related to this PR.
SemaphoreCI failed with a known flaky test, TestCtlV3AuthFromKeyPerm.
This is awesome to see landing before the 3.4 release - great work @jingyih, and thanks a lot to the reviewers for making it happen. Just to answer one thing (sorry, 3 weeks on vacation):
@xiang90 - note that those aren't etcd latencies, but apiserver latencies. The latencies are affected mostly by apiserver-related things. That experimental image obviously doesn't help with what the apiserver is doing, thus the numbers are not as small as we would like to see. :)
This PR solves the "large read blocking write" issue (#10525) in mvcc/backend, as an alternative approach to #9384. The idea is to make ALL reads concurrent, thereby avoiding the need to predict and postpone large reads.
For the full story, please refer to the "etcd: Fully Concurrent Reads Design Proposal".
TL;DR
In the presence of long-running reads, this PR increases the throughput of other requests (short reads and writes) by about 70% and reduces their P99 and P999 latency by about 90%. In other benchmark cases, where there are no long-running reads, performance with and without this PR is comparable. (See the later sections for detailed results.)
What does this PR solve
Concurrency bottleneck in the current backend implementation
Any ongoing kvstore read transaction holds an RWMutex read lock [1], which blocks any modification to the backend read buffer. There are two places where a backend write transaction needs to acquire the write lock and modify the read buffer ([2], [3]).
Therefore a long read transaction can potentially block a write and the reads that follow it. Note that this blocking is necessary to provide the linearizability guarantee; simply removing the locking will not work. The backend needs to make sure the combined view of read buffer + boltdb Tx is always consistent. A minimal illustration of this blocking follows the references below.
Ref:
[1]
etcd/mvcc/kvstore_txn.go
Lines 32 to 34 in 886d30d
etcd/mvcc/kvstore_txn.go
Lines 53 to 54 in 886d30d
[2]
etcd/mvcc/backend/batch_tx.go
Lines 278 to 280 in 886d30d
[3]
etcd/mvcc/backend/batch_tx.go
Lines 302 to 304 in 886d30d
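As referenced above, here is a minimal standalone Go program (not etcd code) illustrating the blocking pattern: a long-running holder of an RWMutex read lock delays a writer, and Go's writer-preferring RWMutex then also delays every reader that arrives behind the waiting writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Long read: stands in for an expensive range query holding the read lock.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		time.Sleep(2 * time.Second)
	}()
	time.Sleep(100 * time.Millisecond) // let the long read acquire the lock first

	// Write: must wait for the long read to release its read lock.
	start := time.Now()
	mu.Lock()
	fmt.Println("write acquired lock after", time.Since(start)) // roughly 1.9s
	mu.Unlock()
}
```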
How this PR addresses the bottleneck
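Based on the review discussion above, the core of the change can be sketched roughly as follows: each read gets its own copy of the read buffer and a reference to the shared boltdb read Tx, taken while briefly holding the backend's read-Tx lock; after that, the read proceeds without holding any shared lock, and a WaitGroup lets the write path roll back the old boltdb Tx only once all of its readers are done. Field and method names below (buf.unsafeCopy, txWg, and so on) are assumptions, not a quote of the PR's code.

```go
// Sketch only; the concurrentReadTx type in this PR lives in mvcc/backend/read_tx.go.
type concurrentReadTx struct {
	buf  txReadBuffer    // private copy of the backend's read buffer
	tx   *bolt.Tx        // shared boltdb read Tx at creation time
	txWg *sync.WaitGroup // tracks readers still using tx
}

// ConcurrentReadTx holds the backend read-Tx lock only long enough to copy the
// buffer and register on the WaitGroup; long reads then run without blocking
// writes or other reads.
func (b *backend) ConcurrentReadTx() ReadTx {
	b.readTx.RLock()
	defer b.readTx.RUnlock()
	b.readTx.txWg.Add(1)
	return &concurrentReadTx{
		buf:  b.readTx.buf.unsafeCopy(),
		tx:   b.readTx.tx,
		txWg: b.readTx.txWg,
	}
}
```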
Benchmark result (PR rebased on 5/8/2019)
Setup
Generated 100K keys
Throughput benchmark
1. Short reads
Result is comparable.
2. Writes
Result is comparable.
#10523 (comment)
3. Short reads + Writes
Result is comparable; this PR is slightly worse in the P99 and P999 read latency distribution.
This is likely the worst-case scenario for this PR: since the server is overloaded by writes, the backend read buffer is loaded with pending writes.
#10523 (comment)
4. Long reads + Writes
Result: this PR increased write throughput by 69.5%, reduced P99 write latency by 90.8%, and reduced P999 write latency by 85.3%.
This is the scenario this PR tries to improve.
#10523 (comment)
5. Long reads + (short) Reads + Writes
Result: in the presence of long, expensive reads, short read throughput is increased by 77.5%, P99 read latency is reduced by 93.0%, and P999 read latency is reduced by 82.9%; write throughput is increased by 85.3%, P99 write latency is reduced by 94.4%, and P999 write latency is reduced by 83.9%.
#10523 (comment)
Kubernetes scalability test result
Posted here: #10523 (comment)