server: make applier use ReadTx() in Txn() instead of ConcurrentReadTx() #12896
Conversation
More performance evaluation will be done on this.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #12896     +/-  ##
==========================================
- Coverage   69.60%    63.18%   -6.42%
==========================================
  Files         411       405       -6
  Lines       33421     33439      +18
==========================================
- Hits        23263     21129    -2134
- Misses       8208     10145    +1937
- Partials     1950      2165     +215
```
This is the result compared with the master branch.
Really good numbers. @jingyih - you are an expert on concurrent read transactions; could you please take a look?
(force-pushed d57e8ea to 457c6c1)
I am doing another mixed read/write workload performance evaluation. I will update with the data after around 48 hours.
(force-pushed 457c6c1 to 05170c2)
I just realized that there is an issue with my code: for read-only Txns we were still using ReadTx(), when in fact we should use ConcurrentReadTx() there. I updated the patch and am doing another round of performance evaluation.
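To illustrate the distinction, here is a minimal sketch of the decision being described; the type and function names are illustrative, not etcd's actual identifiers. A read-only Txn takes the fully concurrent read transaction, while a Txn that contains writes reuses the plain read transaction so the shared buffer is never copied.

```go
package sketch

// ReadTxn abstracts a backend read transaction; both variants satisfy it.
type ReadTxn interface {
	UnsafeRange(key, end []byte, limit int64) (keys [][]byte, values [][]byte)
}

// Backend mirrors the two ways of obtaining a read transaction.
type Backend interface {
	ReadTx() ReadTxn           // shared buffer; cheap to obtain, but contends with writes
	ConcurrentReadTx() ReadTxn // copies the read buffer up front; fully concurrent afterwards
}

// readTxnFor picks the transaction type for an incoming Txn request.
// isReadOnly stands in for inspecting the request for write operations.
func readTxnFor(b Backend, isReadOnly bool) ReadTxn {
	if isReadOnly {
		// Reads (possibly long-running) benefit from the copied buffer.
		return b.ConcurrentReadTx()
	}
	// A write txn holds the lock briefly anyway; skip the buffer copy.
	return b.ReadTx()
}
```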
(force-pushed 05170c2 to 8d8d037)
@wilsonwang371 What does "performance" in the heat graph mean? Is that TPS? Can you share more details about the provisioning setup? Single node, or multi-node?
Hi Gyuho, to avoid networking interference, I always use a single node in this patch's performance evaluation. The yellow-green heat map is the txn benchmark throughput (queries per second, QPS) of either the master branch or my patch. The red-blue heat map is the QPS difference between the master branch and my patch. I shared my Python script for plotting the data, as well as the test data details, in #12692; you can get more information there. Let me know if you have other questions.
@wilsonwang371 That's awesome!
Have we run another round of performance tests yet? Once we confirm, we can merge.
It will take another 24 hours to generate the data. I will post it once available.
@wilsonwang371 Very cool. Can you share more details on the machine spec and how you generated the client-side workloads? Are we just using the benchmark tool?
The machine setup is: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 256GB memory, NVMe SSD 2T * 2, Mellanox CX-5. The benchmark tool is etcd's default benchmark tool; I just made some small modifications so that it can send both read and write txn requests. The client and server run on the same machine.
(force-pushed 53fafc3 to 5615421)
If I understand this correctly, we get up to 2x throughput due to increased concurrency for writes, because we make reads fully concurrent?
I think it's a safe trade-off to make, given that we already did a round of optimizations for kube-apiserver reads (list and watch) in 3.4? Or it could be noise. I think the increased concurrency in writes should be good enough to compensate for this.
@wilsonwang371 Could you share the exact ...
In this graph, ReadTxn/WriteTxn = 0.125 = 1/8. That means we are doing 8 txn writes for every 1 txn read (each op is a read with probability ratio/(1+ratio) = 0.125/1.125 = 1/9, i.e. 1 read per 8 writes). The gain comes from avoiding the shared-buffer copy, which speeds up the txn writes. I am using the etcd benchmark for this, but I made some small changes to the benchmark tool so that it can do both txn-range and txn-put at the same time. Here is the new file I added to the benchmark.

```go
// Copyright 2017 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cmd

import (
	"context"
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
	"os"
	"time"

	v3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/pkg/v3/report"

	"github.com/spf13/cobra"
	"golang.org/x/time/rate"
	"gopkg.in/cheggaaa/pb.v1"
)

// mixedTxnCmd represents the mixed-txn command
var mixedTxnCmd = &cobra.Command{
	Use:   "mixed-txn key [end-range]",
	Short: "Benchmark a mixed load of txn-put & txn-range",

	Run: mixedTxnFunc,
}

var (
	mixedTxnTotal          int
	mixedTxnRate           int
	mixedTxnOpsPerTxn      int
	mixedTxnReadWriteRatio float64
	writeCount             uint64
	readCount              uint64
)

func init() {
	RootCmd.AddCommand(mixedTxnCmd)
	mixedTxnCmd.Flags().IntVar(&keySize, "key-size", 8, "Key size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&valSize, "val-size", 8, "Value size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnOpsPerTxn, "txn-ops", 1, "Number of ops per txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnRate, "rate", 0, "Maximum txns per second (0 is no limit)")
	mixedTxnCmd.Flags().IntVar(&mixedTxnTotal, "total", 10000, "Total number of txn requests")
	mixedTxnCmd.Flags().IntVar(&keySpaceSize, "key-space-size", 1, "Maximum possible keys")
	mixedTxnCmd.Flags().StringVar(&rangeConsistency, "consistency", "l", "Linearizable(l) or Serializable(s)")
	mixedTxnCmd.Flags().Float64Var(&mixedTxnReadWriteRatio, "rw-ratio", 1, "Read/write ops ratio")
}

func mixedTxnFunc(cmd *cobra.Command, args []string) {
	if keySpaceSize <= 0 {
		fmt.Fprintf(os.Stderr, "expected positive --key-space-size, got (%v)", keySpaceSize)
		os.Exit(1)
	}
	if mixedTxnOpsPerTxn > keySpaceSize {
		fmt.Fprintf(os.Stderr, "expected --txn-ops no larger than --key-space-size, "+
			"got txn-ops(%v) key-space-size(%v)\n", mixedTxnOpsPerTxn, keySpaceSize)
		os.Exit(1)
	}

	if rangeConsistency == "l" {
		fmt.Println("bench with linearizable range")
	} else if rangeConsistency == "s" {
		fmt.Println("bench with serializable range")
	} else {
		fmt.Fprintln(os.Stderr, cmd.Usage())
		os.Exit(1)
	}

	requests := make(chan []v3.Op, totalClients)
	if mixedTxnRate == 0 {
		mixedTxnRate = math.MaxInt32
	}
	limit := rate.NewLimiter(rate.Limit(mixedTxnRate), 1)
	clients := mustCreateClients(totalClients, totalConns)
	k, v := make([]byte, keySize), string(mustRandBytes(valSize))

	bar = pb.New(mixedTxnTotal)
	bar.Format("Bom !")
	bar.Start()

	r := newReport()
	for i := range clients {
		wg.Add(1)
		go func(c *v3.Client) {
			defer wg.Done()
			for ops := range requests {
				limit.Wait(context.Background())
				st := time.Now()
				_, err := c.Txn(context.TODO()).Then(ops...).Commit()
				r.Results() <- report.Result{Err: err, Start: st, End: time.Now()}
				bar.Increment()
			}
		}(clients[i])
	}

	go func() {
		for i := 0; i < mixedTxnTotal; i++ {
			ops := make([]v3.Op, mixedTxnOpsPerTxn)
			for j := 0; j < mixedTxnOpsPerTxn; j++ {
				op := v3.Op{}
				if rand.Float64() < mixedTxnReadWriteRatio/(1+mixedTxnReadWriteRatio) {
					opts := []v3.OpOption{v3.WithRange("")}
					if rangeConsistency == "s" {
						opts = append(opts, v3.WithSerializable(), v3.WithPrevKV(), v3.WithPrefix())
					}
					op = v3.OpGet("rangeKey", opts...)
					readCount++
				} else {
					binary.PutVarint(k, int64(((i*mixedTxnOpsPerTxn)+j)%keySpaceSize))
					op = v3.OpPut(string(k), v)
					writeCount++
				}
				ops[j] = op
			}
			requests <- ops
		}
		close(requests)
	}()

	rc := r.Run()
	wg.Wait()
	close(r.Results())
	bar.Finish()
	fmt.Printf("READ: %d, WRITE: %d\n", readCount, writeCount)
	fmt.Println(<-rc)
}
```
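For reference, the test script further down invokes this subcommand roughly like so; the connection counts and totals below are just example values from the ranges the script sweeps:

```sh
${etcd_path}/tools/benchmark/benchmark mixed-txn "" \
    --conns=64 --clients=64 \
    --total=$((1024*256)) \
    --endpoints "http://127.0.0.1:23790" \
    --rw-ratio 0.125
```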
Yes, you are right. Now I have read how we create different types of transactions for write txns.
Here is the test script that I am using.

```bash
#!/bin/bash
set -xe;

etcd_path="/home/wilson.wang/etcd-master";
run_path="/data00/etcd-tests";
run_count=$((1024*256));

# standard k8s setting
limit_count=$((1024*256));
keys_count=$((1024*256));
backend_size=$((20*1024*1024*1024));

pushd ${run_path};

trap ctrl_c INT
function ctrl_c() {
    for i in $(ps aux | grep -v grep | grep etcd | awk "{print \$2}");
    do
        kill -9 $i;
    done;
    popd;
    exit 0;
}

for ratio_str in 1/8 1/4 1/2 2/1 4/1 8/1 100/1;
do
    ratio=$(echo "scale=3; ${ratio_str}" | bc -l);
    for vallen in $(seq 4 10);
    do
        for i in $(seq 5 12);
        do
            rm -rf default.etcd/;
            ${etcd_path}/bin/etcd --quota-backend-bytes=${backend_size} \
                --listen-client-urls http://0.0.0.0:23790 \
                --advertise-client-urls http://127.0.0.1:23790 \
                --log-level 'error' &
            pid=$!;
            sleep 6;
            conn=$((2**$i));
            valsize=$((2**(4 + $vallen)));
            ${etcd_path}/tools/benchmark/benchmark put --sequential-keys \
                --key-space-size=${keys_count} \
                --val-size=${valsize} --key-size=256 \
                --endpoints "http://127.0.0.1:23790" \
                --total=${keys_count} &>/dev/null ;
            sleep 6;
            line="${ratio}, ${conn}, ${valsize}";
            for j in $(seq 0 4);
            do
                tmp=$(${etcd_path}/tools/benchmark/benchmark mixed-txn "" \
                    --conns=${conn} --clients=${conn} \
                    --total=${run_count} \
                    --endpoints "http://127.0.0.1:23790" \
                    --rw-ratio ${ratio} \
                    2>/dev/null | grep Requests | awk "{print \$2}");
                line="$line, ${tmp}";
                sleep 20;
            done;
            echo ${line};
            kill -9 ${pid};
            sleep 5;
        done;
    done;
done;

popd;
```
@ptabor @jingyih I am inclined to accept this change.
Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload. Actually, the Kubernetes etcd transactions are 100% read-write transactions (compact + guaranteed-update calls) -- Please comment if you have any concerns about making this change the default mode.
The reason I posted this new patch is that I want to separate the concern of txn optimization from the one proposed in #12529. In this case, if we have further optimizations in the shared buffer, we won't introduce new conflicting logic, and it will be easier to further improve etcd performance.
Ideally, after a patch similar to the one proposed in #12529, we should see performance graphs similar to the ones in #12896 (comment).
@wilsonwang371 To move forward, let's make this configurable.
It will be great if we can validate this with Kubernetes scale tests.
From the etcd 3.4 release blog post, where we made concurrent read tx the default:
We further made backend read transactions fully concurrent. Previously, ongoing long-running read transactions block writes and upcoming reads. With this change, write throughput is increased by 70% and P99 write latency is reduced by 90% in the presence of long-running reads. We also ran Kubernetes 5000-node scalability test on GCE with this change and observed similar improvements. For example, in the very beginning of the test where there are a lot of long-running “LIST pods”, the P99 latency of “POST clusterrolebindings” is reduced by 97.4%.
How about we add a flag `etcd --txn-mode` where we configure:
- `read-with-copied-buffer` (default value; use concurrent read tx no matter what)
- `write-with-shared-buffer` (optional; if the txn includes a write, fall back to the shared buffer)

That way, we can make incremental changes for #12692 with `etcd --txn-mode=only-long-read-with-copied-buffer` to not create a concurrent tx for short writes.
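Sketching roughly how such a flag could gate the decision; the flag name and wiring below are illustrative assumptions, not necessarily what etcd ships:

```go
package sketch

// ServerConfig carries the hypothetical txn-mode setting.
type ServerConfig struct {
	// TxnModeWriteWithSharedBuffer toggles the fallback described above:
	// when true, a txn containing writes uses the shared (non-copied) buffer.
	TxnModeWriteWithSharedBuffer bool
}

// useConcurrentReadTx decides whether the applier should pay the buffer copy.
func (c *ServerConfig) useConcurrentReadTx(txnIsReadOnly bool) bool {
	if txnIsReadOnly {
		return true // read-only txns always take the copied buffer
	}
	// Write txns avoid the copy unless the operator turned the mode off.
	return !c.TxnModeWriteWithSharedBuffer
}
```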
I think these results are awesome. I'm not insisting on a flag any longer (as in the long run it adds complexity). @jkaniuk FYI: about the performance impact
Sounds good. @wilsonwang371 Can we make this a flag (e.g., ...
Done. By default, using the shared buffer is on.
(force-pushed b4b4de1 to 27e3fd8)
(force-pushed 27e3fd8 to 98083ea)
lgtm thx!
Thanks @ptabor for the heads up.
As I understand from those graphs, there should be a 1.05-1.25x improvement. Great results!
nice
This is related to the transaction logic improvement in #12692. As per Piotr's comment, it might be a better choice to use ReadTx(): ReadTx() does not copy the buffer, while ConcurrentReadTx() does. Therefore, we can avoid slow transaction execution in the applier while minimizing the scope of a write transaction.
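To make the trade-off concrete, here is a minimal sketch of the two behaviors, assuming the 3.4 design quoted earlier in this thread; the types and fields are illustrative, not etcd's exact implementation. ReadTx serves reads from the shared buffer under a lock, while ConcurrentReadTx pays a one-time copy of that buffer so subsequent reads never contend with writers.

```go
package sketch

import "sync"

// buffer stands in for the backend's buffered, not-yet-committed writes.
type buffer map[string]string

type backend struct {
	mu  sync.RWMutex
	buf buffer
}

type readTx struct {
	b   *backend
	buf buffer // non-nil means this tx owns a private copy
}

// ReadTx: no copy. The caller must read under b.mu, so a long-running
// read holds the lock and contends with writers for its whole duration.
func (b *backend) ReadTx() *readTx { return &readTx{b: b} }

// ConcurrentReadTx: pay one O(len(buf)) copy while briefly holding the
// read lock, then serve every read from the copy without blocking writers.
func (b *backend) ConcurrentReadTx() *readTx {
	b.mu.RLock()
	defer b.mu.RUnlock()
	cp := make(buffer, len(b.buf))
	for k, v := range b.buf {
		cp[k] = v
	}
	return &readTx{buf: cp}
}
```

For a write txn, which holds the backend's write lock only briefly, the copy is pure overhead; skipping it (this patch) is what produces the write-throughput gains shown in the heat maps above.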