
Reuse events used for syncing watchers #17563

Merged: 2 commits merged into etcd-io:main from the sync-reuse branch on Dec 2, 2024

Conversation

@serathius (Member) commented Mar 10, 2024

Part of #16839
Improve etcd behavior during watch congestion by reducing the memory needed.
Congestion causes watch requests to back up and the resync loop to allocate large amounts of memory. This patch has been shown to reduce the memory needed by 5-10 times.

| Reuse events | Object size [KB] | etcd memory [MB] | Latency 50%ile | Latency 90%ile | Latency 99%ile |
|---|---|---|---|---|---|
| FALSE | 5 | 111.72 | 0.1005 | 0.1543 | 0.1856 |
| TRUE | 5 | 109.096 | 0.1012 | 0.1556 | 0.1874 |
| FALSE | 10 | 1103.924 | 2.7177 | 7.3661 | 11.9127 |
| TRUE | 10 | 291.184 | 2.6623 | 7.687 | 12.5456 |
| FALSE | 100 | 5138.712 | 14.0618 | 26.2721 | 29.1691 |
| TRUE | 100 | 641.844 | 13.9372 | 25.8267 | 28.519 |

Tested using the command `go run ./tools/benchmark/main.go watch-latency --watch-per-stream 1000 --streams 1 --put-total 200 --val-size 100000`. Parameters were adjusted so the total object size is 20 MB.
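
For readers following the thread, here is a minimal, self-contained Go sketch of the idea behind the change (simplified types and hypothetical names, not the actual etcd code): the event backlog is read from the backend once, bounded by the slowest unsynced watcher, and each watcher is then served a sub-slice of that shared backlog instead of a freshly allocated event slice.

```go
package main

import "fmt"

// Event and watcher are simplified stand-ins for mvccpb.Event and the
// watchable_store watcher type; this is an illustration, not etcd code.
type Event struct {
	Rev int64
	Key string
}

type watcher struct {
	id     int
	minRev int64 // next revision this watcher still needs
}

// readEvents simulates a single backend range read for revisions [minRev, curRev].
func readEvents(minRev, curRev int64) []Event {
	evs := make([]Event, 0, curRev-minRev+1)
	for rev := minRev; rev <= curRev; rev++ {
		evs = append(evs, Event{Rev: rev, Key: fmt.Sprintf("k%d", rev)})
	}
	return evs
}

// syncWatchers reads the backlog once and reuses it for every watcher.
func syncWatchers(ws []*watcher, curRev int64) {
	// Bound a single read by the slowest watcher so it covers everyone.
	minRev := curRev + 1
	for _, w := range ws {
		if w.minRev < minRev {
			minRev = w.minRev
		}
	}
	evs := readEvents(minRev, curRev) // one allocation shared by all watchers

	for _, w := range ws {
		// Each watcher gets a sub-slice of the shared backlog, no per-watcher copy.
		start := w.minRev - minRev
		if start < 0 || start >= int64(len(evs)) {
			continue
		}
		fmt.Printf("watcher %d: %d events starting at rev %d\n", w.id, len(evs[start:]), evs[start].Rev)
		w.minRev = curRev + 1
	}
}

func main() {
	ws := []*watcher{{id: 1, minRev: 3}, {id: 2, minRev: 7}}
	syncWatchers(ws, 10)
}
```

The intuition, as I read the description, is that under congestion the resync loop repeats this work for many backed-up watchers, so sharing a single backlog read avoids paying the allocation once per watcher.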

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@chaochn47 (Member)

Limit event prefix by slowest watch instead of compaction.

Do we expect to send events to compacted watchers (the slowest ones) though? I think it should not. WDYT?

@serathius (Member, Author)

Do we expect to send events to compacted watchers (the slowest ones) though? I think it should not. WDYT?

I expect that the minRev and rev of the victims are above compactedRev; if not, we can just add a protection.
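
As an illustration, here is a sketch of such a protection (the function name is hypothetical, and it reuses the simplified watcher type from the sketch earlier in the thread): watchers whose minRev has already been compacted are dropped before they can bound the shared event read.

```go
// filterCompacted is hypothetical; in etcd the watcher would instead be
// canceled with a compaction error on its watch stream.
func filterCompacted(ws []*watcher, compactedRev int64) []*watcher {
	kept := ws[:0]
	for _, w := range ws {
		if w.minRev <= compactedRev {
			// The revisions this watcher still needs no longer exist, so it
			// must not participate in (or bound) the sync.
			fmt.Printf("watcher %d dropped: rev %d already compacted (compactedRev=%d)\n",
				w.id, w.minRev, compactedRev)
			continue
		}
		kept = append(kept, w)
	}
	return kept
}
```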


codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.71%. Comparing base (78885f6) to head (348c0cb).
Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/storage/mvcc/watchable_store.go | 97.14% | 1 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| server/storage/mvcc/watchable_store.go | 93.91% <97.14%> (+0.41%) ⬆️ |

... and 22 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #17563      +/-   ##
==========================================
- Coverage   68.80%   68.71%   -0.09%     
==========================================
  Files         420      420              
  Lines       35599    35619      +20     
==========================================
- Hits        24494    24476      -18     
- Misses       9675     9712      +37     
- Partials     1430     1431       +1     


@serathius changed the title from "[Draft] Reuse events used for syncing watchers" to "Reuse events used for syncing watchers" on Nov 29, 2024
@serathius (Member, Author) commented Nov 29, 2024

Ready to review
ping @ahrtr

@serathius (Member, Author)

/retest

Comment on lines +245 to +252
{minRev: 2, maxRev: 3, expectEvents: expectEvents[0:1]},
{minRev: 2, maxRev: 4, expectEvents: expectEvents[0:2]},
Member

These two cases are duplicated? See lines 238-239. Would it be better to sort the cases by minRev or maxRev?

@serathius (Member, Author)

Yes, it's intentional. We group multiple cases into scenarios; I added comments to make them easier to see. This better tests the reuse, as the function call has side effects that modify evs.
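
To make the grouping concrete, here is a purely illustrative, self-contained sketch (the fillEvents helper is hypothetical, not the real test code): each scenario runs its cases in order against the same evs slice, so a (minRev, maxRev) pair that also appears in another scenario still exercises a different state of the reused slice.

```go
package main

import "fmt"

type Event struct{ Rev int64 }

// fillEvents stands in for the call under test: it reuses the passed-in
// slice's backing array when it can, which is the side effect on evs that
// the reply above refers to.
func fillEvents(evs []Event, minRev, maxRev int64) []Event {
	evs = evs[:0]
	for rev := minRev; rev < maxRev; rev++ {
		evs = append(evs, Event{Rev: rev})
	}
	return evs
}

func main() {
	scenarios := [][]struct{ minRev, maxRev int64 }{
		{{2, 3}, {2, 4}}, // grows the reused slice across cases
		{{2, 4}, {2, 3}}, // same pairs in another scenario, different reuse state
	}
	for i, cases := range scenarios {
		var evs []Event // shared by the cases of one scenario
		for _, c := range cases {
			evs = fillEvents(evs, c.minRev, c.maxRev)
			fmt.Printf("scenario %d: [%d,%d) -> %d events (cap %d)\n", i, c.minRev, c.maxRev, len(evs), cap(evs))
		}
	}
}
```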

@serathius force-pushed the sync-reuse branch 4 times, most recently from 9b39f3f to ebb0f64, on December 2, 2024 at 17:04
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@ahrtr (Member) left a comment

LGTM

It would be wonderful if we had more performance data (e.g. from a K8s scale test) to showcase the benefit of such an improvement.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius (Member, Author)

It would be wonderful if we had more performance data (e.g. from a K8s scale test) to showcase the benefit of such an improvement.

I don't think K8s scale tests come anywhere close to pushing etcd watch into congestion; if they did, K8s would not work. See kubernetes/kubernetes#123448

That's why I'm using benchmarks. They are pretty good; however, without an easily accessible metric they are not used. I'm thinking about building an MVP to run a benchmark and upload the results to the K8s perf-dash, but the problem is that it has a very strict schema requirement, which doesn't match etcd.

@serathius serathius merged commit 6fa7342 into etcd-io:main Dec 2, 2024
34 checks passed