rt: refactor runtime to avoid Rc<RefCell<...>> #142
Conversation
Huh, that doesn't happen on my machine.
Hmm, actually it does; it's just dependent on the order the tests run in (maybe?)
Never mind, CLion has a bug and apparently hides test failures with double panics.
Well, that worked.
Nice. You've removed the Rc from the highest level because a driver reference is no longer kept per running Op. A running Op can reach the slab and the uring instance through the TLS variable. This doesn't completely rule out ever supporting multiple urings in the same current-thread runtime, but that would require further refactoring, as it would have anyway. It's good to differentiate the runtime in this crate from the uring instance; it seems to allow further flexibility later on.
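(For readers following the thread, here is a minimal sketch of the thread-local pattern described above. The names `Driver`, `CURRENT_DRIVER`, and `with_driver` are illustrative placeholders, not the crate's actual items.)

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the driver state described above; the real
// driver also owns the slab of in-flight ops and the uring instance.
struct Driver {
    in_flight: usize,
}

thread_local! {
    // One driver per current-thread runtime, owned by thread-local storage
    // instead of an Rc<RefCell<...>> handle cloned into every Op.
    static CURRENT_DRIVER: RefCell<Option<Driver>> = RefCell::new(None);
}

// An in-flight operation reaches the driver through the TLS slot rather than
// holding its own reference-counted copy.
fn with_driver<R>(f: impl FnOnce(&mut Driver) -> R) -> R {
    CURRENT_DRIVER.with(|slot| {
        let mut slot = slot.borrow_mut();
        let driver = slot.as_mut().expect("no uring driver on this thread");
        f(driver)
    })
}
```

The point being made is that an Op only borrows the driver for the duration of a call, so nothing needs to keep a reference count alive per operation.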
I'm going to be opening several additional refactoring PRs. I'd like to combine a lot of our TCP and Unix logic similar to what tokio does internally with PollEvented.
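(A hedged sketch of what combining the TCP and Unix logic could look like; the `Socket` type and the method bodies are illustrative placeholders, not the crate's API.)

```rust
use std::io;
use std::os::unix::io::RawFd;

// Hypothetical shared inner type, in the spirit of tokio's PollEvented: the
// uring-specific submission plumbing would live here once, and the public
// stream types become thin wrappers that delegate to it.
struct Socket {
    fd: RawFd,
}

impl Socket {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        // In the real crate this would submit a uring read op for self.fd;
        // elided here.
        let _ = self.fd;
        (Ok(0), buf)
    }
}

struct TcpStream {
    inner: Socket,
}

struct UnixStream {
    inner: Socket,
}

impl TcpStream {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        self.inner.read(buf).await
    }
}

impl UnixStream {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        self.inner.read(buf).await
    }
}
```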
I like this
Hmm, I tried merging #144 with this, and it seems this gives a performance regression. I have to admit I'm surprised by this.
no_op/1 time: [205.00 ms 205.42 ms 205.85 ms]
thrpt: [485.78 Kelem/s 486.81 Kelem/s 487.81 Kelem/s]
change:
time: [+35.581% +36.358% +37.090%] (p = 0.00 < 0.05)
thrpt: [-27.055% -26.664% -26.244%]
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
no_op/32 time: [29.356 ms 29.435 ms 29.521 ms]
thrpt: [3.3874 Melem/s 3.3973 Melem/s 3.4064 Melem/s]
change:
time: [+3.9926% +4.6145% +5.2585%] (p = 0.00 < 0.05)
thrpt: [-4.9958% -4.4110% -3.8393%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
no_op/64 time: [26.168 ms 26.226 ms 26.291 ms]
thrpt: [3.8035 Melem/s 3.8130 Melem/s 3.8215 Melem/s]
change:
time: [+3.5382% +4.0286% +4.5123%] (p = 0.00 < 0.05)
thrpt: [-4.3175% -3.8725% -3.4173%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
no_op/256 time: [24.032 ms 24.080 ms 24.135 ms]
thrpt: [4.1434 Melem/s 4.1529 Melem/s 4.1612 Melem/s]
change:
time: [-0.1899% +0.1773% +0.5497%] (p = 0.36 > 0.05)
thrpt: [-0.5467% -0.1770% +0.1903%]
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Running benches/lai/no_op.rs (target/release/deps/lai_no_op-60be63e49ca92170)
runtime_only
Instructions: 30336 (+6.464519%)
L1 Accesses: 43128 (+6.368076%)
L2 Accesses: 282 (+16.52893%)
RAM Accesses: 1014 (+3.787103%)
Estimated Cycles: 80028 (+5.367935%)
no_op_x1
Instructions: 782557272 (+55.93528%)
L1 Accesses: 1167943224 (+54.27984%)
L2 Accesses: 4511358 (+204.9696%)
RAM Accesses: 1432 (+9.647779%)
Estimated Cycles: 1190550134 (+55.73512%)
no_op_x32
Instructions: 142350658 (+9.518945%)
L1 Accesses: 221759389 (+8.211193%)
L2 Accesses: 143991 (+208.8676%)
RAM Accesses: 1497 (+9.190372%)
Estimated Cycles: 222531739 (+8.439342%)
no_op_x64
Instructions: 132014910 (+6.487231%)
L1 Accesses: 206460259 (+5.337330%)
L2 Accesses: 96183 (+155.6155%)
RAM Accesses: 1565 (+8.756081%)
Estimated Cycles: 206995949 (+5.482285%)
no_op_x256
Instructions: 124419880 (+3.999416%)
L1 Accesses: 194515551 (+2.956976%)
L2 Accesses: 767606 (+18.30563%)
RAM Accesses: 1990 (+6.588109%)
Estimated Cycles: 198423231 (+3.217230%)
More notes on performance. Adding LTO flags to the release build, the fairly dreadful no_op/1 benchmark regresses significantly vs #144, but the 256-way concurrency is actually better.
no_op/1 time: [182.93 ms 183.70 ms 184.50 ms]
thrpt: [541.99 Kelem/s 544.37 Kelem/s 546.67 Kelem/s]
change:
time: [+40.007% +40.654% +41.302%] (p = 0.00 < 0.05)
thrpt: [-29.229% -28.904% -28.575%]
Performance has regressed.
no_op/256 time: [19.945 ms 20.011 ms 20.087 ms]
thrpt: [4.9784 Melem/s 4.9973 Melem/s 5.0139 Melem/s]
change:
time: [-3.8566% -3.4359% -2.9896%] (p = 0.00 < 0.05)
thrpt: [+3.0817% +3.5582% +4.0114%]
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
no_op_x1
Instructions: 566777937 (+53.60089%)
L1 Accesses: 828856458 (+51.37721%)
L2 Accesses: 312 (-99.96116%)
RAM Accesses: 1202 (+8.679928%)
Estimated Cycles: 828900088 (+50.27212%)
no_op_x256
Instructions: 89362937 (-1.095379%)
L1 Accesses: 140236031 (-2.657184%)
L2 Accesses: 732513 (+14.23632%)
RAM Accesses: 1760 (+5.960265%)
Estimated Cycles: 143960196 (-2.286151%)
This is likely noise. We might want to try using a longer benchmark period for this.
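(If a longer benchmark period is the route taken, Criterion lets a benchmark group override its measurement window. A sketch follows; the durations, sample counts, and the placeholder benchmark body are illustrative, not the repo's actual bench code.)

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

fn no_op_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("no_op");
    // Lengthen the sampling window so that small (2-5%) changes are less
    // likely to be measurement noise; these values are placeholders.
    group.measurement_time(Duration::from_secs(30));
    group.sample_size(200);
    // Placeholder standing in for the real no_op benchmarks registered here.
    group.bench_function("noop_placeholder", |b| b.iter(|| std::hint::black_box(0)));
    group.finish();
}

criterion_group!(benches, no_op_bench);
criterion_main!(benches);
```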
@Noah-Kennedy the benchmarks are performing 100k
Ah, interesting.
I'll profile this tonight then to see what is happening.
@Noah-Kennedy: You may find
Given your profile indicates you work for Cloudflare, I suspect you're fairly clued up on profiling anyway. My experience is that, unless I've got something terribly wrong, a Criterion result of more than 2% is real. Indeed, trying to find many 2% improvements is basically how I spend my days. I'm having a look at the benchmarks on master currently.
And looking at the flamegraphs, you can just disregard any low number of concurrency; thread_park totally dominates.
How should the no_op/1 and no_op/256 names be interpreted? Is the / indicating the number of concurrent tasks spawned? What does it mean that there are per-second numbers? Is a spawned task replaced by another when the first one finishes, or are a certain number run per second? I'm wondering if no_op/1 means just one operation is performed at a time, so there is little stress on the system and the test has to wait for the round trip from userland to kernel and back to userland for a single op to be counted before the next is started. And how many cores is this running on? Sorry if that is mentioned somewhere already. And is this a native machine or a VM on a hypervisor?
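(Not an answer from the thread, but for context on the naming: Criterion produces `group/parameter` ids when a benchmark is registered once per input, and the Kelem/s and Melem/s figures come from declaring element throughput. The sketch below shows how such names typically arise; it is presumably, not certainly, how the no_op benches are structured.)

```rust
use criterion::{BenchmarkId, Criterion, Throughput};

fn bench_no_op(c: &mut Criterion) {
    let mut group = c.benchmark_group("no_op");
    // Each parameter value becomes its own benchmark id: no_op/1, no_op/32, ...
    for concurrency in [1u64, 32, 64, 256] {
        // Declaring throughput in elements is what makes Criterion report
        // Kelem/s and Melem/s figures alongside wall-clock time.
        group.throughput(Throughput::Elements(concurrency));
        group.bench_with_input(
            BenchmarkId::from_parameter(concurrency),
            &concurrency,
            |b, &n| {
                b.iter(|| {
                    // Submit `n` no-op operations concurrently and wait for
                    // all of them to complete (details elided here).
                    std::hint::black_box(n)
                })
            },
        );
    }
    group.finish();
}
```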
Do we want to gate this PR on improving the no-op perf? I'd vote for not doing so, but I'd like to get your thoughts @ollie-etl @FrankReh.
I thought that's what the two of you wanted, and maybe you, Noah, were going to find time to check out the criterion results. I would rather we get this committed after the review that I started a few minutes ago. At least two of us want to get better at Criterion analysis, but we should err on the side of cleaning up the design so further work can be more fruitful. There is a backlog of work the three of us have, as well as PRs in the queue, and we have the chance to get to them all soon now.
@FrankReh I think that we should probably merge this asap since it touches things we want to change in other PRs. Merging it post-review sounds like a good idea here. I'll resolve the conflicts.
This change refactors the runtime to not use reference counting directly in the ops themselves. Instead, ops access the driver via thread local variables. This is sound because dropping the driver (which happens when it is removed from its thread-local state) blocks the thread until all ops complete, ensuring that we do not free the contents of the driver until after all operations have completed.
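(A rough sketch of the invariant described above; the struct and method names are illustrative and the real shutdown path is more involved.)

```rust
// Hypothetical driver with just enough state to illustrate the shutdown rule.
struct Driver {
    in_flight: usize, // count of ops still owned by the kernel
}

impl Driver {
    fn submit_and_wait(&mut self) {
        // The real driver would call io_uring submit_and_wait here; for the
        // sketch we just pretend one completion arrived.
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}

impl Drop for Driver {
    // Dropping the driver (which happens when it is removed from its
    // thread-local slot) blocks until every in-flight op has completed, so an
    // op can never observe freed driver state through the TLS handle.
    fn drop(&mut self) {
        while self.in_flight > 0 {
            self.submit_and_wait();
        }
    }
}
```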
@FrankReh merge conflicts are resolved
From my perspective, please merge
Cool, as soon as @FrankReh signs off, I'll merge.
Cool. Small questions.
Be back in an hour.
@FrankReh thanks for the review! I've addressed your findings.
Looks very good to me!
Interesting. I've run a few rounds of testing my buf_ring branch with these changes. Nothing is broken, but it could be that the driver tick isn't being called as often as it had been before. That's interesting. We want to visit a separate mechanism for submission and completion checking anyway.
Oh, we absolutely need to. Like I've said before, we should do this the same way that tokio invokes epoll_wait: by leveraging the maintenance hook for periodic polling, and by then making use of parking (as we do now).
I suspect that the main impacting change is moving the driver ticking to a separate task.
Me too. The tokio scheduler could make us crazy if we worry too much about which task gets to go most often and how long the ready queue is getting in our stress/flood tests.
It would probably be best if we tried to avoid using the task as anything other than a way of getting tokio to unpark when uring completions arrive. I suspect that driving completions on unpark could help improve perf a fair bit, as would periodically flushing the queue every X ticks by getting something into tokio to hook into the maintenance routine.
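(One hedged sketch of that direction, using tokio's `AsyncFd` on the ring's eventfd so the task does nothing but unpark the runtime and drain completions; the function and its details are illustrative, not the crate's current code.)

```rust
use std::io;
use std::os::unix::io::RawFd;
use tokio::io::unix::AsyncFd;

// A dispatch task whose only job is to get tokio to unpark when uring
// completions arrive: the eventfd registered with the ring is wrapped in
// AsyncFd, so the tokio reactor handles the parking for us.
async fn completion_dispatch(ring_event_fd: RawFd) -> io::Result<()> {
    let afd = AsyncFd::new(ring_event_fd)?;
    loop {
        let mut guard = afd.readable().await?;
        // Drain the uring completion queue here (elided), then clear the
        // readiness so we park again until the kernel posts more completions.
        guard.clear_ready();
    }
}
```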
- tokio-rs#148 exposed Runtime. It is now possible to use Criterion's async support directly. Do so.
- Adds criterion support for flamegraph generation of only the benchmarked code (pprof).
- Now crashes on second call to block_on, as the AsyncFd still exists. Introduced in tokio-rs#142.
# 0.4.0 (November 5th, 2022)

### Fixed

- Fix panic in Deref/DerefMut for Slice extending into uninitialized part of the buffer ([#52])
- docs: all-features = true ([#84])
- fix fs unit tests to avoid parallelism ([#121])
- Box the socket address to allow moving the Connect future ([#126])
- rt: Fix data race ([#146])

### Added

- Implement fs::File::readv_at()/writev_at() ([#87])
- fs: implement FromRawFd for File ([#89])
- Implement `AsRawFd` for `TcpStream` ([#94])
- net: add TcpListener.local_addr method ([#107])
- net: add TcpStream.write_all ([#111])
- driver: add Builder API as an option to start ([#113])
- Socket and TcpStream shutdown ([#124])
- fs: implement fs::File::from_std ([#131])
- net: implement FromRawFd for TcpStream ([#132])
- fs: implement OpenOptionsExt for OpenOptions ([#133])
- Add NoOp support ([#134])
- Add writev to TcpStream ([#136])
- sync TcpStream, UnixStream and UdpSocket functionality ([#141])
- Add benchmarks for no-op submission ([#144])
- Expose runtime structure ([#148])

### Changed

- driver: batch submit requests and add benchmark ([#78])
- Depend on io-uring version ^0.5.8 ([#153])

### Internal Improvements

- chore: fix clippy lints ([#99])
- io: refactor post-op logic in ops into Completable ([#116])
- Support multi completion events: v2 ([#130])
- simplify driver operation futures ([#139])
- rt: refactor runtime to avoid Rc\<RefCell\<...>> ([#142])
- Remove unused dev-dependencies ([#143])
- chore: types and fields explicitly named ([#149])
- Ignore errors from uring while cleaning up ([#154])
- rt: drop runtime before driver during shutdown ([#155])
- rt: refactor drop logic ([#157])
- rt: fix error when calling block_on twice ([#162])

### CI changes

- chore: update actions/checkout action to v3 ([#90])
- chore: add all-systems-go ci check ([#98])
- chore: add clippy to ci ([#100])
- ci: run cargo test --doc ([#135])

[#52]: #52
[#78]: #78
[#84]: #84
[#87]: #87
[#89]: #89
[#90]: #90
[#94]: #94
[#98]: #98
[#99]: #99
[#100]: #100
[#107]: #107
[#111]: #111
[#113]: #113
[#116]: #116
[#121]: #121
[#124]: #124
[#126]: #126
[#130]: #130
[#131]: #131
[#132]: #132
[#133]: #133
[#134]: #134
[#135]: #135
[#136]: #136
[#139]: #139
[#141]: #141
[#142]: #142
[#143]: #143
[#144]: #144
[#146]: #146
[#148]: #148
[#149]: #149
[#153]: #153
[#154]: #154
[#155]: #155
[#157]: #157
[#162]: #162