rt: refactor runtime to avoid Rc<RefCell<...>> #142
Conversation
Huh, that doesn't happen on my machine.
Hmm, actually it does; it's just dependent on the order the tests run in (maybe?)
Never mind, CLion has a bug and apparently hides test failures with double panics.
Well, that worked.
Nice. You've removed the Rc from the highest level because a driver reference is no longer kept per running Op. A running Op can reach the slab and the uring instance through the TLS variable. This doesn't completely rule out ever supporting multiple urings in the same current-thread runtime, but that would require further refactoring, as it would have anyway. It's good to differentiate the runtime in this crate from the uring instance; it seems to allow further flexibility later on.
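(For readers following the thread, here is a minimal sketch of the thread-local pattern described above. The names `Driver`, `CURRENT_DRIVER`, and `with_driver` are illustrative placeholders, not the crate's actual items.)

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the driver state described above; the real
// driver also owns the slab of in-flight ops and the uring instance.
struct Driver {
    in_flight: usize,
}

thread_local! {
    // One driver per current-thread runtime, owned by thread-local storage
    // instead of an Rc<RefCell<...>> handle cloned into every Op.
    static CURRENT_DRIVER: RefCell<Option<Driver>> = RefCell::new(None);
}

// An in-flight operation reaches the driver through the TLS slot rather than
// holding its own reference-counted copy.
fn with_driver<R>(f: impl FnOnce(&mut Driver) -> R) -> R {
    CURRENT_DRIVER.with(|slot| {
        let mut slot = slot.borrow_mut();
        let driver = slot.as_mut().expect("no uring driver on this thread");
        f(driver)
    })
}
```

The point being made is that an Op only borrows the driver for the duration of a call, so nothing needs to keep a reference count alive per operation.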
I'm going to be opening several additional refactoring PRs. I'd like to combine a lot of our TCP and Unix logic similar to what tokio does internally with PollEvented.
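(A hedged sketch of what combining the TCP and Unix logic could look like; the `Socket` type and the method bodies are illustrative placeholders, not the crate's API.)

```rust
use std::io;
use std::os::unix::io::RawFd;

// Hypothetical shared inner type, in the spirit of tokio's PollEvented: the
// uring-specific submission plumbing would live here once, and the public
// stream types become thin wrappers that delegate to it.
struct Socket {
    fd: RawFd,
}

impl Socket {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        // In the real crate this would submit a uring read op for self.fd;
        // elided here.
        let _ = self.fd;
        (Ok(0), buf)
    }
}

struct TcpStream {
    inner: Socket,
}

struct UnixStream {
    inner: Socket,
}

impl TcpStream {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        self.inner.read(buf).await
    }
}

impl UnixStream {
    async fn read(&self, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
        self.inner.read(buf).await
    }
}
```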
I like this
Hmm, I tried merging #144 with this, and it seems this gives a performance regression. I have to admit I'm surprised by this.
no_op/1 time: [205.00 ms 205.42 ms 205.85 ms]
thrpt: [485.78 Kelem/s 486.81 Kelem/s 487.81 Kelem/s]
change:
time: [+35.581% +36.358% +37.090%] (p = 0.00 < 0.05)
thrpt: [-27.055% -26.664% -26.244%]
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
no_op/32 time: [29.356 ms 29.435 ms 29.521 ms]
thrpt: [3.3874 Melem/s 3.3973 Melem/s 3.4064 Melem/s]
change:
time: [+3.9926% +4.6145% +5.2585%] (p = 0.00 < 0.05)
thrpt: [-4.9958% -4.4110% -3.8393%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
no_op/64 time: [26.168 ms 26.226 ms 26.291 ms]
thrpt: [3.8035 Melem/s 3.8130 Melem/s 3.8215 Melem/s]
change:
time: [+3.5382% +4.0286% +4.5123%] (p = 0.00 < 0.05)
thrpt: [-4.3175% -3.8725% -3.4173%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
no_op/256 time: [24.032 ms 24.080 ms 24.135 ms]
thrpt: [4.1434 Melem/s 4.1529 Melem/s 4.1612 Melem/s]
change:
time: [-0.1899% +0.1773% +0.5497%] (p = 0.36 > 0.05)
thrpt: [-0.5467% -0.1770% +0.1903%]
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Running benches/lai/no_op.rs (target/release/deps/lai_no_op-60be63e49ca92170)
runtime_only
Instructions: 30336 (+6.464519%)
L1 Accesses: 43128 (+6.368076%)
L2 Accesses: 282 (+16.52893%)
RAM Accesses: 1014 (+3.787103%)
Estimated Cycles: 80028 (+5.367935%)
no_op_x1
Instructions: 782557272 (+55.93528%)
L1 Accesses: 1167943224 (+54.27984%)
L2 Accesses: 4511358 (+204.9696%)
RAM Accesses: 1432 (+9.647779%)
Estimated Cycles: 1190550134 (+55.73512%)
no_op_x32
Instructions: 142350658 (+9.518945%)
L1 Accesses: 221759389 (+8.211193%)
L2 Accesses: 143991 (+208.8676%)
RAM Accesses: 1497 (+9.190372%)
Estimated Cycles: 222531739 (+8.439342%)
no_op_x64
Instructions: 132014910 (+6.487231%)
L1 Accesses: 206460259 (+5.337330%)
L2 Accesses: 96183 (+155.6155%)
RAM Accesses: 1565 (+8.756081%)
Estimated Cycles: 206995949 (+5.482285%)
no_op_x256
Instructions: 124419880 (+3.999416%)
L1 Accesses: 194515551 (+2.956976%)
L2 Accesses: 767606 (+18.30563%)
RAM Accesses: 1990 (+6.588109%)
Estimated Cycles: 198423231 (+3.217230%)
More notes on performance. Adding LTO flags to the release build, the fairly dreadful no_op/1 benchmark regresses significantly vs #144, but the 256-way concurrency is actually better.
no_op/1 time: [182.93 ms 183.70 ms 184.50 ms]
thrpt: [541.99 Kelem/s 544.37 Kelem/s 546.67 Kelem/s]
change:
time: [+40.007% +40.654% +41.302%] (p = 0.00 < 0.05)
thrpt: [-29.229% -28.904% -28.575%]
Performance has regressed.
no_op/256 time: [19.945 ms 20.011 ms 20.087 ms]
thrpt: [4.9784 Melem/s 4.9973 Melem/s 5.0139 Melem/s]
change:
time: [-3.8566% -3.4359% -2.9896%] (p = 0.00 < 0.05)
thrpt: [+3.0817% +3.5582% +4.0114%]
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
no_op_x1
Instructions: 566777937 (+53.60089%)
L1 Accesses: 828856458 (+51.37721%)
L2 Accesses: 312 (-99.96116%)
RAM Accesses: 1202 (+8.679928%)
Estimated Cycles: 828900088 (+50.27212%)
no_op_x256
Instructions: 89362937 (-1.095379%)
L1 Accesses: 140236031 (-2.657184%)
L2 Accesses: 732513 (+14.23632%)
RAM Accesses: 1760 (+5.960265%)
Estimated Cycles: 143960196 (-2.286151%)
This is likely noise. We might want to try using a longer benchmark period for this.
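(If a longer benchmark period is the route taken, Criterion lets a benchmark group override its measurement window. A sketch follows; the durations, sample counts, and the placeholder benchmark body are illustrative, not the repo's actual bench code.)

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

fn no_op_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("no_op");
    // Lengthen the sampling window so that small (2-5%) changes are less
    // likely to be measurement noise; these values are placeholders.
    group.measurement_time(Duration::from_secs(30));
    group.sample_size(200);
    // Placeholder standing in for the real no_op benchmarks registered here.
    group.bench_function("noop_placeholder", |b| b.iter(|| std::hint::black_box(0)));
    group.finish();
}

criterion_group!(benches, no_op_bench);
criterion_main!(benches);
```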
@Noah-Kennedy the benchmarks are performing 100k
Ah, interesting.
I'll profile this tonight then to see what is happening.
@Noah-Kennedy: You may find
Given your profile indicates you work for Cloudflare, I suspect you're fairly clued up on profiling anyway. My experience is that, unless I've got something terribly wrong, a Criterion result of more than 2% is real. Indeed, trying to find many 2% improvements is basically how I spend my days. I'm having a look at the benchmarks on master currently.
And looking at the flamegraphs, you can just disregard any low number of concurrency; thread_park totally dominates.
How should the no_op/1 and no_op/256 names be interpreted? Is the / indicating the number of concurrent tasks spawned? What does it mean that there are per-second numbers? Is a spawned task replaced by another when the first one finishes, or are a certain number run per second? I'm wondering if no_op/1 means just one operation is performed at a time, so there is little stress on the system and the test has to wait for the round trip from userland to kernel and back to userland for a single op to be counted before the next is started. And how many cores is this running on? Sorry if that is mentioned somewhere already. And is this a native machine or a VM on a hypervisor?
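(Not an answer from the thread, but for context on the naming: Criterion produces `group/parameter` ids when a benchmark is registered once per input, and the Kelem/s and Melem/s figures come from declaring element throughput. The sketch below shows how such names typically arise; it is presumably, not certainly, how the no_op benches are structured.)

```rust
use criterion::{BenchmarkId, Criterion, Throughput};

fn bench_no_op(c: &mut Criterion) {
    let mut group = c.benchmark_group("no_op");
    // Each parameter value becomes its own benchmark id: no_op/1, no_op/32, ...
    for concurrency in [1u64, 32, 64, 256] {
        // Declaring throughput in elements is what makes Criterion report
        // Kelem/s and Melem/s figures alongside wall-clock time.
        group.throughput(Throughput::Elements(concurrency));
        group.bench_with_input(
            BenchmarkId::from_parameter(concurrency),
            &concurrency,
            |b, &n| {
                b.iter(|| {
                    // Submit `n` no-op operations concurrently and wait for
                    // all of them to complete (details elided here).
                    std::hint::black_box(n)
                })
            },
        );
    }
    group.finish();
}
```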
Do we want to gate this PR on improving the no-op perf? I'd vote for not doing so, but I'd like to get your thoughts @ollie-etl @FrankReh.
I thought that's what the two of you wanted, and maybe you, Noah, were going to find time to check out the criterion results. I would rather we get this committed after the review that I started a few minutes ago. At least two of us want to get better at Criterion analysis, but we should err on the side of cleaning up the design so further work can be more fruitful. There is a backlog of work the three of us have, as well as PRs in the queue, and we have the chance to get to them all soon now.
@FrankReh I think that we should probably merge this asap since it touches things we want to change in other PRs. Merging it post-review sounds like a good idea here. I'll resolve the conflicts.
This change refactors the runtime to not use reference counting directly in the ops themselves. Instead, ops access the driver via thread local variables. This is sound because dropping the driver (which happens when it is removed from its thread-local state) blocks the thread until all ops complete, ensuring that we do not free the contents of the driver until after all operations have completed.
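(A rough sketch of the invariant described above; the struct and method names are illustrative and the real shutdown path is more involved.)

```rust
// Hypothetical driver with just enough state to illustrate the shutdown rule.
struct Driver {
    in_flight: usize, // count of ops still owned by the kernel
}

impl Driver {
    fn submit_and_wait(&mut self) {
        // The real driver would call io_uring submit_and_wait here; for the
        // sketch we just pretend one completion arrived.
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}

impl Drop for Driver {
    // Dropping the driver (which happens when it is removed from its
    // thread-local slot) blocks until every in-flight op has completed, so an
    // op can never observe freed driver state through the TLS handle.
    fn drop(&mut self) {
        while self.in_flight > 0 {
            self.submit_and_wait();
        }
    }
}
```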
@FrankReh merge conflicts are resolved
From my perspective, please merge
Cool, as soon as @FrankReh signs off, I'll merge.
Cool. Small questions.
Be back in an hour.
@FrankReh thanks for the review! I've addressed your findings.
Looks very good to me!
Interesting. I've run a few rounds of testing my buf_ring branch with these changes. Nothing is broken, but it could be that the driver tick isn't being called as often as it had been before. That's interesting. We want to visit a separate mechanism for submission and completion checking anyway.
Oh, we absolutely need to. Like I've said before, we should do this the same way that tokio invokes epoll_wait: by leveraging the maintenance hook for periodic polling, and by then making use of parking (as we do now).
I suspect that the main impacting change is moving the driver ticking to a separate task.
Me too. The tokio scheduler could make us crazy if we worry too much about which task gets to go most often and how long the ready queue is getting in our stress/flood tests.
It would probably be best if we tried to avoid using the task as anything other than a way of getting tokio to unpark when uring completions arrive. I suspect that driving completions on unpark could help improve perf a fair bit, as would periodically flushing the queue every X ticks by getting something into tokio to hook into the maintenance routine.
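(One hedged sketch of that direction, using tokio's `AsyncFd` on the ring's eventfd so the task does nothing but unpark the runtime and drain completions; the function and its details are illustrative, not the crate's current code.)

```rust
use std::io;
use std::os::unix::io::RawFd;
use tokio::io::unix::AsyncFd;

// A dispatch task whose only job is to get tokio to unpark when uring
// completions arrive: the eventfd registered with the ring is wrapped in
// AsyncFd, so the tokio reactor handles the parking for us.
async fn completion_dispatch(ring_event_fd: RawFd) -> io::Result<()> {
    let afd = AsyncFd::new(ring_event_fd)?;
    loop {
        let mut guard = afd.readable().await?;
        // Drain the uring completion queue here (elided), then clear the
        // readiness so we park again until the kernel posts more completions.
        guard.clear_ready();
    }
}
```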
- tokio-rs#148 exposed Runtime. It is now possible to use Criterion's async support directly. Do so.
- Adds criterion support for flamegraph generation of only the benchmarked code (pprof).
- Now crashes on second call to block_on, as the AsyncFd still exists. Introduced in tokio-rs#142.
# 0.4.0 (November 5th, 2022)

### Fixed

- Fix panic in Deref/DerefMut for Slice extending into uninitialized part of the buffer ([#52])
- docs: all-features = true ([#84])
- fix fs unit tests to avoid parallelism ([#121])
- Box the socket address to allow moving the Connect future ([#126])
- rt: Fix data race ([#146])

### Added

- Implement fs::File::readv_at()/writev_at() ([#87])
- fs: implement FromRawFd for File ([#89])
- Implement `AsRawFd` for `TcpStream` ([#94])
- net: add TcpListener.local_addr method ([#107])
- net: add TcpStream.write_all ([#111])
- driver: add Builder API as an option to start ([#113])
- Socket and TcpStream shutdown ([#124])
- fs: implement fs::File::from_std ([#131])
- net: implement FromRawFd for TcpStream ([#132])
- fs: implement OpenOptionsExt for OpenOptions ([#133])
- Add NoOp support ([#134])
- Add writev to TcpStream ([#136])
- sync TcpStream, UnixStream and UdpSocket functionality ([#141])
- Add benchmarks for no-op submission ([#144])
- Expose runtime structure ([#148])

### Changed

- driver: batch submit requests and add benchmark ([#78])
- Depend on io-uring version ^0.5.8 ([#153])

### Internal Improvements

- chore: fix clippy lints ([#99])
- io: refactor post-op logic in ops into Completable ([#116])
- Support multi completion events: v2 ([#130])
- simplify driver operation futures ([#139])
- rt: refactor runtime to avoid Rc\<RefCell\<...>> ([#142])
- Remove unused dev-dependencies ([#143])
- chore: types and fields explicitly named ([#149])
- Ignore errors from uring while cleaning up ([#154])
- rt: drop runtime before driver during shutdown ([#155])
- rt: refactor drop logic ([#157])
- rt: fix error when calling block_on twice ([#162])

### CI changes

- chore: update actions/checkout action to v3 ([#90])
- chore: add all-systems-go ci check ([#98])
- chore: add clippy to ci ([#100])
- ci: run cargo test --doc ([#135])

[#52]: #52
[#78]: #78
[#84]: #84
[#87]: #87
[#89]: #89
[#90]: #90
[#94]: #94
[#98]: #98
[#99]: #99
[#100]: #100
[#107]: #107
[#111]: #111
[#113]: #113
[#116]: #116
[#121]: #121
[#124]: #124
[#126]: #126
[#130]: #130
[#131]: #131
[#132]: #132
[#133]: #133
[#134]: #134
[#135]: #135
[#136]: #136
[#139]: #139
[#141]: #141
[#142]: #142
[#143]: #143
[#144]: #144
[#146]: #146
[#148]: #148
[#149]: #149
[#153]: #153
[#154]: #154
[#155]: #155
[#157]: #157
[#162]: #162