Make std::sync::Arc compatible with ThreadSanitizer #65097
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
r? @RalfJung (take your time) |
Cc @jhjourdan who did the formal verification of current |
I don't think I know the correctness argument of Do we have anyone else who is enough of an expert in weak memory to help? IIRC @aturon wrote most of this, but they are busy with other things these days. |
Looking at the changes, this seems to mostly un-do deliberate optimizations that specifically keep synchronization to the absolute minimum that is necessary. I am not sure if "sanitizers do not properly support fences" is a good enough argument to change our code -- shouldn't sanitizers be fixed instead? Fences are an important part of Rust's and C's concurrency model. I feel before we do a correctness review, we need a policy decision by some team (T-libs, I assume) whether we are willing to reduce code quality in order to help sanitizers. |
AFAIK there are no known algorithmic approaches supporting memory barriers with Changes here have no effect on tier 1, since generated code on i686 / x86_64 is I also mentioned other implementations (libc++ / libstdc++) as implicit |
Did you test that? I know that lowering of atomic operations to x86 assembly is the same, but this change will affect compiler transformations even on x86. So there can still easily be differences. |
I extracted the code from Arc (before and after changes) and examined |
Interesting. I thought I knew that these fences are equivalent to doing a release-acquire RMW from a global location, which would be trivial to check algorithmically (assuming a release/acquire checker is already implemented), but I may misremember.
Thanks, that is useful. So, the nominated question here for @rust-lang/libs is: we have a trade-off here between (a) keeping the existing, highly optimized, well-reviewed and even formally analyzed code, which however at least current dynamic thread sanitizers cannot handle properly, and (b) changing this code to be less efficient in theory due to stronger synchronization, but ultimately simpler for the same reason (and with likely no perf change in practice), working better with thread sanitizers, but also changing extremely subtle code that is very well-tested (any change has some risk of introducing a bug) and losing the existing formal results. What do you think we should do? I am asking you because I don't think I should make such calls. My personal feeling is: I am a big fan of sanitizers, so it seems worth sacrificing some entirely theoretical performance benefit of the current code for better testability. However, losing the existing formal analyses is an unfortunate regression. That said, comparing the assembly gives some assurance that at least for simple clients, the behavior did not change. |
I would like to benchmark this on ARM first since that platform seems to be affected. The main issue here is that we will have to unconditionally emit an acquire fence even if the ref count isn't dropping to zero. |
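I ran a quick benchmark comparison of `fetch_sub(Release)` and `fetch_sub(AcqRel)` but couldn't find a difference in performance (they both report a consistent 20ns). I did check the assembly and `Release` uses `ldxr` while `AcqRel` uses `ldaxr`, and that is the only difference. In theory `ldaxr` is slower than `ldxr` since it acts as an acquire barrier which prevents loads after it from being executed before it in an out-of-order CPU. I guess this isn't visible in this benchmark since there are no other loads. |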
On Thu, Oct 10, 2019 at 10:21:34AM -0700, Amanieu wrote:
I ran a quick benchmark comparison of `fetch_sub(Release)` and `fetch_sub(AcqRel)` but couldn't find a difference in performance (they both report a consistent 20ns). I did check the assembly and `Release` uses `ldxr` while `AcqRel` uses `ldaxr`, and that is the only difference.
In theory `ldaxr` is slower than `ldxr` since it acts as an acquire barrier which prevents loads after it from being executed before it in an out-of-order CPU. I guess this isn't visible in this benchmark since there are no other loads.
Right; in general you have to take care in benchmarking atomic
operations and synchronized operations, as their costs depend greatly on
other operations in the same CPU's pipeline, and on the cache state of the
system (which depends on ongoing operations on other CPUs).
|
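For readers who want to repeat the comparison discussed above, here is a minimal sketch of such a micro benchmark. It is not the benchmark used in the thread; the function name `time_fetch_sub` is hypothetical, and the loop is single-threaded and uncontended, so it has exactly the limitation pointed out above: with no surrounding loads and no cache traffic from other CPUs, the two orderings may well look identical.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical helper: performs `iters` decrements with the given ordering
/// and returns the elapsed time together with the final counter value.
pub fn time_fetch_sub(order: Ordering, iters: usize) -> (Duration, usize) {
    // Start at `iters` so the counter never underflows.
    let counter = AtomicUsize::new(iters);
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` keeps the compiler from discarding the returned value.
        black_box(counter.fetch_sub(1, order));
    }
    (start.elapsed(), counter.load(Ordering::Relaxed))
}
```

Calling it once with `Ordering::Release` and once with `Ordering::AcqRel` and comparing the durations gives a rough first impression only; a meaningful result would need contention from other threads and realistic surrounding code.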
The comments in the code indicate where all this logic originally came from (Boost) and its history also shows that this is an extremely sensitive operation for performance (see #41714 for example). Benchmarking this change to evaluate its performance impact would be quite difficult, but leaning on users who have previously benchmarked various changes here (such as Servo/Gecko) is a good start. I don't think that the correctness of the current code is in question, this looks like it's a change intended on making |
Ping from triage: @rust-lang/libs @tmiasko any updates on this? |
To summarize, this PR is changing a few cases of:

    fn drop(&mut self) {
        if self.inner().strong.fetch_sub(1, Release) != 1 {
            return;
        }
        atomic::fence(Acquire);
        /* drop contents of Arc */
    }

to:

    fn drop(&mut self) {
        if self.inner().strong.fetch_sub(1, AcqRel) != 1 {
            return;
        }
        /* drop contents of Arc */
    }

This places an extra requirement to do an acquire synchronization in what may be hot code (cloning and dropping Arc references should be cheap). Before, the acquire was in the colder path that drops the final reference of the Arc and drops the contents.

If avoiding fences is a goal, something like #41714 seems like the much better option to me, i.e. change the same code to:

    fn drop(&mut self) {
        if self.inner().strong.fetch_sub(1, Release) != 1 {
            return;
        }
        let _ = self.inner().strong.load(Acquire);
        /* drop contents of Arc */
    }

This keeps the hot path the same, and it adds at most an extra I would actually feel better about an acquire load instead of a fence, because of the trickiness of fences (Atomic-fence synchronization). Using a fence just to optimize out one instruction in a colder path seems like a questionable optimization, especially because the processor knows just as well as we do that there are no other references to this atomic. So it can't have been changed in the meantime, and should still be in a register or cache. The |
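The three variants summarized above can be put side by side in a self-contained sketch. This is not the actual `Arc` code: `StrongCount` and its method names are hypothetical stand-ins for the strong count, and each method only reports whether the caller released the last reference instead of dropping any contents.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

/// Hypothetical stand-in for the strong reference count of an `Arc`.
pub struct StrongCount(AtomicUsize);

impl StrongCount {
    pub fn new(n: usize) -> Self {
        StrongCount(AtomicUsize::new(n))
    }

    /// Release decrement; the acquire fence runs only on the last-reference path.
    pub fn release_with_fence(&self) -> bool {
        if self.0.fetch_sub(1, Ordering::Release) != 1 {
            return false;
        }
        fence(Ordering::Acquire);
        true
    }

    /// AcqRel decrement: every decrement pays for the acquire, no fence needed.
    pub fn release_with_acq_rel(&self) -> bool {
        self.0.fetch_sub(1, Ordering::AcqRel) == 1
    }

    /// Release decrement; an acquire load replaces the fence on the
    /// last-reference path, so the hot path is unchanged.
    pub fn release_with_load(&self) -> bool {
        if self.0.fetch_sub(1, Ordering::Release) != 1 {
            return false;
        }
        let _ = self.0.load(Ordering::Acquire);
        true
    }
}
```

In all three variants only the caller that observes the previous value `1` proceeds to the cleanup path; the variants differ in where the acquire half of the synchronization is paid.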
According to Atomic-fence synchronization a fence binds to a nearby atomic operation.

    atomic_a.load(Relaxed);
    atomic_b.store(Relaxed);
    fence(Acquire);
    atomic_b.load(Relaxed);

An acquire fence works together with the atomic that last did a load operation, in the example In the same vein The reordering rules for fences are carefully worded to not talk about themselves, but the previous read or the next write (Acquire and Release Fences Don't Work the Way You'd Expect). In all cases a fence needs operations on some atomic to bind to, a fence-fence synchronization without atomics does not seem to be a thing. |
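To make the point about fences binding to an atomic operation concrete, here is a small sketch of fence-based message passing. The function `publish_and_read` and the variables `data` and `ready` are hypothetical names chosen for illustration. The reader uses only a relaxed load of `ready`; it is the acquire fence placed after that load which establishes synchronization with the writer's release store.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Hypothetical example: one thread publishes `value`, the caller reads it back.
pub fn publish_and_read(value: u32) -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);

    thread::scope(|s| {
        s.spawn(|| {
            data.store(value, Ordering::Relaxed);
            // Release store: the write to `data` is ordered before it.
            ready.store(true, Ordering::Release);
        });

        // A relaxed load on its own does not synchronize with the writer.
        while !ready.load(Ordering::Relaxed) {
            spin_loop();
        }
        // The fence pairs with the preceding relaxed load of `ready`, which
        // read the value written by the release store above.
        fence(Ordering::Acquire);
        data.load(Ordering::Relaxed)
    })
}
```

Without the load of `ready` before it, the fence would have nothing to bind to, which is the observation made in the comment above.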
Indeed, synchronization always arises from a reads-from edge. Fences can just "upgrade" relaxed reads/writes to still incur synchronization. |
Ping from triage: @tmiasko and @rust-lang/libs what do you think about the above comments? |
I think my previous comment about a performance investigation still stands. |
The approach suggested so far (including current implementation) are:
I performed micro benchmarks on x86-64 (https://github.com/tmiasko/arc):
@pitdicker I agree that the second approach would be preferable to the third @alexcrichton I looked briefly at servo benchmarks (test-dromaeo, test-perf), Unfortunately neither of those approaches is a Pareto improvement over current |
It's really difficult to measure the performance here unfortunately, and I think that focusing only on one architecture may be missing the purpose of these atomics as well. I also believe that (2) as you proposed is an incorrect solution because I believe that an acquire load only synchronizes with one release store, and the reason we use a fence is that we want to synchronize with all of the release stores, not just one. |
If you change the barrier types when ThreadSanitizer is enabled, you risk masking bugs you might have been trying to debug.
(And also, we *definitely* shouldn't change the required barrier type unconditionally based on the limitations of a debugging tool.)
|
Note that the synchronization used here is not really stronger, so there is no Furthermore if you are actually trying to debug issues in Arc / Weak, |
One could actually argue synchronization is weaker, because an acquire fence syncs with all release writes/fences, while an acquire load only syncs with release writes to the same location. |
Make std::sync::Arc compatible with ThreadSanitizer The memory fences used previously in Arc implementation are not properly understood by thread sanitizer as synchronization primitives. This had unfortunate effect where running any non-trivial program compiled with `-Z sanitizer=thread` would result in numerous false positives. Replace acquire fences with acquire loads to address the issue. Fixes rust-lang#39608.
Rollup of 16 pull requests

Successful merges:
- rust-lang#65097 (Make std::sync::Arc compatible with ThreadSanitizer)
- rust-lang#69033 (Use generator resume arguments in the async/await lowering)
- rust-lang#69997 (add `Option::{zip,zip_with}` methods under "option_zip" gate)
- rust-lang#70038 (Remove the call that makes miri fail)
- rust-lang#70058 (can_begin_literal_maybe_minus: `true` on `"-"? lit` NTs.)
- rust-lang#70111 (BTreeMap: remove shared root)
- rust-lang#70139 (add delay_span_bug to TransmuteSizeDiff, just to be sure)
- rust-lang#70165 (Remove the erase regions MIR transform)
- rust-lang#70166 (Derive PartialEq, Eq and Hash for RangeInclusive)
- rust-lang#70176 (Add tests for rust-lang#58319 and rust-lang#65131)
- rust-lang#70177 (Fix oudated comment for NamedRegionMap)
- rust-lang#70184 (expand_include: set `.directory` to dir of included file.)
- rust-lang#70187 (more clippy fixes)
- rust-lang#70188 (Clean up E0439 explanation)
- rust-lang#70189 (Abi::is_signed: assert that we are a Scalar)
- rust-lang#70194 (#[must_use] on split_off())

Failed merges:

r? @ghost
☔ The latest upstream changes (presumably #70205) made this pull request unmergeable. Please resolve the merge conflicts. |
Thank you @tmiasko for working on this. This should get us one step closer to finding races between C++ and Rust code in Firefox. And fwiw, we have taken similar measures to replace a fence in our implementation with a load for TSan: |
The memory fences used previously in Arc implementation are not properly understood by thread sanitizer as synchronization primitives. This had unfortunate effect where running any non-trivial program compiled with `-Z sanitizer=thread` would result in numerous false positives.

Replace acquire fences with acquire loads to address the issue.

Fixes #39608.