Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize vec::retain performance #91527

Merged
merged 2 commits into from
Dec 16, 2021
Merged

Optimize vec::retain performance #91527

merged 2 commits into from
Dec 16, 2021

Conversation

the8472
Copy link
Member

@the8472 the8472 commented Dec 4, 2021

This simply moves the loops into the inner function which leads to better results.

old:

test vec::bench_retain_100000                            ... bench:     203,828 ns/iter (+/- 2,101)
test vec::bench_retain_iter_100000                       ... bench:      63,324 ns/iter (+/- 12,305)
test vec::bench_retain_whole_100000                      ... bench:      42,989 ns/iter (+/- 291)


new:

test vec::bench_retain_100000                            ... bench:      42,180 ns/iter (+/- 451)
test vec::bench_retain_iter_100000                       ... bench:      65,167 ns/iter (+/- 11,971)
test vec::bench_retain_whole_100000                      ... bench:      33,736 ns/iter (+/- 12,404)

Measured on x86_64-unknown-linux-gnu, Zen2

Fixes #91497

Add `into_iter().filter().collect()` as a comparison point since it was reported to be faster than `retain`.
Remove clone inside benchmark loop to reduce allocator noise.
This simply moves the loops into the inner function which leads to better results.


```
old:

test vec::bench_retain_100000                            ... bench:     203,828 ns/iter (+/- 2,101)
test vec::bench_retain_iter_100000                       ... bench:      63,324 ns/iter (+/- 12,305)
test vec::bench_retain_whole_100000                      ... bench:      42,989 ns/iter (+/- 291)


new:

test vec::bench_retain_100000                            ... bench:      42,180 ns/iter (+/- 451)
test vec::bench_retain_iter_100000                       ... bench:      65,167 ns/iter (+/- 11,971)
test vec::bench_retain_whole_100000                      ... bench:      33,736 ns/iter (+/- 12,404)
```
@rust-highfive
Copy link
Collaborator

r? @joshtriplett

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 4, 2021
@the8472
Copy link
Member Author

the8472 commented Dec 4, 2021

retain is used in some places in the compiler, some are fairly warm code, let's see if it has an impact.

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 4, 2021
@bors
Copy link
Contributor

bors commented Dec 4, 2021

⌛ Trying commit 67180ef with merge 7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5...

@kageru
Copy link

kageru commented Dec 4, 2021

Testing this PR on the system from my report in #91497 (Ryzen 5900X with Linux).
Old:

test vec::bench_retain_100000                            ... bench:     188,266 ns/iter (+/- 4,000)
test vec::bench_retain_whole_100000                      ... bench:      42,654 ns/iter (+/- 1,221)

New:

test vec::bench_retain_100000                            ... bench:      41,626 ns/iter (+/- 148)
test vec::bench_retain_iter_100000                       ... bench:      41,319 ns/iter (+/- 1,042)
test vec::bench_retain_whole_100000                      ... bench:      21,833 ns/iter (+/- 164)

Looking much better.

@TennyZhuang
Copy link
Contributor

TennyZhuang commented Dec 4, 2021

Interesting, I don't even know why LLVM can't optimize this code, but the benchmark is very convincing.

@hkratz
Copy link
Contributor

hkratz commented Dec 4, 2021

Performs similar to what I came up with before seeing that you claimed the issue:

    #[unstable(feature = "vec_retain_mut", issue = "90829")]
    pub fn retain_mut<F>(&mut self, mut f: F)
    where
        F: FnMut(&mut T) -> bool,
    {
        let original_len = self.len();
        // Avoid double drop if the drop guard is not executed,
        // since we may make some holes during the process.
        unsafe { self.set_len(0) };

        // Vec: [Kept, Kept, Hole, Hole, Hole, Hole, Unchecked, Unchecked]
        //      |<-              processed len   ->| ^- next to check
        //                  |<-  deleted cnt     ->|
        //      |<-              original_len                          ->|
        // Kept: Elements which predicate returns true on.
        // Hole: Moved or dropped element slot.
        // Unchecked: Unchecked valid elements.
        //
        // This drop guard will be invoked when predicate or `drop` of element panicked.
        // It shifts unchecked elements to cover holes and `set_len` to the correct length.
        // In cases when predicate and `drop` never panick, it will be optimized out.
        struct BackshiftOnDrop<'a, T, A: Allocator> {
            v: &'a mut Vec<T, A>,
            processed_len: usize,
            deleted_cnt: usize,
            original_len: usize,
        }

        impl<T, A: Allocator> Drop for BackshiftOnDrop<'_, T, A> {
            fn drop(&mut self) {
                if self.deleted_cnt > 0 {
                    // SAFETY: Trailing unchecked items must be valid since we never touch them.
                    unsafe {
                        ptr::copy(
                            self.v.as_ptr().add(self.processed_len),
                            self.v.as_mut_ptr().add(self.processed_len - self.deleted_cnt),
                            self.original_len - self.processed_len,
                        );
                    }
                }
                // SAFETY: After filling holes, all items are in contiguous memory.
                unsafe {
                    self.v.set_len(self.original_len - self.deleted_cnt);
                }
            }
        }

        let mut g = BackshiftOnDrop { v: self, processed_len: 0, deleted_cnt: 0, original_len };

        // check_one returns the current element if it was retained.
        #[inline(always)]
        fn check_one<F, T, A: Allocator>(
            f: &mut F,
            g: &mut BackshiftOnDrop<'_, T, A>,
        ) -> Option<*const T>
        where
            F: FnMut(&mut T) -> bool,
        {
            // SAFETY: Unchecked element must be valid.
            let cur = unsafe { &mut *g.v.as_mut_ptr().add(g.processed_len) };
            let retain = f(cur);
            // Advance early to avoid double drop if `drop_in_place` panics.
            g.processed_len += 1;
            if retain {
                Some(cur)
            } else {
                g.deleted_cnt += 1;
                // SAFETY: We never touch this element again after dropped.
                unsafe { ptr::drop_in_place(cur) };
                None
            }
        }

        // Stage 1: Nothing was deleted.
        while g.processed_len != original_len {
            if check_one(&mut f, &mut g).is_none() {
                break;
            }
        }

        // Stage 2: Some elements were deleted.
        while g.processed_len != original_len {
            if let Some(cur) = check_one(&mut f, &mut g) {
                // SAFETY: `deleted_cnt` > 0, so the hole slot must not overlap with current element.
                // We use copy for move, and never touch this element again.
                unsafe {
                    let hole_slot = g.v.as_mut_ptr().add(g.processed_len - 1 - g.deleted_cnt);
                    ptr::copy_nonoverlapping(cur, hole_slot, 1);
                }
            }
        }

        // All item are processed. This can be optimized to `set_len` by LLVM.
        drop(g);
    }

@bors
Copy link
Contributor

bors commented Dec 4, 2021

☀️ Try build successful - checks-actions
Build commit: 7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5 (7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5)

@rust-timer
Copy link
Collaborator

Queued 7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5 with parent 887999d, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5): comparison url.

Summary: This benchmark run did not return any relevant changes.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 4, 2021
@the8472 the8472 added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Dec 6, 2021
// SAFETY: We never touch this element again after dropped.
unsafe { ptr::drop_in_place(cur) };
// We already advanced the counter.
if DELETED {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write this more clearly? It is a quite complex, imho.

Maybe this would more readable:

match (DELETED, f(cur)){
    (true, true) =>{
          // move current item to freed hole
    },
    (true, false) => {
          // increment cnts, drop in place, 
          // move items
    }
    (false, true) => {
        // increase processed len, do not move because no items dropped yet
    }
    (false, false) => {
        // increase cnts, drop in place, return
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                // SAFETY: We never touch this element again after dropped.
                    g.deleted_cnt += 1;
                unsafe { ptr::drop_in_place(cur) };
                    // SAFETY: We never touch this element again after dropped.
                // We already advanced the counter.
                    unsafe { ptr::drop_in_place(cur) };

This code block would happen on two branches and thus would have to be duplicated.

Well, I guess processed_len can be hoisted out of all the conditionals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think that duplicate code is a big issue because it is very short. For me, control flow is not easy to follow but maybe it is something wrong with me and the code is OK.

// process_one return a bool indicates whether the processing element should be retained.
#[inline(always)]
fn process_one<F, T, A: Allocator, const DELETED: bool>(
fn process_loop<F, T, A: Allocator, const DELETED: bool>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, dropping #[inline] modifier should not be done before New Pass Manager stabilizes because this affects how early function would be inlined.
Also, fn drop can have this modifier too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing I see that fn drop always do a memmove call despite the fact that almost always there is 0 items to copy (this is not true only if panic occurs).
LLVM doesn't do any checks before memmove call: godbolt

Maybe condition must look like if self.deleted_cnt > 0 && self.original_len != self.processed_len {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, dropping #[inline] modifier should not be done before New Pass Manager stabilizes because this affects how early function would be inlined.

Is it relevant though? I think most of the performance comes from the loops themselves being optimized properly, not from inlining the loops into the larger function. They only do unchecked pointer arithmetic anyway, so it's not like they benefit from bounds check elimination.

if DELETED {
continue;
} else {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return shows our intent more clearly.

@joshtriplett
Copy link
Member

r? @rust-lang/libs

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@dtolnay
Copy link
Member

dtolnay commented Dec 15, 2021

@bors r+

@bors
Copy link
Contributor

bors commented Dec 15, 2021

📌 Commit 67180ef has been approved by dtolnay

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 15, 2021
@bors
Copy link
Contributor

bors commented Dec 15, 2021

⌛ Testing commit 67180ef with merge 04a1625e6667c7a3e5e1a0621077d0053f09206c...

@bors
Copy link
Contributor

bors commented Dec 15, 2021

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Dec 15, 2021
@rust-log-analyzer
Copy link
Collaborator

The job x86_64-gnu-llvm-12 failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

---- [ui] ui/cmse-nonsecure/cmse-nonsecure-call/params-on-stack.rs stdout ----
diff of stderr:

1 error: <unknown>:0:0: in function test i32 (i32, i32, i32, i32, i32): call to non-secure function would require passing arguments on stack
+ thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
+   left: `LLVMing`,
+   left: `LLVMing`,
+  right: `Codegenning`', /checkout/compiler/rustc_codegen_ssa/src/back/write.rs:1464:21
+ note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
+ error: internal compiler error: unexpected panic
+ 
+ note: the compiler unexpectedly panicked. this is a bug.
+ 
+ 
Some tests failed in compiletest suite=ui mode=ui host=x86_64-unknown-linux-gnu target=x86_64-unknown-linux-gnu
+ note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md
+ 
+ note: rustc 1.59.0-nightly (04a1625e6 2021-12-15) running on x86_64-unknown-linux-gnu
+ 
+ note: compiler flags: -Z threads=1 -Z ui-testing -Z deduplicate-diagnostics=no -C codegen-units=1 -C prefer-dynamic -C rpath -C debuginfo=0 --crate-type lib
+ query stack during panic:
+ end of query stack
3 error: aborting due to previous error
4 
---
To only update this specific test, also pass `--test-args cmse-nonsecure/cmse-nonsecure-call/params-on-stack.rs`

error: 1 errors occurred comparing output.
status: exit status: 1
command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/ui/cmse-nonsecure/cmse-nonsecure-call/params-on-stack.rs" "-Zthreads=1" "--error-format" "json" "--json" "future-incompat" "-Ccodegen-units=1" "-Zui-testing" "-Zdeduplicate-diagnostics=no" "-C" "prefer-dynamic" "--out-dir" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/cmse-nonsecure/cmse-nonsecure-call/params-on-stack" "-A" "unused" "-Crpath" "-O" "-Cdebuginfo=0" "-Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target" "thumbv8m.main-none-eabi" "--crate-type" "lib" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/cmse-nonsecure/cmse-nonsecure-call/params-on-stack/auxiliary"
------------------------------------------

------------------------------------------
stderr:
stderr:
------------------------------------------
error: <unknown>:0:0: in function test i32 (i32, i32, i32, i32, i32): call to non-secure function would require passing arguments on stack
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `LLVMing`,
  left: `LLVMing`,
 right: `Codegenning`', /checkout/compiler/rustc_codegen_ssa/src/back/write.rs:1464:21

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.
note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md

note: rustc 1.59.0-nightly (04a1625e6 2021-12-15) running on x86_64-unknown-linux-gnu

note: compiler flags: -Z threads=1 -Z ui-testing -Z deduplicate-diagnostics=no -C codegen-units=1 -C prefer-dynamic -C rpath -C debuginfo=0 --crate-type lib
query stack during panic:
end of query stack
error: aborting due to previous error

---
test result: FAILED. 12367 passed; 1 failed; 119 ignored; 0 measured; 0 filtered out; finished in 150.47s



command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/ui" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--suite" "ui" "--mode" "ui" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-12/bin/FileCheck" "--nodejs" "/usr/bin/node" "--host-rustcflags" "-Crpath -O -Cdebuginfo=0  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Cdebuginfo=0  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python3" "--lldb-python" "/usr/bin/python3" "--gdb" "/usr/bin/gdb" "--llvm-version" "12.0.0" "--llvm-components" "aarch64 aarch64asmparser aarch64codegen aarch64desc aarch64disassembler aarch64info aarch64utils aggressiveinstcombine all all-targets amdgpu amdgpuasmparser amdgpucodegen amdgpudesc amdgpudisassembler amdgpuinfo amdgpuutils analysis arm armasmparser armcodegen armdesc armdisassembler arminfo armutils asmparser asmprinter avr avrasmparser avrcodegen avrdesc avrdisassembler avrinfo binaryformat bitreader bitstreamreader bitwriter bpf bpfasmparser bpfcodegen bpfdesc bpfdisassembler bpfinfo cfguard codegen core coroutines coverage debuginfocodeview debuginfodwarf debuginfogsym debuginfomsf debuginfopdb demangle dlltooldriver dwarflinker engine executionengine extensions filecheck frontendopenacc frontendopenmp fuzzmutate globalisel hellonew hexagon hexagonasmparser hexagoncodegen hexagondesc hexagondisassembler hexagoninfo instcombine instrumentation interfacestub interpreter ipo irreader jitlink lanai lanaiasmparser lanaicodegen lanaidesc lanaidisassembler lanaiinfo libdriver lineeditor linker lto mc mca mcdisassembler mcjit mcparser mips mipsasmparser mipscodegen mipsdesc mipsdisassembler mipsinfo mirparser msp430 msp430asmparser msp430codegen msp430desc msp430disassembler msp430info native nativecodegen nvptx nvptxcodegen nvptxdesc nvptxinfo objcarcopts object objectyaml option orcjit orcshared orctargetprocess passes perfjitevents powerpc powerpcasmparser powerpccodegen powerpcdesc powerpcdisassembler powerpcinfo profiledata remarks riscv riscvasmparser riscvcodegen riscvdesc riscvdisassembler riscvinfo runtimedyld scalaropts selectiondag sparc sparcasmparser sparccodegen sparcdesc sparcdisassembler sparcinfo support symbolize systemz systemzasmparser systemzcodegen systemzdesc systemzdisassembler systemzinfo tablegen target textapi transformutils vectorize webassembly webassemblyasmparser webassemblycodegen webassemblydesc webassemblydisassembler webassemblyinfo windowsmanifest x86 x86asmparser x86codegen x86desc x86disassembler x86info xcore xcorecodegen xcoredesc xcoredisassembler xcoreinfo xray" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--channel" "nightly" "--color" "always"


Build completed unsuccessfully in 0:14:49

@the8472
Copy link
Member Author

the8472 commented Dec 15, 2021

Seems to be a spurious failure that occasionally also hits other PRs.

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 15, 2021
@bors
Copy link
Contributor

bors commented Dec 16, 2021

⌛ Testing commit 67180ef with merge a090c86...

@bors
Copy link
Contributor

bors commented Dec 16, 2021

☀️ Test successful - checks-actions
Approved by: dtolnay
Pushing a090c86 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 16, 2021
@bors bors merged commit a090c86 into rust-lang:master Dec 16, 2021
@rustbot rustbot added this to the 1.59.0 milestone Dec 16, 2021
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (a090c86): comparison url.

Summary: This change led to large relevant mixed results 🤷 in compiler performance.

  • Large improvement in instruction counts (up to -4.4% on incr-unchanged builds of unicode_normalization)
  • Large regression in instruction counts (up to 4.1% on incr-patched: println builds of regression-31157)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Dec 16, 2021
@rylev
Copy link
Member

rylev commented Dec 21, 2021

This seems likely to be noise as a result of the issue tracked in rust-lang/rustc-perf#1126 -- marking as triaged, not something we should address directly.

@rustbot label +perf-regression-triaged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vec::retain() is significantly slower than into_iter().filter().collect()