Specialize iter::Chain<A, B>::next when A==B #107701

Closed
wants to merge 6 commits into from

the8472
Member

@the8472 the8472 commented Feb 5, 2023

This improves external iteration for Chain where both sides have the same type.

 iter::bench_chain_partial_cmp         337,514         338,658                  1,144    0.34%   x 1.00
 iter::bench_enumerate_chain_ref_sum   3,834,642       1,911,073           -1,923,569  -50.16%   x 2.01
 iter::bench_enumerate_chain_sum       2,393,149       2,405,145               11,996    0.50%   x 1.00
 iter::bench_filter_chain_count        1,769,977       2,020,882              250,905   14.18%   x 0.88
 iter::bench_filter_chain_ref_count    3,961,136       1,911,154           -2,049,982  -51.75%   x 2.07
 iter::bench_filter_chain_ref_sum      1,617,944       1,683,379               65,435    4.04%   x 0.96
 iter::bench_filter_chain_sum          1,663,256       2,104,894              441,638   26.55%   x 0.79
 iter::bench_filter_map_chain_ref_sum  3,646,401       3,638,193               -8,208   -0.23%   x 1.00
 iter::bench_filter_map_chain_sum      1,663,237       1,901,151              237,914   14.30%   x 0.87
 iter::bench_flat_map_chain_ref_sum    12,055,514      13,475,988           1,420,474   11.78%   x 0.89
 iter::bench_flat_map_chain_sum        1,900,258       1,912,229               11,971    0.63%   x 0.99
 iter::bench_for_each_chain_fold       2,205,688       2,218,132               12,444    0.56%   x 0.99
 iter::bench_for_each_chain_loop       4,110,058       2,471,585           -1,638,473  -39.86%   x 1.66
 iter::bench_for_each_chain_ref_fold   4,169,257       1,434,176           -2,735,081  -65.60%   x 2.91
 iter::bench_fuse_chain_ref_sum        3,811,904       3,874,261               62,357    1.64%   x 0.98
 iter::bench_fuse_chain_sum            473,010         476,363                  3,353    0.71%   x 0.99
 iter::bench_inspect_chain_ref_sum     3,981,137       1,075,046           -2,906,091  -73.00%   x 3.70
 iter::bench_inspect_chain_sum         474,636         475,986                  1,350    0.28%   x 1.00
 iter::bench_peekable_chain_ref_sum    3,932,644       1,433,588           -2,499,056  -63.55%   x 2.74
 iter::bench_peekable_chain_sum        475,751         475,416                   -335   -0.07%   x 1.00
 iter::bench_skip_chain_ref_sum        3,930,088       1,433,293           -2,496,795  -63.53%   x 2.74
 iter::bench_skip_chain_sum            478,383         474,852                 -3,531   -0.74%   x 1.01
 iter::bench_skip_while_chain_ref_sum  4,634,693       4,483,621             -151,072   -3.26%   x 1.03
 iter::bench_skip_while_chain_sum      477,349         464,687                -12,662   -2.65%   x 1.03
 iter::bench_slice_chain_ref_sum       842             492                       -350  -41.57%   x 1.71
 iter::bench_slice_chain_sum           443             338                       -105  -23.70%   x 1.31
 iter::bench_take_while_chain_ref_sum  2,194,655       2,126,616              -68,039   -3.10%   x 1.03
 iter::bench_take_while_chain_sum      1,732,017       1,213,209             -518,808  -29.95%   x 1.43

Similar to the optimization in vec_deque::Iter::next
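
As a rough illustration of the idea (not the PR's actual diff), here is a minimal sketch of a same-type chain whose `next` only ever polls the front slot and moves the back iterator forward once the front runs dry. `SameChain`, its fields, and `new` are hypothetical names; the real `Chain` may differ in details such as fusing behavior.

```rust
/// Hypothetical stand-in for `Chain<A, B>` restricted to `A == B`.
/// Not the standard library's type, only a sketch of the swap idea.
struct SameChain<I> {
    a: Option<I>,
    b: Option<I>,
}

impl<I> SameChain<I> {
    fn new(a: I, b: I) -> Self {
        SameChain { a: Some(a), b: Some(b) }
    }
}

impl<I: Iterator> Iterator for SameChain<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Hot path: only the front slot is ever polled.
        if let Some(item) = self.a.as_mut()?.next() {
            return Some(item);
        }
        // Front is exhausted. Because both halves have the same type,
        // the back iterator can simply be moved into the front slot.
        self.a = self.b.take();
        self.a.as_mut()?.next()
    }
}
```

Once the second half is also exhausted, `a` becomes `None` and every later call returns `None` without touching either iterator.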

@rustbot
Collaborator

rustbot commented Feb 5, 2023

r? @cuviper

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 5, 2023
@the8472
Member Author

the8472 commented Feb 5, 2023

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 5, 2023
@bors
Contributor

bors commented Feb 5, 2023

⌛ Trying commit 68ce47f with merge 6b2bec53c7983506be629a98e73fc510a2eb956f...

@Sp00ph
Member

Sp00ph commented Feb 5, 2023

Fat-fingered Ctrl+Enter on my previous comment 😅. Very interesting that, for slice iters, the old approach was apparently up to 4x slower than this swapping approach.

@bors
Contributor

bors commented Feb 5, 2023

☀️ Try build successful - checks-actions
Build commit: 6b2bec53c7983506be629a98e73fc510a2eb956f (6b2bec53c7983506be629a98e73fc510a2eb956f)

@rust-timer
Collaborator

Finished benchmarking commit (6b2bec53c7983506be629a98e73fc510a2eb956f): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean   range           count
Regressions ❌ (primary)      -      -               0
Regressions ❌ (secondary)    5.8%   [1.7%, 11.1%]   6
Improvements ✅ (primary)     -      -               0
Improvements ✅ (secondary)   -      -               0
All ❌✅ (primary)             -      -               0

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean   range           count
Regressions ❌ (primary)      2.0%   [2.0%, 2.0%]    1
Regressions ❌ (secondary)    -      -               0
Improvements ✅ (primary)     -      -               0
Improvements ✅ (secondary)   -      -               0
All ❌✅ (primary)             2.0%   [2.0%, 2.0%]    1

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean   range           count
Regressions ❌ (primary)      -      -               0
Regressions ❌ (secondary)    6.2%   [2.1%, 10.8%]   5
Improvements ✅ (primary)     -      -               0
Improvements ✅ (secondary)   -      -               0
All ❌✅ (primary)             -      -               0

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 6, 2023
@cuviper
Member

cuviper commented Feb 18, 2023

It's disappointing that the huge benchmark improvement didn't bear out in the perf results -- maybe we're too good at using internal iteration for our chains. The regressed deeply-nested-multi has a 16-deep Chain, so I guess it makes sense that there's more work now to compile that, but at least it doesn't appear to have an exponential scaling problem like that test was originally written for.
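
To make the internal/external distinction concrete, here is a small sketch with two hypothetical helpers, `sum_external` and `sum_internal`. A `for` loop drives the chain through repeated `next()` calls, which is the path this PR touches, while `fold` hands control to `Chain`'s own `fold` override, which can process each half in turn without a per-item check of which half is active.

```rust
/// External iteration: the caller pulls items one at a time via `next()`.
fn sum_external(a: &[u64], b: &[u64]) -> u64 {
    let mut total = 0;
    for x in a.iter().chain(b.iter()) {
        total += *x;
    }
    total
}

/// Internal iteration: the iterator pushes items into a closure via `fold`.
fn sum_internal(a: &[u64], b: &[u64]) -> u64 {
    a.iter().chain(b.iter()).fold(0, |acc, x| acc + *x)
}
```

Both return the same value; only the external form depends on how cheap `Chain::next` is, which may be why code that mostly uses adapters like `fold`, `sum`, or `for_each` sees little change.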

@cuviper
Member

cuviper commented Feb 21, 2023

> Similar to the optimization in vec_deque::Iter::next

How about a similar next_back as well?

@the8472
Member Author

the8472 commented Feb 22, 2023

> How about a similar next_back as well?

done
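
One way a mirrored `next_back` could look is sketched below, assuming the same swap idea is applied from the back. `SameChainDe` and its fields are hypothetical, and the code in the PR may be structured differently. The point of the sketch is that once either end has moved the other half into its own slot, the opposite method needs a fallback so no items are lost.

```rust
/// Hypothetical same-type chain that supports iteration from both ends.
struct SameChainDe<I> {
    a: Option<I>,
    b: Option<I>,
}

impl<I: Iterator> Iterator for SameChainDe<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        match self.a.as_mut() {
            Some(a) => {
                if let Some(item) = a.next() {
                    return Some(item);
                }
                // Front exhausted: move the back half into the front slot.
                self.a = self.b.take();
                self.a.as_mut()?.next()
            }
            // `next_back` already moved the front half into `b`.
            None => self.b.as_mut()?.next(),
        }
    }
}

impl<I: DoubleEndedIterator> DoubleEndedIterator for SameChainDe<I> {
    fn next_back(&mut self) -> Option<I::Item> {
        match self.b.as_mut() {
            Some(b) => {
                if let Some(item) = b.next_back() {
                    return Some(item);
                }
                // Back exhausted: move the front half into the back slot.
                self.b = self.a.take();
                self.b.as_mut()?.next_back()
            }
            // `next` already moved the back half into `a`.
            None => self.a.as_mut()?.next_back(),
        }
    }
}
```

The fallback arms may poll an already exhausted inner iterator again, so this sketch assumes the inner iterators keep returning `None` after finishing, as slice and `Vec` iterators do.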

@cuviper
Member

cuviper commented Feb 22, 2023

I tried to reproduce your benchmark results, but it doesn't look as good here:

$ cargo benchcmp --threshold 2 bench1-7a98053 bench4-beb9614
 name                                  bench1-7a98053 ns/iter  bench4-beb9614 ns/iter  diff ns/iter   diff %  speedup
 iter::bench_enumerate_chain_ref_sum   1,234,866               1,699,550                    464,684   37.63%   x 0.73
 iter::bench_filter_chain_ref_count    1,305,865               1,797,868                    492,003   37.68%   x 0.73
 iter::bench_filter_map_chain_sum      881,822                 1,047,292                    165,470   18.76%   x 0.84
 iter::bench_for_each_chain_loop       1,284,725               1,696,164                    411,439   32.03%   x 0.76
 iter::bench_for_each_chain_ref_fold   1,285,018               1,696,214                    411,196   32.00%   x 0.76
 iter::bench_fuse_chain_ref_sum        1,712,768               2,150,406                    437,638   25.55%   x 0.80
 iter::bench_inspect_chain_ref_sum     1,226,965               1,696,249                    469,284   38.25%   x 0.72
 iter::bench_peekable_chain_ref_sum    1,208,392               1,696,345                    487,953   40.38%   x 0.71
 iter::bench_skip_chain_ref_sum        1,284,179               1,695,163                    410,984   32.00%   x 0.76
 iter::bench_skip_chain_sum            635,490                 427,988                     -207,502  -32.65%   x 1.48
 iter::bench_slice_chain_ref_sum       483                     310                             -173  -35.82%   x 1.56
 iter::bench_slice_chain_sum           270                     362                               92   34.07%   x 0.75
 iter::bench_take_while_chain_ref_sum  914,385                 714,054                     -200,331  -21.91%   x 1.28
 iter::bench_take_while_chain_sum      235,815                 451,774                      215,959   91.58%   x 0.52

Any idea why we would see it so differently?

One possibility is that I have [rust] codegen-units-std = 1, similar to the official rustup builds, which can definitely help optimizations. I wonder if the swaps hurt LLVM's view of this code.
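
For readers comparing the tables, the derived columns follow from the two timings in a way that is consistent with every row shown: the difference is new minus old, the percentage is relative to the old time, and the speedup is old divided by new. The helper below is a hypothetical illustration of that arithmetic, not code from `cargo benchcmp`.

```rust
/// Returns (diff ns/iter, diff %, speedup) for one benchmark row.
/// A negative diff or a speedup above 1.0 means the second run was faster.
fn compare(old_ns: u64, new_ns: u64) -> (i64, f64, f64) {
    let diff = new_ns as i64 - old_ns as i64;
    let diff_pct = diff as f64 / old_ns as f64 * 100.0;
    let speedup = old_ns as f64 / new_ns as f64;
    (diff, diff_pct, speedup)
}
```

For `bench_take_while_chain_sum` above (235,815 to 451,774 ns/iter) this gives a diff of 215,959, about 91.58%, and a speedup of about x 0.52, matching the table.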

@the8472
Member Author

the8472 commented Feb 22, 2023

I went through several combinations of flags and can't repro your result. I'm running stage0 benchmarks.

incremental=false, lld, 1CGU, lto=fat, target-cpu=native (znver2), debuginfo=2
 name                                  base.b ns/iter  branch.b ns/iter  diff ns/iter   diff %  speedup 
 iter::bench_enumerate_chain_ref_sum   3,812,059       2,136,600           -1,675,459  -43.95%   x 1.78 
 iter::bench_filter_chain_count        1,052,272       978,484                -73,788   -7.01%   x 1.08 
 iter::bench_filter_chain_ref_count    3,861,390       2,494,123           -1,367,267  -35.41%   x 1.55 
 iter::bench_filter_map_chain_ref_sum  2,500,364       2,198,953             -301,411  -12.05%   x 1.14 
 iter::bench_filter_map_chain_sum      1,436,488       950,107               -486,381  -33.86%   x 1.51 
 iter::bench_flat_map_chain_ref_sum    5,034,028       6,778,491            1,744,463   34.65%   x 0.74 
 iter::bench_flat_map_chain_sum        479,411         463,715                -15,696   -3.27%   x 1.03 
 iter::bench_for_each_chain_fold       475,367         691,165                215,798   45.40%   x 0.69 
 iter::bench_for_each_chain_loop       3,846,909       1,847,520           -1,999,389  -51.97%   x 2.08 
 iter::bench_for_each_chain_ref_fold   3,978,042       2,046,451           -1,931,591  -48.56%   x 1.94 
 iter::bench_fuse_chain_ref_sum        3,830,958       1,900,582           -1,930,376  -50.39%   x 2.02 
 iter::bench_fuse_chain_sum            475,206         708,282                233,076   49.05%   x 0.67 
 iter::bench_inspect_chain_ref_sum     3,980,442       2,052,716           -1,927,726  -48.43%   x 1.94 
 iter::bench_inspect_chain_sum         713,546         474,183               -239,363  -33.55%   x 1.50 
 iter::bench_peekable_chain_ref_sum    3,992,089       2,055,739           -1,936,350  -48.50%   x 1.94 
 iter::bench_peekable_chain_sum        712,646         472,321               -240,325  -33.72%   x 1.51 
 iter::bench_skip_chain_ref_sum        3,987,803       2,370,522           -1,617,281  -40.56%   x 1.68 
 iter::bench_skip_while_chain_sum      474,474         712,438                237,964   50.15%   x 0.67 
 iter::bench_slice_chain_ref_sum       571             410                       -161  -28.20%   x 1.39 
 iter::bench_slice_chain_sum           472             484                         12    2.54%   x 0.98 
 iter::bench_take_while_chain_ref_sum  1,059,606       928,730               -130,876  -12.35%   x 1.14 
 iter::bench_take_while_chain_sum      261,789         501,199                239,410   91.45%   x 0.52 
incremental=false, lld, 1CGU, lto=fat, target-cpu=default, debuginfo=2
 name                                  base.b ns/iter  branch.b ns/iter  diff ns/iter   diff %  speedup 
 iter::bench_chain_partial_cmp         101,526         98,963                  -2,563   -2.52%   x 1.03 
 iter::bench_enumerate_chain_ref_sum   3,819,937       2,088,350           -1,731,587  -45.33%   x 1.83 
 iter::bench_enumerate_chain_sum       909,045         932,297                 23,252    2.56%   x 0.98 
 iter::bench_filter_chain_count        1,072,605       953,229               -119,376  -11.13%   x 1.13 
 iter::bench_filter_chain_ref_count    3,963,479       2,322,925           -1,640,554  -41.39%   x 1.71 
 iter::bench_filter_chain_ref_sum      1,711,480       1,612,636              -98,844   -5.78%   x 1.06 
 iter::bench_filter_chain_sum          1,258,180       1,135,219             -122,961   -9.77%   x 1.11 
 iter::bench_filter_map_chain_ref_sum  2,146,493       2,223,921               77,428    3.61%   x 0.97 
 iter::bench_flat_map_chain_ref_sum    4,753,742       6,936,048            2,182,306   45.91%   x 0.69 
 iter::bench_for_each_chain_fold       472,987         695,519                222,532   47.05%   x 0.68 
 iter::bench_for_each_chain_loop       3,872,177       1,856,980           -2,015,197  -52.04%   x 2.09 
 iter::bench_for_each_chain_ref_fold   3,907,860       2,009,307           -1,898,553  -48.58%   x 1.94 
 iter::bench_fuse_chain_ref_sum        3,814,276       2,320,439           -1,493,837  -39.16%   x 1.64 
 iter::bench_inspect_chain_ref_sum     3,889,967       2,010,567           -1,879,400  -48.31%   x 1.93 
 iter::bench_inspect_chain_sum         473,616         463,739                 -9,877   -2.09%   x 1.02 
 iter::bench_peekable_chain_ref_sum    3,902,686       2,012,245           -1,890,441  -48.44%   x 1.94 
 iter::bench_peekable_chain_sum        472,299         694,289                221,990   47.00%   x 0.68 
 iter::bench_skip_chain_ref_sum        3,909,425       2,083,966           -1,825,459  -46.69%   x 1.88 
 iter::bench_skip_chain_sum            709,471         463,708               -245,763  -34.64%   x 1.53 
 iter::bench_skip_while_chain_sum      473,519         695,629                222,110   46.91%   x 0.68 
 iter::bench_slice_chain_ref_sum       573             426                       -147  -25.65%   x 1.35 
 iter::bench_slice_chain_sum           534             517                        -17   -3.18%   x 1.03 
 iter::bench_take_while_chain_ref_sum  924,697         1,056,233              131,536   14.22%   x 0.88 
 iter::bench_take_while_chain_sum      263,289         500,973                237,684   90.27%   x 0.53 
incremental=false, lld, 1CGU, lto=default, target-cpu=default, debuginfo=2
 name                                  base.b ns/iter  branch.b ns/iter  diff ns/iter   diff %  speedup 
 iter::bench_enumerate_chain_ref_sum   3,823,719       2,139,579           -1,684,140  -44.04%   x 1.79 
 iter::bench_enumerate_chain_sum       907,408         952,363                 44,955    4.95%   x 0.95 
 iter::bench_filter_chain_count        1,067,514       978,697                -88,817   -8.32%   x 1.09 
 iter::bench_filter_chain_ref_count    3,811,390       2,375,711           -1,435,679  -37.67%   x 1.60 
 iter::bench_filter_chain_ref_sum      1,586,041       1,643,702               57,661    3.64%   x 0.96 
 iter::bench_filter_chain_sum          1,546,486       1,347,739             -198,747  -12.85%   x 1.15 
 iter::bench_filter_map_chain_ref_sum  2,199,498       2,282,112               82,614    3.76%   x 0.96 
 iter::bench_flat_map_chain_ref_sum    4,892,679       7,087,349            2,194,670   44.86%   x 0.69 
 iter::bench_flat_map_chain_sum        475,088         485,583                 10,495    2.21%   x 0.98 
 iter::bench_for_each_chain_fold       473,205         710,934                237,729   50.24%   x 0.67 
 iter::bench_for_each_chain_loop       3,859,384       1,899,192           -1,960,192  -50.79%   x 2.03 
 iter::bench_for_each_chain_ref_fold   3,905,280       2,051,683           -1,853,597  -47.46%   x 1.90 
 iter::bench_fuse_chain_ref_sum        3,814,542       2,320,531           -1,494,011  -39.17%   x 1.64 
 iter::bench_fuse_chain_sum            711,579         695,370                -16,209   -2.28%   x 1.02 
 iter::bench_inspect_chain_ref_sum     3,916,714       2,038,120           -1,878,594  -47.96%   x 1.92 
 iter::bench_inspect_chain_sum         474,534         582,596                108,062   22.77%   x 0.81 
 iter::bench_peekable_chain_ref_sum    3,943,346       2,004,815           -1,938,531  -49.16%   x 1.97 
 iter::bench_peekable_chain_sum        477,022         712,230                235,208   49.31%   x 0.67 
 iter::bench_skip_chain_ref_sum        3,928,437       2,084,851           -1,843,586  -46.93%   x 1.88 
 iter::bench_skip_chain_sum            713,096         471,669               -241,427  -33.86%   x 1.51 
 iter::bench_skip_while_chain_sum      474,148         711,859                237,711   50.13%   x 0.67 
 iter::bench_slice_chain_ref_sum       569             423                       -146  -25.66%   x 1.35 
 iter::bench_slice_chain_sum           535             513                        -22   -4.11%   x 1.04 
 iter::bench_take_while_chain_ref_sum  923,607         1,055,911              132,304   14.32%   x 0.87 
 iter::bench_take_while_chain_sum      262,783         501,484                238,701   90.84%   x 0.52 
incremental=true, lld, 1CGU, lto=default, target-cpu=default, debuginfo=2
 name                                  base.b ns/iter  branch.b ns/iter  diff ns/iter   diff %  speedup 
 iter::bench_enumerate_chain_ref_sum   3,818,018       2,138,539           -1,679,479  -43.99%   x 1.79 
 iter::bench_enumerate_chain_sum       909,676         952,527                 42,851    4.71%   x 0.96 
 iter::bench_filter_chain_ref_count    3,902,358       2,374,957           -1,527,401  -39.14%   x 1.64 
 iter::bench_filter_chain_ref_sum      1,677,490       1,712,314               34,824    2.08%   x 0.98 
 iter::bench_flat_map_chain_ref_sum    4,886,916       6,985,085            2,098,169   42.93%   x 0.70 
 iter::bench_for_each_chain_fold       475,880         710,765                234,885   49.36%   x 0.67 
 iter::bench_for_each_chain_loop       3,887,758       1,897,118           -1,990,640  -51.20%   x 2.05 
 iter::bench_for_each_chain_ref_fold   3,934,603       2,052,074           -1,882,529  -47.85%   x 1.92 
 iter::bench_fuse_chain_ref_sum        3,854,684       2,391,427           -1,463,257  -37.96%   x 1.61 
 iter::bench_fuse_chain_sum            712,465         473,347               -239,118  -33.56%   x 1.51 
 iter::bench_inspect_chain_ref_sum     3,909,068       2,058,387           -1,850,681  -47.34%   x 1.90 
 iter::bench_peekable_chain_ref_sum    3,910,611       2,062,093           -1,848,518  -47.27%   x 1.90 
 iter::bench_peekable_chain_sum        473,235         711,775                238,540   50.41%   x 0.66 
 iter::bench_skip_chain_ref_sum        3,908,637       2,131,494           -1,777,143  -45.47%   x 1.83 
 iter::bench_slice_chain_ref_sum       509             425                        -84  -16.50%   x 1.20 
 iter::bench_slice_chain_sum           517             536                         19    3.68%   x 0.96 
 iter::bench_take_while_chain_ref_sum  1,293,233       933,467               -359,766  -27.82%   x 1.39 
 iter::bench_take_while_chain_sum      290,205         501,105                210,900   72.67%   x 0.58 
incremental=true, lld, CGUs=default (256), lto=default, target-cpu=default, debuginfo=2
 name                                  base.b ns/iter  branch.b ns/iter  diff ns/iter   diff %  speedup 
 iter::bench_enumerate_chain_ref_sum   3,839,242       1,915,007           -1,924,235  -50.12%   x 2.00 
 iter::bench_filter_chain_ref_count    3,911,678       1,912,233           -1,999,445  -51.11%   x 2.05 
 iter::bench_flat_map_chain_ref_sum    12,033,827      13,600,592           1,566,765   13.02%   x 0.88 
 iter::bench_for_each_chain_loop       4,004,310       2,479,911           -1,524,399  -38.07%   x 1.61 
 iter::bench_for_each_chain_ref_fold   3,923,257       1,900,377           -2,022,880  -51.56%   x 2.06 
 iter::bench_fuse_chain_ref_sum        3,744,357       3,871,837              127,480    3.40%   x 0.97 
 iter::bench_fuse_chain_sum            464,936         479,030                 14,094    3.03%   x 0.97 
 iter::bench_inspect_chain_ref_sum     3,831,433       1,114,836           -2,716,597  -70.90%   x 3.44 
 iter::bench_peekable_chain_ref_sum    3,893,717       1,433,334           -2,460,383  -63.19%   x 2.72 
 iter::bench_skip_chain_ref_sum        3,896,135       1,113,843           -2,782,292  -71.41%   x 3.50 
 iter::bench_skip_while_chain_ref_sum  4,698,896       4,889,215              190,319    4.05%   x 0.96 
 iter::bench_slice_chain_ref_sum       839             506                       -333  -39.69%   x 1.66 

Maybe it's something CPU-specific. I'll run a single benchmark under perf stat tomorrow to see what makes it faster.

@cuviper
Member

cuviper commented Feb 22, 2023

> I'm running stage0 benchmarks.

I'm running the default (stage 1) on Fedora 37, Ryzen 7 5800X. My config:

# Includes one of the default files in src/bootstrap/defaults
profile = "compiler"
changelog-seen = 2

[rust]
debug-assertions = true
codegen-units-std = 1
verbose-tests = true

[llvm]
assertions = true

I tried without assertions too, and it didn't look much different.

@cuviper
Member

cuviper commented Feb 23, 2023

Here's what I got for the same config on --stage 0:

$ cargo benchcmp --threshold 2 stage0_*
 name                                 stage0_1-7a98053 ns/iter  stage0_4-beb9614 ns/iter  diff ns/iter   diff %  speedup
 iter::bench_chain_partial_cmp        74,959                    63,734                         -11,225  -14.97%   x 1.18
 iter::bench_enumerate_chain_ref_sum  1,201,570                 2,087,672                      886,102   73.75%   x 0.58
 iter::bench_enumerate_chain_sum      668,359                   758,861                         90,502   13.54%   x 0.88
 iter::bench_filter_chain_ref_count   1,306,004                 1,793,094                      487,090   37.30%   x 0.73
 iter::bench_filter_map_chain_sum     1,038,151                 891,097                       -147,054  -14.16%   x 1.17
 iter::bench_flat_map_chain_ref_sum   6,476,104                 5,009,639                   -1,466,465  -22.64%   x 1.29
 iter::bench_for_each_chain_loop      1,284,855                 1,475,270                      190,415   14.82%   x 0.87
 iter::bench_for_each_chain_ref_fold  1,217,257                 1,678,344                      461,087   37.88%   x 0.73
 iter::bench_fuse_chain_ref_sum       2,132,799                 1,712,501                     -420,298  -19.71%   x 1.25
 iter::bench_inspect_chain_ref_sum    1,453,806                 1,699,276                      245,470   16.88%   x 0.86
 iter::bench_peekable_chain_ref_sum   1,233,076                 1,677,335                      444,259   36.03%   x 0.74
 iter::bench_skip_chain_ref_sum       1,459,566                 1,690,239                      230,673   15.80%   x 0.86
 iter::bench_slice_chain_ref_sum      491                       303                               -188  -38.29%   x 1.62
 iter::bench_take_while_chain_sum     449,027                   258,876                       -190,151  -42.35%   x 1.73

@shepmaster
Member

> Maybe it's something CPU-specific

FWIW, when @seritools and I were working on the issue that spawned this idea, we were also doing some benchmarks (I have an Apple M1 Max, they have an Intel i7 12700k) and the same code had pretty drastic benchmark differences, so I would encourage investigating this avenue.

@Sp00ph
Member

Sp00ph commented Feb 23, 2023

I've heard before that the AMD and Apple branch predictors aren't as good as Intel's. Given that this optimization is all about making the first branch as hot as possible, could it just be that CPUs with worse branch predictors can take less advantage of the better predictability?
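
For contrast with the swapping approach discussed in this thread, here is a hedged sketch of what a plain two-slot chain roughly looks like. `PlainChain` is a hypothetical type, not the standard library's implementation. It shows the branch pattern the hypothesis is about: once the front half is done, every call first tests the empty front slot and only then reaches the back half.

```rust
/// Hypothetical unspecialized chain; the two halves may have different types.
struct PlainChain<A, B> {
    a: Option<A>,
    b: Option<B>,
}

impl<A, B> Iterator for PlainChain<A, B>
where
    A: Iterator,
    B: Iterator<Item = A::Item>,
{
    type Item = A::Item;

    fn next(&mut self) -> Option<A::Item> {
        // Branch 1: is the front half still live?
        if let Some(a) = self.a.as_mut() {
            if let Some(item) = a.next() {
                return Some(item);
            }
            // Front half is done; clear it so it is not polled again.
            self.a = None;
        }
        // Branch 2: reached on every call once the front half is done.
        self.b.as_mut()?.next()
    }
}
```

Whether keeping a single hot branch pays off on a given CPU is exactly what the differing benchmark results leave open.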

@cuviper
Member

cuviper commented Feb 23, 2023

For an apples-to-apples comparison, here are my x86_64-unknown-linux-gnu binaries:
rust-107701-corebenches.tar.gz

And here are those exact binaries compared on a few machines that I have readily available:

AMD Ryzen 7 5800X, Fedora 37
 iter::bench_enumerate_chain_ref_sum   1,242,110                   1,679,974                        437,864   35.25%   x 0.74
 iter::bench_filter_chain_ref_count    1,315,806                   1,771,386                        455,580   34.62%   x 0.74
 iter::bench_filter_map_chain_sum      890,763                     1,028,536                        137,773   15.47%   x 0.87
 iter::bench_flat_map_chain_ref_sum    4,974,809                   4,668,213                       -306,596   -6.16%   x 1.07
 iter::bench_flat_map_chain_sum        432,048                     423,241                           -8,807   -2.04%   x 1.02
 iter::bench_for_each_chain_loop       1,295,276                   1,676,095                        380,819   29.40%   x 0.77
 iter::bench_for_each_chain_ref_fold   1,296,301                   1,671,305                        375,004   28.93%   x 0.78
 iter::bench_fuse_chain_ref_sum        1,727,381                   2,116,009                        388,628   22.50%   x 0.82
 iter::bench_inspect_chain_ref_sum     1,239,733                   1,672,703                        432,970   34.92%   x 0.74
 iter::bench_peekable_chain_ref_sum    1,228,759                   1,672,130                        443,371   36.08%   x 0.73
 iter::bench_skip_chain_ref_sum        1,296,042                   1,670,282                        374,240   28.88%   x 0.78
 iter::bench_skip_chain_sum            641,998                     421,507                         -220,491  -34.34%   x 1.52
 iter::bench_slice_chain_ref_sum       490                         302                                 -188  -38.37%   x 1.62
 iter::bench_slice_chain_sum           273                         353                                   80   29.30%   x 0.77
 iter::bench_take_while_chain_ref_sum  923,947                     704,491                         -219,456  -23.75%   x 1.31
 iter::bench_take_while_chain_sum      237,742                     445,480                          207,738   87.38%   x 0.53
Intel Core i7-7700K, Fedora 37
 iter::bench_chain_partial_cmp         111,704                     92,251                           -19,453  -17.41%   x 1.21
 iter::bench_enumerate_chain_ref_sum   2,154,102                   3,117,171                        963,069   44.71%   x 0.69
 iter::bench_filter_chain_ref_count    2,348,788                   2,783,466                        434,678   18.51%   x 0.84
 iter::bench_flat_map_chain_ref_sum    5,655,187                   5,389,638                       -265,549   -4.70%   x 1.05
 iter::bench_flat_map_chain_sum        487,092                     668,009                          180,917   37.14%   x 0.73
 iter::bench_for_each_chain_loop       2,678,809                   3,117,192                        438,383   16.36%   x 0.86
 iter::bench_for_each_chain_ref_fold   2,456,351                   2,672,375                        216,024    8.79%   x 0.92
 iter::bench_fuse_chain_ref_sum        3,347,854                   3,118,133                       -229,721   -6.86%   x 1.07
 iter::bench_inspect_chain_ref_sum     2,221,556                   2,375,563                        154,007    6.93%   x 0.94
 iter::bench_peekable_chain_ref_sum    2,449,542                   2,672,033                        222,491    9.08%   x 0.92
 iter::bench_skip_chain_ref_sum        2,449,493                   3,115,787                        666,294   27.20%   x 0.79
 iter::bench_skip_chain_sum            890,173                     631,448                         -258,725  -29.06%   x 1.41
 iter::bench_skip_while_chain_ref_sum  3,116,149                   3,338,600                        222,451    7.14%   x 0.93
 iter::bench_slice_chain_ref_sum       808                         472                                 -336  -41.58%   x 1.71
 iter::bench_slice_chain_sum           376                         433                                   57   15.16%   x 0.87
 iter::bench_take_while_chain_ref_sum  1,707,072                   1,657,705                        -49,367   -2.89%   x 1.03
 iter::bench_take_while_chain_sum      350,676                     494,813                          144,137   41.10%   x 0.71
Intel Core i7-9850H, CentOS Stream 9
 iter::bench_chain_partial_cmp         111,676                     99,085                           -12,591  -11.27%   x 1.13
 iter::bench_enumerate_chain_ref_sum   2,130,345                   3,133,442                      1,003,097   47.09%   x 0.68
 iter::bench_filter_chain_ref_count    2,355,757                   2,818,541                        462,784   19.64%   x 0.84
 iter::bench_flat_map_chain_ref_sum    5,666,474                   5,492,319                       -174,155   -3.07%   x 1.03
 iter::bench_flat_map_chain_sum        458,428                     671,982                          213,554   46.58%   x 0.68
 iter::bench_for_each_chain_loop       2,683,672                   3,154,062                        470,390   17.53%   x 0.85
 iter::bench_for_each_chain_ref_fold   2,463,276                   2,687,455                        224,179    9.10%   x 0.92
 iter::bench_fuse_chain_ref_sum        3,363,454                   3,129,435                       -234,019   -6.96%   x 1.07
 iter::bench_inspect_chain_ref_sum     2,230,394                   2,390,691                        160,297    7.19%   x 0.93
 iter::bench_peekable_chain_ref_sum    2,485,668                   2,683,189                        197,521    7.95%   x 0.93
 iter::bench_skip_chain_ref_sum        2,466,171                   3,128,535                        662,364   26.86%   x 0.79
 iter::bench_skip_chain_sum            895,557                     644,451                         -251,106  -28.04%   x 1.39
 iter::bench_skip_while_chain_ref_sum  3,132,032                   3,359,521                        227,489    7.26%   x 0.93
 iter::bench_slice_chain_ref_sum       813                         473                                 -340  -41.82%   x 1.72
 iter::bench_slice_chain_sum           382                         447                                   65   17.02%   x 0.85
 iter::bench_take_while_chain_ref_sum  1,727,334                   1,664,932                        -62,402   -3.61%   x 1.04
 iter::bench_take_while_chain_sum      356,135                     495,664                          139,529   39.18%   x 0.72
Intel Xeon Platinum 8370C, Ubuntu 22.04 (dev-desktop-us-2.infra.rust-lang.org)
 iter::bench_enumerate_chain_ref_sum   3,951,511                   3,209,429                       -742,082  -18.78%   x 1.23
 iter::bench_filter_chain_ref_count    4,490,260                   3,519,962                       -970,298  -21.61%   x 1.28
 iter::bench_flat_map_chain_ref_sum    5,662,002                   6,968,453                      1,306,451   23.07%   x 0.81
 iter::bench_for_each_chain_fold       1,017,147                   1,056,595                         39,448    3.88%   x 0.96
 iter::bench_for_each_chain_loop       4,682,939                   2,768,332                     -1,914,607  -40.88%   x 1.69
 iter::bench_for_each_chain_ref_fold   2,583,938                   2,998,971                        415,033   16.06%   x 0.86
 iter::bench_fuse_chain_ref_sum        5,504,732                   3,561,149                     -1,943,583  -35.31%   x 1.55
 iter::bench_peekable_chain_ref_sum    3,325,426                   2,999,057                       -326,369   -9.81%   x 1.11
 iter::bench_skip_chain_ref_sum        2,603,669                   3,458,675                        855,006   32.84%   x 0.75
 iter::bench_skip_chain_sum            1,200,017                   1,016,185                       -183,832  -15.32%   x 1.18
 iter::bench_skip_while_chain_ref_sum  3,198,618                   4,292,133                      1,093,515   34.19%   x 0.75
 iter::bench_skip_while_chain_sum      1,044,122                   1,016,213                        -27,909   -2.67%   x 1.03
 iter::bench_slice_chain_ref_sum       830                         590                                 -240  -28.92%   x 1.41
 iter::bench_slice_chain_sum           692                         717                                   25    3.61%   x 0.97
 iter::bench_take_while_chain_ref_sum  1,483,548                   1,543,175                         59,627    4.02%   x 0.96
 iter::bench_take_while_chain_sum      564,679                     748,007                          183,328   32.47%   x 0.75

The last one has the most favorable results, but they are still pretty mixed.
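
For reading the tables above, here is a minimal sketch of how the last three columns relate to the first two. It assumes the columns are old ns/iter, new ns/iter, absolute difference, percentage change relative to the old value, and old/new as a speedup factor. That reading matches the numbers shown, but it is an assumption; the `Row` type and its methods are hypothetical names used only for illustration.

```rust
/// One row of a before/after comparison, in ns/iter.
/// (Hypothetical helper type, not part of any benchmarking tool.)
struct Row {
    name: &'static str,
    old: f64,
    new: f64,
}

impl Row {
    /// Absolute difference: negative means the new build is faster.
    fn diff(&self) -> f64 {
        self.new - self.old
    }

    /// Percentage change relative to the old measurement.
    fn diff_pct(&self) -> f64 {
        self.diff() / self.old * 100.0
    }

    /// Speedup factor: values above 1.0 mean the new build is faster.
    fn speedup(&self) -> f64 {
        self.old / self.new
    }
}

fn main() {
    // Values copied from two rows of the tables above.
    let rows = [
        Row { name: "iter::bench_slice_chain_ref_sum", old: 813.0, new: 473.0 },
        Row { name: "iter::bench_for_each_chain_loop", old: 2_683_672.0, new: 3_154_062.0 },
    ];
    for r in &rows {
        println!(
            "{:<36} {:>10} {:>8.2}%   x {:.2}",
            r.name,
            r.diff(),
            r.diff_pct(),
            r.speedup()
        );
    }
}
```

Under this reading, a row with a negative percentage and a factor above 1.00 is an improvement, and a positive percentage with a factor below 1.00 is a regression.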

@the8472
Member Author

the8472 commented Feb 23, 2023

This is on a Zen 2 3960X, this time with a stage 1 build and 1 CGU:

base:

perf stat
running 1 test
test iter::bench_skip_chain_ref_sum                                ... bench:   3,902,217 ns/iter (+/- 151,106)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 411 filtered out; finished in 3.55s

	finished in 3.694 seconds
Build completed successfully in 0:00:04

 Performance counter stats for './x bench --keep-stage 0 --stage 1 library/core/ --test-args iter::bench_skip_chain_ref_sum':

          4,486.33 msec task-clock                       #    1.001 CPUs utilized          
               678      context-switches                 #  151.126 /sec                   
               152      cpu-migrations                   #   33.881 /sec                   
            72,849      page-faults                      #   16.238 K/sec                  
    17,260,463,951      cycles                           #    3.847 GHz                      (84.07%)
       238,723,965      stalled-cycles-frontend          #    1.38% frontend cycles idle     (84.13%)
    11,449,836,029      stalled-cycles-backend           #   66.34% backend cycles idle      (83.68%)
    37,087,407,432      instructions                     #    2.15  insn per cycle         
                                                  #    0.31  stalled cycles per insn  (83.83%)
     6,130,965,765      branches                         #    1.367 G/sec                    (84.06%)
        14,031,568      branch-misses                    #    0.23% of all branches          (84.13%)

       4.482230133 seconds time elapsed

       4.198007000 seconds user
       0.281285000 seconds sys
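
The derived figures in the right-hand column of the `perf stat` output can be reproduced from the raw counters. The sketch below assumes that "insn per cycle" is instructions divided by cycles and that "backend cycles idle" is stalled-cycles-backend divided by cycles. The `Counters` type is a hypothetical helper, and the inputs are the (multiplexing-scaled) values printed for the base build.

```rust
/// Raw counters copied from the base `perf stat` output.
/// (Hypothetical helper type, for illustration only.)
struct Counters {
    cycles: u64,
    instructions: u64,
    stalled_backend: u64,
}

impl Counters {
    /// Instructions retired per cycle.
    fn ipc(&self) -> f64 {
        self.instructions as f64 / self.cycles as f64
    }

    /// Share of cycles in which the backend was stalled, in percent.
    fn backend_idle_pct(&self) -> f64 {
        self.stalled_backend as f64 / self.cycles as f64 * 100.0
    }
}

fn main() {
    let base = Counters {
        cycles: 17_260_463_951,
        instructions: 37_087_407_432,
        stalled_backend: 11_449_836_029,
    };
    println!(
        "{:.2} insn per cycle, {:.2}% backend cycles idle",
        base.ipc(),
        base.backend_idle_pct()
    );
}
```
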

perf report (cycles)
       │     <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │     ↑ je        500
       │       xor       %r8d,%r8d
       │     ↓ jmp       5b0
       │     core::num::<impl u64>::unchecked_add:
       │       nop
       │5a0:┌─→inc       %r8
       │    │core::hint::black_box:
       │    │  mov       %r9,(%rsp)
       │    │core::cmp::impls::<impl core::cmp::PartialOrd for u64>::lt:
       │    │  cmp       %r11,%r8
       │    │<core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │    │↑ je        500
       │    │core::iter::adapters::map::Map<I,F>::new:
       │5b0:│  movq      $0x0,0x38(%rsp)
       │    │<core::iter::adapters::skip::Skip<I> as core::iter::traits::iterator::Iterator>::next:
       │    │  movq      $0x3e9,0x18(%rsp)
       │    │core::hint::black_box:
       │    │  movq      $0x3e8,(%rsp)
       │    │  mov       (%rsp),%rcx
       │    │  mov       $0x1,%edx
       │    │↓ jmp       603
       │    │<core::iter::adapters::skip::Skip<I> as core::iter::traits::iterator::Iterator>::next:
       │    │  data16    cs nopw 0x0(%rax,%rax,1)
       │5e0:│  lea       0x1(%rsi),%rcx
  0.95 │    │  mov       %rcx,(%rax)
       │    │core::hint::black_box:
  2.88 │    │  mov       %rsi,(%rsp)
  3.82 │    │  mov       (%rsp),%rcx
  0.05 │    │  mov       $0x1,%ebx
       │    │  mov       %rdx,%rbp
       │    │<i64 as core::iter::traits::accum::Sum>::sum::{{closure}}:
  2.12 │5f7:│  add       %r9,%rcx
  1.28 │    │  mov       %rbp,%rdx
       │    │core::iter::traits::iterator::Iterator::fold:
       │    ├──cmp       $0x1,%rbx
  0.80 │    └──jne       5a0
       │603:   mov       %rcx,%r9
       │     core::option::Option<T>::as_mut:
  1.94 │       test      %rdx,%rdx
       │     core::iter::adapters::chain::and_then_or_clear:
  3.30 │     ↓ je        61c
       │       mov       0x18(%rsp),%rsi
 34.43 │       mov       %r15,%rax
       │       cmp       $0xf4240,%rsi
       │     ↑ jl        5e0
       │     <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │61c:   mov       0x38(%rsp),%rsi
 31.74 │       mov       $0x0,%edx
       │       mov       %r14,%rax
       │       mov       $0x0,%ebp
       │       mov       $0x0,%ebx
       │     core::cmp::impls::<impl core::cmp::PartialOrd for i64>::lt:
       │       cmp       $0xf423f,%rsi
       │     <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │     ↑ jle       5e0
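
For orientation, the constants in the listing (`0x3e8` = 1000, `0xf4240` = 1,000,000) and the inlined symbols (`Skip::next`, `black_box`, `Sum`, `chain::and_then_or_clear`) suggest a benchmark of roughly the following shape. This is a sketch under that assumption, not a copy of the benchmark source; the function names `chain_sum` and `chain_ref_sum` are hypothetical. The `_ref_` variant is assumed to sum through `by_ref()`, which is meant to drive the chain through repeated `next()` calls (external iteration) rather than through `Chain`'s own `fold`.

```rust
use std::hint::black_box;

/// Internal iteration: `sum` can use the `fold` implementation of the adapters.
fn chain_sum() -> i64 {
    (0i64..1_000_000)
        .chain(0..1_000_000)
        .skip(1000)
        .map(black_box)
        .sum()
}

/// External iteration: summing via `by_ref()` goes through `&mut I`,
/// which is assumed here to end up calling `next()` once per element.
fn chain_ref_sum() -> i64 {
    (0i64..1_000_000)
        .chain(0..1_000_000)
        .skip(1000)
        .map(black_box)
        .by_ref()
        .sum()
}

fn main() {
    // Both styles must agree on the result; only the generated code differs.
    assert_eq!(chain_sum(), chain_ref_sum());
    println!("sum = {}", chain_ref_sum());
}
```

Both functions compute the same value, so any timing difference between the two styles comes from how the loop is compiled, which is what the per-instruction cycle percentages above are showing.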

HEAD:

perf stat
test iter::bench_skip_chain_ref_sum                                ... bench:   2,111,466 ns/iter (+/- 62,704)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 411 filtered out; finished in 4.59s

	finished in 4.744 seconds
Build completed successfully in 0:00:05

 Performance counter stats for './x bench --keep-stage 0 --stage 1 library/core/ --test-args iter::bench_skip_chain_ref_sum':

          5,541.40 msec task-clock                       #    1.000 CPUs utilized          
               777      context-switches                 #  140.217 /sec                   
               151      cpu-migrations                   #   27.249 /sec                   
            71,916      page-faults                      #   12.978 K/sec                  
    21,760,400,089      cycles                           #    3.927 GHz                      (83.85%)
       247,295,961      stalled-cycles-frontend          #    1.14% frontend cycles idle     (84.11%)
    11,342,182,645      stalled-cycles-backend           #   52.12% backend cycles idle      (83.88%)
   108,241,690,775      instructions                     #    4.97  insn per cycle         
                                                  #    0.10  stalled cycles per insn  (84.00%)
    13,281,399,988      branches                         #    2.397 G/sec                    (84.11%)
        13,076,998      branch-misses                    #    0.10% of all branches          (83.17%)

       5.538899860 seconds time elapsed

       5.256745000 seconds user
       0.274922000 seconds sys

perf report (cycles)
       │     <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │     ↑ je        550
       │       xor       %r8d,%r8d
       │     ↓ jmp       603
       │     core::num::<impl u64>::unchecked_add:
       │       data16    data16 cs nopw 0x0(%rax,%rax,1)
       │5f0:┌─→inc       %r8
       │    │core::hint::black_box:
       │    │  mov       %r9,0x8(%rsp)
       │    │core::cmp::impls::<impl core::cmp::PartialOrd for u64>::lt:
       │    │  cmp       0x10(%rsp),%r8
       │    │<core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │    │↑ je        550
       │    │core::hint::black_box:
       │603:│  movq      $0x3e8,0x8(%rsp)
       │    │  mov       0x8(%rsp),%rcx
       │    │  mov       $0x1,%ebp
       │    │  mov       $0x3e9,%eax
       │    │  xor       %esi,%esi
       │    │  mov       $0x1,%edi
       │    │  mov       $0x1,%ebx
       │    │↓ jmp       677
       │    │  nop
       │630:│  mov       %rbp,%r10
  0.58 │    │  mov       %rsi,%rcx
  9.70 │    │  mov       %rdi,%rbp
       │    │  mov       %rax,%rsi
  8.87 │    │  mov       %rcx,%rax
       │63f:│  mov       %rsi,0x8(%rsp)
       │    │<core::iter::adapters::skip::Skip<I> as core::iter::traits::iterator::Iterator>::next:
  0.04 │    │  inc       %rsi
       │    │core::hint::black_box:
       │    │  mov       0x8(%rsp),%rcx
 37.04 │    │  mov       $0x1,%r15d
       │    │  mov       %r10,%r11
  7.90 │    │  mov       %rbx,%r12
       │    │core::mem::swap_simple:                                                                                                                                                                                                                                                                                       ▒
       │658:│  mov       %rbp,%rdi                                                                                                                                                                                                                                                                                         ▒
       │    │  mov       %rax,%rdx                                                                                                                                                                                                                                                                                         ▒
       │    │<i64 as core::iter::traits::accum::Sum>::sum::{{closure}}:                                                                                                                                                                                                                                                    ▒
       │    │  add       %r9,%rcx                                                                                                                                                                                                                                                                                          ▒
       │    │  mov       %rsi,%rax                                                                                                                                                                                                                                                                                         ▒
       │    │  mov       %r11,%rbp                                                                                                                                                                                                                                                                                         ▒
 10.10 │    │  mov       %rdx,%rsi                                                                                                                                                                                                                                                                                         ▒
       │    │  mov       %r12,%rbx                                                                                                                                                                                                                                                                                         ▒
       │    │core::iter::traits::iterator::Iterator::fold:                                                                                                                                                                                                                                                                 ▒
  7.83 │    ├──cmp       $0x1,%r15                                                                                                                                                                                                                                                                                         ▒
       │    └──jne       5f0                                                                                                                                                                                                                                                                                               ▒
       │677:   mov       %rcx,%r9                                                                                                                                                                                                                                                                                          ▒
       │     core::option::Option<T>::as_mut:                                                                                                                                                                                                                                                                              ▒
       │       test      %rbx,%rbx                                                                                                                                                                                                                                                                                         ▒
       │     core::iter::adapters::chain::and_then_or_clear:                                                                                                                                                                                                                                                               ▒
       │     ↓ je        689                                                                                                                                                                                                                                                                                               ▒
       │     core::cmp::impls::<impl core::cmp::PartialOrd for i64>::lt:                                                                                                                                                                                                                                                   ◆
       │       cmp       $0xf4240,%rax                                                                                                                                                                                                                                                                                     ▒
       │     <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:                                                                                                                                                                                                                              ▒
       │     ↑ jl        630                                                                                                                                                                                                                                                                                               ▒
       │     <core::iter::adapters::chain::Chain<A,B> as core::iter::adapters::chain::SpecChain>::next:                                                                                                                                                                                                                    ▒
       │       xor       %ebp,%ebp                                                                                                                                                                                                                                                                                         ▒
       │     core::option::Option<T>::as_mut:                                                                                                                                                                                                                                                                              ▒
       │689:   test      %rdi,%rdi                                                                                                                                                                                                                                                                                         ▒
       │     core::iter::adapters::chain::and_then_or_clear:                                                                                                                                                                                                                                                               ▒
       │     ↓ je        6b3                                                                                                                                                                                                                                                                                               

Absolute numbers aren't comparable due to libtest's adaptive iteration count (a kingdom for a fixed iteration option...), but it's clear that IPC is higher and the number of branch misses is lower.

@the8472
Member Author

the8472 commented Feb 23, 2023

For an apples-to-apples comparison, here are my x86_64-unknown-linux-gnu binaries

Using your binaries:

AMD 3960X, Arch Linux
 iter::bench_enumerate_chain_ref_sum   3,796,358       2,140,264           -1,656,094  -43.62%   x 1.77 
 iter::bench_filter_chain_count        971,643         1,052,222               80,579    8.29%   x 0.92 
 iter::bench_filter_chain_ref_count    3,912,179       2,389,902           -1,522,277  -38.91%   x 1.64 
 iter::bench_filter_chain_ref_sum      1,726,115       1,668,260              -57,855   -3.35%   x 1.03 
 iter::bench_flat_map_chain_ref_sum    4,929,951       6,990,377            2,060,426   41.79%   x 0.71 
 iter::bench_for_each_chain_loop       3,881,896       1,900,245           -1,981,651  -51.05%   x 2.04 
 iter::bench_for_each_chain_ref_fold   3,971,489       2,058,431           -1,913,058  -48.17%   x 1.93 
 iter::bench_fuse_chain_ref_sum        3,870,879       2,390,027           -1,480,852  -38.26%   x 1.62 
 iter::bench_inspect_chain_ref_sum     3,903,231       2,059,174           -1,844,057  -47.24%   x 1.90 
 iter::bench_peekable_chain_ref_sum    3,912,336       2,059,438           -1,852,898  -47.36%   x 1.90 
 iter::bench_peekable_chain_sum        569,274         474,788                -94,486  -16.60%   x 1.20 
 iter::bench_skip_chain_ref_sum        3,798,435       2,138,934           -1,659,501  -43.69%   x 1.78 
 iter::bench_skip_chain_sum            692,548         475,012               -217,536  -31.41%   x 1.46 
 iter::bench_skip_while_chain_ref_sum  3,781,100       3,897,186              116,086    3.07%   x 0.97 
 iter::bench_skip_while_chain_sum      462,010         475,298                 13,288    2.88%   x 0.97 
 iter::bench_slice_chain_ref_sum       554             436                       -118  -21.30%   x 1.27 
 iter::bench_take_while_chain_ref_sum  1,244,736       1,056,973             -187,763  -15.08%   x 1.18 
 iter::bench_take_while_chain_sum      256,337         501,667                245,330   95.71%   x 0.51 
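
For reading the table above: the columns look like old ns/iter, new ns/iter, absolute difference, percent change, and speedup. That column interpretation is an assumption based on the numbers, not something stated in the thread. A minimal sketch that reproduces the last three columns from the first two (the `Row` type and its method names are hypothetical):

```rust
/// One benchmark row: timings before and after the change, in ns/iter.
/// `Row` and its methods are illustrative names, not part of any tool.
struct Row {
    old_ns: u64,
    new_ns: u64,
}

impl Row {
    /// Absolute change; negative means the new build is faster.
    fn diff_ns(&self) -> i64 {
        self.new_ns as i64 - self.old_ns as i64
    }

    /// Change relative to the old timing, in percent.
    fn diff_pct(&self) -> f64 {
        self.diff_ns() as f64 / self.old_ns as f64 * 100.0
    }

    /// Speedup factor; values below 1.0 are regressions.
    fn speedup(&self) -> f64 {
        self.old_ns as f64 / self.new_ns as f64
    }
}
```

With `old_ns = 3_796_358` and `new_ns = 2_140_264` (the `bench_enumerate_chain_ref_sum` row), this gives a difference of -1,656,094 ns, about -43.62%, and a speedup of about 1.77, matching the row.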

So yeah, it's the CPU.

@cuviper
Member

cuviper commented Feb 23, 2023

AMD did improve the branch predictor in Zen 3, so it seems we're seeing that to great effect.

base (7a9805368ebc):

running 1 test
test iter::bench_skip_chain_ref_sum                                ... bench:   1,281,926 ns/iter (+/- 13,244)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 411 filtered out; finished in 5.94s


 Performance counter stats for './7a9805368ebc/bin/corebenches-ac3e1e8c078c7482 --bench iter::bench_skip_chain_ref_sum':

          5,942.66 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               176      page-faults:u                    #   29.616 /sec
    27,870,211,887      cycles:u                         #    4.690 GHz                      (83.32%)
         8,065,226      stalled-cycles-frontend:u        #    0.03% frontend cycles idle     (83.32%)
            13,336      stalled-cycles-backend:u         #    0.00% backend cycles idle      (83.33%)
   166,482,113,215      instructions:u                   #    5.97  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.34%)
    26,990,127,586      branches:u                       #    4.542 G/sec                    (83.34%)
            47,550      branch-misses:u                  #    0.00% of all branches          (83.34%)

       5.943147770 seconds time elapsed

       5.909855000 seconds user
       0.000984000 seconds sys

new (beb9614c03b7):

running 1 test
test iter::bench_skip_chain_ref_sum                                ... bench:   1,673,832 ns/iter (+/- 5,689)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 411 filtered out; finished in 0.50s


 Performance counter stats for './beb9614c03b7/bin/corebenches-ac3e1e8c078c7482 --bench iter::bench_skip_chain_ref_sum':

            504.67 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               173      page-faults:u                    #  342.798 /sec
     2,412,468,602      cycles:u                         #    4.780 GHz                      (83.24%)
           469,826      stalled-cycles-frontend:u        #    0.02% frontend cycles idle     (83.36%)
               153      stalled-cycles-backend:u         #    0.00% backend cycles idle      (83.36%)
    15,038,910,520      instructions:u                   #    6.23  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.36%)
     1,804,833,553      branches:u                       #    3.576 G/sec                    (83.36%)
            19,104      branch-misses:u                  #    0.00% of all branches          (83.34%)

       0.504889089 seconds time elapsed

       0.502324000 seconds user
       0.000000000 seconds sys

So the branch-misses and stalled-cycles are negligible in either case. My i7-7700K is similar with negligible branch-misses (and stalls aren't reported).
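
The derived figures in the two `perf stat` dumps can be rechecked from the raw counters. A small sketch, assuming only the usual definitions (IPC is instructions divided by cycles, miss rate is branch-misses divided by branches); the `Counters` type is a hypothetical helper, not part of perf:

```rust
/// Raw user-space counters copied from a `perf stat` run.
/// `Counters` is an illustrative name, not a perf API.
struct Counters {
    cycles: u64,
    instructions: u64,
    branches: u64,
    branch_misses: u64,
}

impl Counters {
    /// Instructions retired per cycle.
    fn ipc(&self) -> f64 {
        self.instructions as f64 / self.cycles as f64
    }

    /// Fraction of branches that were mispredicted (0.0 to 1.0).
    fn branch_miss_rate(&self) -> f64 {
        self.branch_misses as f64 / self.branches as f64
    }
}

/// Counters from the base (7a9805368ebc) run quoted above.
fn base_run() -> Counters {
    Counters {
        cycles: 27_870_211_887,
        instructions: 166_482_113_215,
        branches: 26_990_127_586,
        branch_misses: 47_550,
    }
}

/// Counters from the new (beb9614c03b7) run quoted above.
fn new_run() -> Counters {
    Counters {
        cycles: 2_412_468_602,
        instructions: 15_038_910_520,
        branches: 1_804_833_553,
        branch_misses: 19_104,
    }
}
```

Both runs come out at roughly 6 instructions per cycle (about 5.97 and 6.23), and the miss rate is on the order of one in 100,000 branches or lower, which is why perf prints it as 0.00%.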

So yeah, it's the CPU.

I'm not sure how to make that decision. I tend to think we should favor the future, especially since the code is simpler without specialization, but that appears biased toward my machines. :)

@cuviper cuviper added the I-libs-nominated Nominated for discussion during a libs team meeting. label Feb 23, 2023
@the8472
Member Author

the8472 commented Feb 23, 2023

I'll try on an Intel laptop and check if the code can be tweaked further.

@the8472 the8472 removed the I-libs-nominated Nominated for discussion during a libs team meeting. label Mar 1, 2023
@the8472
Member Author

the8472 commented Mar 1, 2023

From T-libs meeting: We're unlikely to take this optimization as long as the results are mixed (benefiting some CPUs while regressing others) and the optimization is unsafe and tricky.
If an improvement can be found that makes it neutral to positive on all CPUs it's more likely to be accepted.

@cuviper
Member

cuviper commented Mar 14, 2023

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 14, 2023
@bors
Contributor

bors commented Mar 28, 2023

☔ The latest upstream changes (presumably #109692) made this pull request unmergeable. Please resolve the merge conflicts.

@JohnCSimon
Member

@the8472
ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks!

@rust-log-analyzer
Collaborator

The job mingw-check-tidy failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@v4' (SHA:8ade135a41bc03ea155e62e844d188df1ea18608)
Download action repository 'actions/upload-artifact@v3' (SHA:a8a3f3ad30e3422c9c7b888a15615d19a852ae32)
Complete job name: PR - mingw-check-tidy
git config --global core.autocrlf false
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
---
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_bbdbdb06-6428-438f-bc9f-40eda3d11128
GITHUB_EVENT_NAME=pull_request
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
GITHUB_HEAD_REF=chain-swap
GITHUB_JOB=pr
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_bbdbdb06-6428-438f-bc9f-40eda3d11128
GITHUB_REF=refs/pull/107701/merge
GITHUB_REF_NAME=107701/merge
GITHUB_REF_PROTECTED=false
---
Removing intermediate container 61ebf8bb67d8
 ---> 0d710da557bd
Step 6/10 : COPY host-x86_64/mingw-check/reuse-requirements.txt /tmp/
 ---> e4d2bffd5e00
Step 7/10 : RUN pip3 install --no-deps --no-cache-dir --require-hashes -r /tmp/reuse-requirements.txt     && pip3 install virtualenv
Collecting binaryornot==0.4.4
  Downloading binaryornot-0.4.4-py2.py3-none-any.whl (9.0 kB)
Collecting boolean-py==4.0
  Downloading boolean.py-4.0-py3-none-any.whl (25 kB)
---
Building wheels for collected packages: reuse
  Building wheel for reuse (pyproject.toml): started
  Building wheel for reuse (pyproject.toml): finished with status 'done'
  Created wheel for reuse: filename=reuse-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl size=180117 sha256=2196c9034bf565528bbb1ee6dad4f753eb813f58822363e6b768f09c73e4d4ff
  Stored in directory: /tmp/pip-ephem-wheel-cache-i40_h0zg/wheels/c2/3c/b9/1120c2ab4bd82694f7e6f0537dc5b9a085c13e2c69a8d0c76d
Installing collected packages: boolean-py, binaryornot, setuptools, reuse, python-debian, markupsafe, license-expression, jinja2, chardet
  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.6.0
    Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
---
  Downloading virtualenv-20.24.5-py3-none-any.whl (3.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 MB 77.6 MB/s eta 0:00:00
Collecting filelock<4,>=3.12.2
  Downloading filelock-3.12.4-py3-none-any.whl (11 kB)
Collecting platformdirs<4,>=3.9.1
  Downloading platformdirs-3.11.0-py3-none-any.whl (17 kB)
Collecting distlib<1,>=0.3.7
  Downloading distlib-0.3.7-py2.py3-none-any.whl (468 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.9/468.9 KB 105.4 MB/s eta 0:00:00
Installing collected packages: distlib, platformdirs, filelock, virtualenv
Successfully installed distlib-0.3.7 filelock-3.12.4 platformdirs-3.11.0 virtualenv-20.24.5
Removing intermediate container 7630ee10c952
 ---> 07ca2c7903a7
Step 8/10 : COPY host-x86_64/mingw-check/validate-toolstate.sh /scripts/
 ---> 6d6b3925d74e
Step 9/10 : COPY host-x86_64/mingw-check/validate-error-codes.sh /scripts/
 ---> 2a7a85788faa
Step 10/10 : ENV SCRIPT TIDY_PRINT_DIFF=1 python2.7 ../x.py test            --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
Removing intermediate container 83da0a6d3688
 ---> 8a562a0fd07c
Successfully built 8a562a0fd07c
Successfully tagged rust-ci:latest
##[endgroup]
Built container sha256:8a562a0fd07c2523aaf717bce7515b1f5cdfbc91e129f10a02b5712fb7f15d18
Uploading finished image sha256:8a562a0fd07c2523aaf717bce7515b1f5cdfbc91e129f10a02b5712fb7f15d18 to https://ci-caches.rust-lang.org/docker/8849b25aebb63c7041ab10114da59fac9c6c89ff409673e53f6251b7e63c69daeaca7298d30885d05004ab27b231421908523f297222d07a53450f37e4691d72
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
8a562a0fd07c   1 second ago     /bin/sh -c #(nop)  ENV SCRIPT=TIDY_PRINT_DIF…   0B        
6d6b3925d74e   2 seconds ago    /bin/sh -c #(nop) COPY file:078ea1d11e7b7cda…   367B      
07ca2c7903a7   4 seconds ago    |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   23.9MB    
e4d2bffd5e00   10 seconds ago   /bin/sh -c #(nop) COPY file:ac591dd6bc5afa66…   5.33kB    
0d710da557bd   11 seconds ago   |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   23.1MB    
---
<missing>      6 weeks ago      /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B        
<missing>      6 weeks ago      /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH     0B        
<missing>      6 weeks ago      /bin/sh -c #(nop)  ARG RELEASE                  0B        

<botocore.awsrequest.AWSRequest object at 0x7fd319557350>
gzip: stdout: Broken pipe
xargs: docker: terminated by signal 13
https://ci-caches.rust-lang.org/docker/8849b25aebb63c7041ab10114da59fac9c6c89ff409673e53f6251b7e63c69daeaca7298d30885d05004ab27b231421908523f297222d07a53450f37e4691d72
sha256:8a562a0fd07c2523aaf717bce7515b1f5cdfbc91e129f10a02b5712fb7f15d18
---
DirectMap4k:      194496 kB
DirectMap2M:     7145472 kB
DirectMap1G:    11534336 kB
##[endgroup]
Executing TIDY_PRINT_DIFF=1 python2.7 ../x.py test            --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
+ TIDY_PRINT_DIFF=1 python2.7 ../x.py test --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
    Finished dev [unoptimized] target(s) in 0.03s
##[endgroup]
downloading https://ci-artifacts.rust-lang.org/rustc-builds-alt/5333b878c8bc1c4267a67ea3682663629e47541a/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz
extracting /checkout/obj/build/cache/llvm-5333b878c8bc1c4267a67ea3682663629e47541a-true/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz to /checkout/obj/build/x86_64-unknown-linux-gnu/ci-llvm
---
   Compiling tidy v0.1.0 (/checkout/src/tools/tidy)
    Finished release [optimized] target(s) in 26.39s
##[endgroup]
fmt check
##[error]Diff in /checkout/library/core/src/iter/adapters/chain.rs at line 1:
 use crate::iter::{DoubleEndedIterator, FusedIterator, Iterator, TrustedLen};
 use crate::num::NonZeroUsize;
-use crate::{mem, ptr};
 use crate::ops::Try;
+use crate::{mem, ptr};
 
 /// An iterator that links two iterators together, in a chain.
 ///
##[error]Diff in /checkout/library/core/src/iter/adapters/chain.rs at line 371:
     #[inline]
     fn next(&mut self) -> Option<A::Item> {
-        let mut result = self.a.as_mut().and_then( Iterator::next);
+        let mut result = self.a.as_mut().and_then(Iterator::next);
         if result.is_none() {
             if mem::needs_drop::<A>() {
                 // swap iters to avoid running drop code inside the loop.
##[error]Diff in /checkout/library/core/src/iter/adapters/chain.rs at line 399:
     #[inline]
     fn next_back(&mut self) -> Option<Self::Item> {
-        let mut result = self.b.as_mut().and_then( DoubleEndedIterator::next_back);
+        let mut result = self.b.as_mut().and_then(DoubleEndedIterator::next_back);
         if result.is_none() {
             if mem::needs_drop::<A>() {
                 // swap iters to avoid running drop code inside the loop.
Running `"/checkout/obj/build/x86_64-unknown-linux-gnu/rustfmt/bin/rustfmt" "--config-path" "/checkout" "--edition" "2021" "--unstable-features" "--skip-children" "--check" "/checkout/library/core/src/iter/adapters/map_windows.rs" "/checkout/library/core/src/iter/adapters/chain.rs" "/checkout/library/core/src/iter/adapters/fuse.rs" "/checkout/library/core/src/iter/adapters/filter.rs" "/checkout/library/core/src/iter/adapters/intersperse.rs" "/checkout/library/core/src/iter/adapters/zip.rs" "/checkout/library/core/src/iter/adapters/by_ref_sized.rs" "/checkout/library/core/src/iter/adapters/cloned.rs"` failed.
If you're running `tidy`, try again with `--bless`. Or, if you just want to format code, run `./x.py fmt` instead.
  local time: Mon Oct  2 23:45:02 UTC 2023
  network time: Mon, 02 Oct 2023 23:45:02 GMT
##[error]Process completed with exit code 1.
Post job cleanup.
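
The `chain.rs` hunks in the diff above only show the first lines of `next`, with the comment about swapping the iterators. As a rough illustration of why having both halves be the same type makes a swap possible, here is a simplified, safe sketch. `SwapChain` is a hypothetical stand-in, not the PR's code: it drops the exhausted iterator eagerly and does not reproduce the `needs_drop` branch or any `unsafe` the PR may use.

```rust
use std::mem;

/// Simplified chain of two iterators of the SAME type.
/// Hypothetical illustration only; not `core::iter::Chain`.
struct SwapChain<I> {
    a: Option<I>,
    b: Option<I>,
}

impl<I: Iterator> SwapChain<I> {
    fn new(a: I, b: I) -> Self {
        SwapChain { a: Some(a), b: Some(b) }
    }
}

impl<I: Iterator> Iterator for SwapChain<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Hot path: only ever poll the front slot.
        let mut result = self.a.as_mut().and_then(Iterator::next);
        if result.is_none() {
            // Front is exhausted (or empty): move the back iterator into
            // the front slot. This only type-checks because both are `I`.
            mem::swap(&mut self.a, &mut self.b);
            // Discard the exhausted iterator so it is never polled again.
            self.b = None;
            result = self.a.as_mut().and_then(Iterator::next);
        }
        result
    }
}
```

After the swap, every later call goes through the same `self.a` path, so the loop body no longer has to choose between two different iterators on each step. Whether that helps or hurts appears to depend on the CPU, as the numbers in this thread show.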

@the8472
Member Author

the8472 commented Oct 3, 2023

Managed to get some improvements on Intel, but it's still mixed compared to the Zen 2 results.

@the8472 the8472 closed this Oct 3, 2023
Labels
perf-regression Performance regression. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.