cmd/compile: missed opportunity to coalesce reads/writes #41663
I'd like to work on this one!
@agarciamontoro Thanks for offering to take a look. I suspect the best place to start is to add a new codegen test, something like:

func f(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}

After verifying that your new test fails, the tricky bit will be figuring out why the existing rules don't fire. Most likely another optimization is being applied first and is interfering with the pattern match. The relevant rules are in go/src/cmd/compile/internal/ssa/gen/AMD64.rules, lines 1500 to 1995 at 874b313.
Finally, if you get a chance, try to check that other architectures optimize the pattern too.
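For example, the test above could carry annotations for more than one architecture. The arm64 line below is my assumption about the relevant mnemonics (MOVB/MOVH/MOVW for the narrower accesses), not a verified expectation:

// A sketch of a multi-architecture codegen test; the arm64 patterns are assumed.
func f(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	// arm64:-`MOVB`,-`MOVH`,-`MOVW`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}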
Aside: it would be really nice to do the unaligned load/store merging optimizations in a generic optimization pass. These rules are quite hard to maintain when there are a lot of optimizations that might interfere with the target patterns.
As a quick aside, when new rules are added, is there anything to warn us about them making other rules suddenly trigger less often? With the huge number of rules we have today, it's practically impossible to foresee that kind of interaction. I guess we can keep adding more and more tests to cover common cases, but it still feels like some sort of tooling would be nice. I imagine there are a few rules today that basically never trigger anymore, for example.
@mvdan I use compilecmp a lot (https://github.com/josharian/compilecmp). That at least tells me when the generated code for a function grows significantly (often a sign I've broken a pre-existing rewrite rule). In practice the codegen tests are also fairly good at catching a lot of stuff; there is quite a lot of coverage there now.
Thank you for all the detailed info, I'll start investigating as soon as possible! I may get back to you in the following days with questions; this would be my first contribution to the project :)
I just bumped into this again. @agarciamontoro anything I can do to help you out here?
Hi @josharian, sorry for the ghosting here, life got in the way! I did take a look at this a couple of weeks ago and learned quite a lot about how everything works. However, I got stuck when trying to identify which rules apply to the generated code here. My initial idea was to simply add a new rule that covers this specific issue, but I wanted to first find the root cause of the wrong rule (or rules). The problem is that we're generating code to store a byte, then a long, then a word, and finally the last byte.

There are of course rules for the "normal" cases, where we coalesce sequential byte stores into a quad. But there is nothing for converting the sequence B, L, W, B into a Q. We could add a specific rule for that, but I would like to know why we have this weird sequence in the first place (which is where I got stuck). If you could give me a hint on where to look to debug this, it would be awesome.
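To make that shape concrete, here is a minimal, self-contained sketch (my own illustration with hypothetical names and assumed offsets, not compiler output) showing that a byte/long/word/byte sequence covers the same eight bytes as the single quad store the rules should produce:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// storeSplit mimics the B, L, W, B sequence: byte 0, bytes 1-4, bytes 5-6,
// and byte 7. The exact offsets are an assumption for illustration.
func storeSplit(b []byte, v uint64) {
	b[0] = byte(v)                                      // MOVB
	binary.LittleEndian.PutUint32(b[1:], uint32(v>>8))  // MOVL
	binary.LittleEndian.PutUint16(b[5:], uint16(v>>40)) // MOVW
	b[7] = byte(v >> 56)                                // MOVB
}

// storeQuad is the single store the rules should coalesce to (one MOVQ).
func storeQuad(b []byte, v uint64) {
	binary.LittleEndian.PutUint64(b, v)
}

func main() {
	x, y := make([]byte, 8), make([]byte, 8)
	storeSplit(x, 0x0102030405060708)
	storeQuad(y, 0x0102030405060708)
	fmt.Println(bytes.Equal(x, y)) // true
}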
No worries on disappearing. It's been a tough, unpredictable year for everyone. It's not easy to debug SSA rewrite rules. There's a logging mode you can use, enabled by passing the compiler a debug flag. Don't hesitate to ask if there's more we can help with.
Thank you for the tip, I'll try to investigate the issue with the logging mode. If I don't have any success, I'll simply match the weird pattern; I already have the codegen test. Thanks for the help!
Marking this as Go 1.17. It is a regression, I've encountered it in multiple codebases, it has come up multiple times in Gopher Slack, and it is impacting design decisions in inet.af/netaddr.
From a referenced commit in inet.af/netaddr:

This allows IP manipulation operations to operate on IPs as two registers, which empirically leads to significant speedups, in particular on OOO superscalars where the two halves can be processed in parallel. You might expect that we could keep the representation as [16]byte, do a cycle of BigEndian.Uint64 + tweak + BigEndian.PutUint64, and have that compile down to efficient code. Unfortunately, due to a variety of missing optimizations in the Go compiler, that is not the case, and that code turns into byte-wise operations. On the other hand, converting to a uint64 pair at construction results in efficient construction (a pair of MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit pairs).

See also:
- golang/go#41663
- golang/go#41684
- golang/go#42958
- The discussion in #63

name                              old time/op    new time/op    delta
StdIPv4-8                         146ns ± 2%     141ns ± 2%     -3.42%  (p=0.016 n=5+5)
IPv4-8                            120ns ± 1%     107ns ± 2%     -10.65% (p=0.008 n=5+5)
IPv4_inline-8                     120ns ± 0%     118ns ± 1%     -1.67%  (p=0.016 n=4+5)
StdIPv6-8                         211ns ± 2%     215ns ± 1%     +2.18%  (p=0.008 n=5+5)
IPv6-8                            281ns ± 1%     252ns ± 1%     -10.19% (p=0.008 n=5+5)
IPv4Contains-8                    11.8ns ± 4%    4.7ns ± 2%     -60.00% (p=0.008 n=5+5)
ParseIPv4-8                       68.1ns ± 4%    78.8ns ± 1%    +15.74% (p=0.008 n=5+5)
ParseIPv6-8                       419ns ± 1%     409ns ± 0%     -2.40%  (p=0.016 n=4+5)
StdParseIPv4-8                    73.7ns ± 1%    88.8ns ± 2%    +20.50% (p=0.008 n=5+5)
StdParseIPv6-8                    132ns ± 2%     134ns ± 1%     ~       (p=0.079 n=5+5)
IPPrefixMasking/IPv4_/32-8        36.3ns ± 3%    4.8ns ± 4%     -86.72% (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/17-8        39.0ns ± 0%    4.8ns ± 3%     -87.78% (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/0-8         36.9ns ± 2%    4.8ns ± 4%     -87.07% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/128-8       32.7ns ± 1%    4.7ns ± 2%     -85.47% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/65-8        39.8ns ± 1%    4.7ns ± 1%     -88.13% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/0-8         40.7ns ± 1%    4.7ns ± 2%     -88.41% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/128-8  136ns ± 3%     5ns ± 2%       -96.53% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8   142ns ± 2%     5ns ± 1%       -96.65% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8    143ns ± 2%     5ns ± 3%       -96.67% (p=0.008 n=5+5)
IPSetFuzz-8                       22.7µs ± 2%    16.4µs ± 2%    -27.84% (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
StdIPv4-8                         16.0B ± 0%     16.0B ± 0%     ~       (all equal)
IPv4-8                            0.00B          0.00B          ~       (all equal)
IPv4_inline-8                     0.00B          0.00B          ~       (all equal)
StdIPv6-8                         16.0B ± 0%     16.0B ± 0%     ~       (all equal)
IPv6-8                            16.0B ± 0%     16.0B ± 0%     ~       (all equal)
IPv4Contains-8                    0.00B          0.00B          ~       (all equal)
ParseIPv4-8                       0.00B          0.00B          ~       (all equal)
ParseIPv6-8                       48.0B ± 0%     48.0B ± 0%     ~       (all equal)
StdParseIPv4-8                    16.0B ± 0%     16.0B ± 0%     ~       (all equal)
StdParseIPv6-8                    16.0B ± 0%     16.0B ± 0%     ~       (all equal)
IPPrefixMasking/IPv4_/32-8        0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv4_/17-8        0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv4_/0-8         0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv6_/128-8       0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv6_/65-8        0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv6_/0-8         0.00B          0.00B          ~       (all equal)
IPPrefixMasking/IPv6_zone_/128-8  16.0B ± 0%     0.0B           -100.00% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8   16.0B ± 0%     0.0B           -100.00% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8    16.0B ± 0%     0.0B           -100.00% (p=0.008 n=5+5)
IPSetFuzz-8                       2.60kB ± 0%    2.60kB ± 0%    ~       (p=1.000 n=5+5)

name                              old allocs/op  new allocs/op  delta
StdIPv4-8                         1.00 ± 0%      1.00 ± 0%      ~       (all equal)
IPv4-8                            0.00           0.00           ~       (all equal)
IPv4_inline-8                     0.00           0.00           ~       (all equal)
StdIPv6-8                         1.00 ± 0%      1.00 ± 0%      ~       (all equal)
IPv6-8                            1.00 ± 0%      1.00 ± 0%      ~       (all equal)
IPv4Contains-8                    0.00           0.00           ~       (all equal)
ParseIPv4-8                       0.00           0.00           ~       (all equal)
ParseIPv6-8                       3.00 ± 0%      3.00 ± 0%      ~       (all equal)
StdParseIPv4-8                    1.00 ± 0%      1.00 ± 0%      ~       (all equal)
StdParseIPv6-8                    1.00 ± 0%      1.00 ± 0%      ~       (all equal)
IPPrefixMasking/IPv4_/32-8        0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv4_/17-8        0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv4_/0-8         0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv6_/128-8       0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv6_/65-8        0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv6_/0-8         0.00           0.00           ~       (all equal)
IPPrefixMasking/IPv6_zone_/128-8  1.00 ± 0%      0.00           -100.00% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8   1.00 ± 0%      0.00           -100.00% (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8    1.00 ± 0%      0.00           -100.00% (p=0.008 n=5+5)
IPSetFuzz-8                       33.0 ± 0%      33.0 ± 0%      ~       (all equal)

Signed-off-by: David Anderson <dave@natulte.net>
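As a sketch of the representation change that commit message describes (the type and field names here are illustrative, not the actual netaddr code):

package main

import (
	"encoding/binary"
	"fmt"
)

// ip128 is a hypothetical stand-in for a two-register IP representation.
type ip128 struct{ hi, lo uint64 }

// fromBytes does the one-time conversion at construction; on amd64 each
// BigEndian.Uint64 call compiles to a MOVQ+BSWAPQ pair.
func fromBytes(b [16]byte) ip128 {
	return ip128{
		hi: binary.BigEndian.Uint64(b[:8]),
		lo: binary.BigEndian.Uint64(b[8:]),
	}
}

// mask keeps the top n bits (0 <= n <= 128): plain 64-bit arithmetic
// instead of byte-wise loops.
func (a ip128) mask(n uint) ip128 {
	if n >= 64 {
		return ip128{a.hi, a.lo &^ (^uint64(0) >> (n - 64))}
	}
	return ip128{a.hi &^ (^uint64(0) >> n), 0}
}

func main() {
	a := fromBytes([16]byte{0x20, 0x01, 0x0d, 0xb8, 8: 0xff, 15: 0xff})
	m := a.mask(65) // keep hi entirely plus the top bit of lo
	fmt.Printf("%016x%016x\n%016x%016x\n", a.hi, a.lo, m.hi, m.lo)
}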
Expect something from my side this weekend, @josharian; I'm on vacation now and have a bit more time. If I'm not able to solve it by then, I'll free it up for someone else :)
We have a few months yet before 1.17, so no rush. :)
It looks like I have a working rule for both of the cases I tried. This change makes the new codegen test pass, but I do have a couple of questions before submitting a patch, mainly about how specific the matched pattern is.
Nice work @agarciamontoro.
Thank you, @agnivade!
Yup, that's a good question. The specific pattern I used is the one generated when compiling the function in the description of the issue. My guess is that this happens because of the order of the previously applied rules, but I was not able to identify the ones that caused the problem: the debug output didn't make the culprit obvious. I can probably come up with a rule that covers all the cases, but do we have proof of the other cases actually happening? I'm not sure how we could check that.
Yay!
The specificity is fine. This is a particular pattern that arises, so this is the pattern we match. It would be nice eventually to do something more principled around how we order and apply rewrite rules, but it gets very weedy very quickly, and this is Good Enough. And we have codegen tests to prevent regressions.
The incremental cost of adding additional rewrite rules is very low. If you're curious, you could add some instrumentation to count how often a given rule triggers.
Change https://golang.org/cl/280456 mentions this issue.
Just sent the change matching the specific pattern. Thank you for the tip on counting rule triggers; I'll keep it in mind for future changes!
This should compile down to two MOVQs on amd64: one to load from x and one to write to b. Instead, it compiles to a series of smaller MOVx instructions. The coalescing rules may need more cases added.
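As a way to see what the two MOVQs buy, here is a self-contained sketch (my own illustration; fWanted is a hypothetical name) showing that the function from the issue is, byte for byte, a plain eight-byte copy, which a coalesced load/store pair expresses directly:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// f is the function from the issue: a little-endian round trip that reads
// eight bytes from x and writes the same eight bytes to b.
func f(b []byte, x *[8]byte) {
	_ = b[8]
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}

// fWanted spells out the intended effect; the coalesced code for f should
// behave like this single 8-byte move.
func fWanted(b []byte, x *[8]byte) {
	copy(b, x[:])
}

func main() {
	b1, b2 := make([]byte, 9), make([]byte, 9)
	x := [8]byte{1, 2, 3, 4, 5, 6, 7, 8}
	f(b1, &x)
	fWanted(b2, &x)
	fmt.Println(bytes.Equal(b1, b2)) // true
}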
cc @randall77 @dr2chase @martisch @mundaym