cmd/compile: missed opportunity to coalesce reads/writes #41663

Closed
josharian opened this issue Sep 27, 2020 · 19 comments
Labels
FrozenDueToAge, help wanted, NeedsFix, Performance

josharian (Contributor) commented Sep 27, 2020

package p

import "encoding/binary"

func f(b []byte, x *[8]byte) {
	_ = b[8]
	t := binary.LittleEndian.Uint64(x[:])
	binary.LittleEndian.PutUint64(b, t)
}

This should compile down to two MOVQs on amd64, one to load from x and one to write to b.

Instead, it compiles to a series of smaller MOVxs. The coalescing rules may need more cases added.

cc @randall77 @dr2chase @martisch @mundaym

josharian added this to the Unplanned milestone Sep 27, 2020
agarciamontoro (Contributor):

I'd like to work on this one!

mundaym (Member) commented Sep 28, 2020

@agarciamontoro Thanks for offering to take a look. I suspect the best place to start is to add a new test in test/codegen/memcombine.go. You'll want to use the annotations to check that no small MOV* operations are generated, for example:

func f(b []byte, x *[8]byte) {
	_ = b[8]
        // amd64:-`MOVB`,-`MOVW`,-`MOVL`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}
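
A fuller sketch of how such a test might look (the function name here is made up, and in practice the new function would simply be appended to the existing test/codegen/memcombine.go, which already has the header, package clause and import):

// asmcheck

package codegen

import "encoding/binary"

// Check that the 8-byte load and the 8-byte store are each emitted as a
// single quad-word operation, i.e. no byte/word/long-sized MOVs remain.
func loadStoreLittleEndian64(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}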

After verifying that your new test fails, the tricky bit will be figuring out why the existing rules don't work. Most likely another optimization is being applied first and is interfering with the pattern match. GOSSAFUNC will be useful for figuring out what is going on (for example, building the package with GOSSAFUNC=f set dumps an ssa.html report that shows function f after every compiler pass). You can find the optimization rules for load/store merging on AMD64 in AMD64.rules:

// Combining byte loads into larger (unaligned) loads.
// There are many ways these combinations could occur. This is
// designed to match the way encoding/binary.LittleEndian does it.
// Little-endian loads
(OR(L|Q) x0:(MOVBload [i0] {s} p mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i1] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i0] {s} p mem)
(OR(L|Q) x0:(MOVBload [i] {s} p0 mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i] {s} p0 mem)
(OR(L|Q) x0:(MOVWload [i0] {s} p mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i1] {s} p mem)))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i0] {s} p mem)
(OR(L|Q) x0:(MOVWload [i] {s} p0 mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i] {s} p0 mem)
(ORQ x0:(MOVLload [i0] {s} p mem)
sh:(SHLQconst [32] x1:(MOVLload [i1] {s} p mem)))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i0] {s} p mem)
(ORQ x0:(MOVLload [i] {s} p0 mem)
sh:(SHLQconst [32] x1:(MOVLload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i] {s} p0 mem)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i0] {s} p mem)) y)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
y))
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i] {s} p0 mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i1] {s} p mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i0] {s} p mem))
y))
&& i1 == i0+2
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i0] {s} p mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i] {s} p1 mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i] {s} p0 mem))
y))
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i] {s} p0 mem)) y)
// Big-endian loads
(OR(L|Q)
x1:(MOVBload [i1] {s} p mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i0] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i0] {s} p mem))
(OR(L|Q)
x1:(MOVBload [i] {s} p1 mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i] {s} p0 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i] {s} p0 mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem))))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i0] {s} p mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i] {s} p0 mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i1] {s} p mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i0] {s} p mem))))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i0] {s} p mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i] {s} p1 mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i] {s} p0 mem))
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i0] {s} p mem))) y)
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
y))
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i] {s} p0 mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem)))
y))
&& i1 == i0+2
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i0] {s} p mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem)))
y))
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i] {s} p0 mem))) y)
// Combine 2 byte stores + shift into rolw 8 + word store
(MOVBstore [i] {s} p w
x0:(MOVBstore [i-1] {s} p (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& clobber(x0)
=> (MOVWstore [i-1] {s} p (ROLWconst <w.Type> [8] w) mem)
(MOVBstore [i] {s} p1 w
x0:(MOVBstore [i] {s} p0 (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x0)
=> (MOVWstore [i] {s} p0 (ROLWconst <w.Type> [8] w) mem)
// Combine stores + shifts into bswap and larger (unaligned) stores
(MOVBstore [i] {s} p w
x2:(MOVBstore [i-1] {s} p (SHRLconst [8] w)
x1:(MOVBstore [i-2] {s} p (SHRLconst [16] w)
x0:(MOVBstore [i-3] {s} p (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& clobber(x0, x1, x2)
=> (MOVLstore [i-3] {s} p (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p3 w
x2:(MOVBstore [i] {s} p2 (SHRLconst [8] w)
x1:(MOVBstore [i] {s} p1 (SHRLconst [16] w)
x0:(MOVBstore [i] {s} p0 (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& clobber(x0, x1, x2)
=> (MOVLstore [i] {s} p0 (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p w
x6:(MOVBstore [i-1] {s} p (SHRQconst [8] w)
x5:(MOVBstore [i-2] {s} p (SHRQconst [16] w)
x4:(MOVBstore [i-3] {s} p (SHRQconst [24] w)
x3:(MOVBstore [i-4] {s} p (SHRQconst [32] w)
x2:(MOVBstore [i-5] {s} p (SHRQconst [40] w)
x1:(MOVBstore [i-6] {s} p (SHRQconst [48] w)
x0:(MOVBstore [i-7] {s} p (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i-7] {s} p (BSWAPQ <w.Type> w) mem)
(MOVBstore [i] {s} p7 w
x6:(MOVBstore [i] {s} p6 (SHRQconst [8] w)
x5:(MOVBstore [i] {s} p5 (SHRQconst [16] w)
x4:(MOVBstore [i] {s} p4 (SHRQconst [24] w)
x3:(MOVBstore [i] {s} p3 (SHRQconst [32] w)
x2:(MOVBstore [i] {s} p2 (SHRQconst [40] w)
x1:(MOVBstore [i] {s} p1 (SHRQconst [48] w)
x0:(MOVBstore [i] {s} p0 (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& sequentialAddresses(p3, p4, 1)
&& sequentialAddresses(p4, p5, 1)
&& sequentialAddresses(p5, p6, 1)
&& sequentialAddresses(p6, p7, 1)
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i] {s} p0 (BSWAPQ <w.Type> w) mem)
// Combine constant stores into larger (unaligned) stores.
(MOVBstoreconst [c] {s} p x:(MOVBstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVBstoreconst [a] {s} p x:(MOVBstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVWstoreconst [c] {s} p x:(MOVWstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVWstoreconst [a] {s} p x:(MOVWstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVLstoreconst [c] {s} p x:(MOVLstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVLstoreconst [a] {s} p x:(MOVLstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVQstoreconst [c] {s} p x:(MOVQstoreconst [c2] {s} p mem))
&& config.useSSE
&& x.Uses == 1
&& c2.Off() + 8 == c.Off()
&& c.Val() == 0
&& c2.Val() == 0
&& clobber(x)
=> (MOVOstore [c2.Off32()] {s} p (MOVOconst [0]) mem)
// Combine stores into larger (unaligned) stores. Little endian.
(MOVBstore [i] {s} p (SHR(W|L|Q)const [8] w) x:(MOVBstore [i-1] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w mem)
(MOVBstore [i] {s} p w x:(MOVBstore [i+1] {s} p (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i] {s} p w mem)
(MOVBstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVBstore [i-1] {s} p w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w0 mem)
(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) x:(MOVBstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p0 w x:(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVBstore [i] {s} p0 w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w0 mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [16] w) x:(MOVWstore [i-2] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVWstore [i-2] {s} p w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w0 mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [16] w) x:(MOVWstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVWstore [i] {s} p0 w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w0 mem)
(MOVLstore [i] {s} p (SHRQconst [32] w) x:(MOVLstore [i-4] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w mem)
(MOVLstore [i] {s} p (SHRQconst [j] w) x:(MOVLstore [i-4] {s} p w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w0 mem)
(MOVLstore [i] {s} p1 (SHRQconst [32] w) x:(MOVLstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w mem)
(MOVLstore [i] {s} p1 (SHRQconst [j] w) x:(MOVLstore [i] {s} p0 w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w0 mem)
(MOVBstore [i] {s} p
x1:(MOVBload [j] {s2} p2 mem)
mem2:(MOVBstore [i-1] {s} p
x2:(MOVBload [j-1] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVWstore [i-1] {s} p (MOVWload [j-1] {s2} p2 mem) mem)
(MOVWstore [i] {s} p
x1:(MOVWload [j] {s2} p2 mem)
mem2:(MOVWstore [i-2] {s} p
x2:(MOVWload [j-2] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVLstore [i-2] {s} p (MOVLload [j-2] {s2} p2 mem) mem)
(MOVLstore [i] {s} p
x1:(MOVLload [j] {s2} p2 mem)
mem2:(MOVLstore [i-4] {s} p
x2:(MOVLload [j-4] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVQstore [i-4] {s} p (MOVQload [j-4] {s2} p2 mem) mem)

Finally, if you get a chance, try to check that other architectures optimize the pattern too; arm64, ppc64le and s390x all do unaligned load/store merging as well.

mundaym (Member) commented Sep 28, 2020

Aside: it would be really nice to do the unaligned load/store merging optimizations in a generic optimization pass. These rules are quite hard to maintain when there are a lot of optimizations that might interfere with the target patterns.

mvdan (Member) commented Sep 28, 2020

As a quick aside, when new rules are added, is there anything to warn us about them making other rules suddenly trigger less often? With the huge number of rules we have today, it's practically impossible to foresee that kind of interaction.

I guess we can keep adding more and more tests to cover common cases, but it still feels like some sort of tooling would be nice. I imagine there are a few rules today that basically never trigger anymore, for example.

mundaym (Member) commented Sep 28, 2020

@mvdan I use compilecmp a lot (https://github.com/josharian/compilecmp). That at least tells me when the generated code for a function grows significantly (often a sign I've broken a pre-existing rewrite rule). In practice the codegen tests are also fairly good at catching a lot of stuff; there is quite a lot of coverage there now.

andybons added the NeedsFix label Sep 29, 2020
agarciamontoro (Contributor):

Thank you for all the detailed info; I'll start investigating as soon as possible! I may get back to you in the coming days with questions, as this would be my first contribution to the project :)

josharian (Contributor, Author):

I just bumped into this again. @agarciamontoro anything I can do to help you out here?

agarciamontoro (Contributor):

Hi @josharian, sorry for the ghosting here, life got in the way!

I did take a look at this a couple of weeks ago and learned quite a lot about how everything works. However, I got stuck when trying to identify which rules are responsible for the generated code here. My initial idea was simply to add a new rule that covers this specific issue, but I wanted to first find the root cause of the misbehaving rule (or rules).

The problem is that we're generating code to store a byte, then a long, then a word, and finally the last byte.

v34 (+79) = MOVQload <uint64> v7 v1
v171 (+84) = MOVBstore <mem> v248 v34 v1
v168 (+85) = SHRQconst <uint64> [8] v34
v216 (+89) = SHRQconst <uint64> [40] v34
v180 (+91) = SHRQconst <uint64> [56] v34
v219 (88) = MOVLstore <mem> [1] v248 v168 v171
v243 (90) = MOVWstore <mem> [5] v248 v216 v219
v255 (91) = MOVBstore <mem> [7] v248 v180 v243
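
In other words, relative to the destination pointer v248, the four stores cover the eight bytes like this (w being the loaded value v34; this is just a restatement of the dump above):

v171: MOVBstore         // byte 0     (w)
v219: MOVLstore [1]     // bytes 1-4  (w >> 8)
v243: MOVWstore [5]     // bytes 5-6  (w >> 40)
v255: MOVBstore [7]     // byte 7     (w >> 56)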

There are of course rules for the "normal" cases, where we coalesce sequential byte stores into a quad. But there is nothing for converting the sequence B, L, W, B into a Q. We could add a specific rule for that, but I would like to know why we have this weird sequence in the first place (which is where I got stuck).

If you could give me a hint on where to look to debug this, that would be awesome.

josharian (Contributor, Author):

No worries on disappearing. It’s been a tough, unpredictable year for everyone.

It’s not easy to debug SSA rewrite rules. There’s a logging mode you can use: pass the compiler the flag -d=ssa/lower/debug=2. But the output may be voluminous. And managing rewrite rule dependencies is a headache. As long as there’s a codegen test, I’d personally be comfortable just matching the 1-4-2-1 pattern.

Don’t hesitate to ask if there’s more we can help with.

agarciamontoro (Contributor):

Thank you for the tip; I'll try to investigate the issue with the logging mode. If there's no success, I'll simply match the weird pattern; I already have the codegen test.

Thanks for the help!

josharian (Contributor, Author):

Marking this as Go 1.17. It is a regression, I've encountered it in multiple codebases, it has come up multiple times in Gopher Slack, and it is impacting design decisions in inet.af/netaddr.

danderson added a commit to inetaf/netaddr that referenced this issue Dec 25, 2020
This allows IP manipulation operations to operate on IPs as two registers, which empirically
leads to significant speedups, in particular on OOO superscalars where the two halves can
be processed in parallel.

You might expect that we could keep the representation as [16]byte, and do a cycle of
BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient
code. Unfortunately, due to a variety of missing optimizations in the Go compiler,
that is not the case and that code turns into byte-wise operations. On the other hand,
converting to a uint64 pair at construction results in efficient construction (a pair of
MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit
pairs). See also:
 - golang/go#41663
 - golang/go#41684
 - golang/go#42958
 - The discussion in #63

name                              old time/op    new time/op    delta
StdIPv4-8                            146ns ± 2%     141ns ± 2%    -3.42%  (p=0.016 n=5+5)
IPv4-8                               120ns ± 1%     107ns ± 2%   -10.65%  (p=0.008 n=5+5)
IPv4_inline-8                        120ns ± 0%     118ns ± 1%    -1.67%  (p=0.016 n=4+5)
StdIPv6-8                            211ns ± 2%     215ns ± 1%    +2.18%  (p=0.008 n=5+5)
IPv6-8                               281ns ± 1%     252ns ± 1%   -10.19%  (p=0.008 n=5+5)
IPv4Contains-8                      11.8ns ± 4%     4.7ns ± 2%   -60.00%  (p=0.008 n=5+5)
ParseIPv4-8                         68.1ns ± 4%    78.8ns ± 1%   +15.74%  (p=0.008 n=5+5)
ParseIPv6-8                          419ns ± 1%     409ns ± 0%    -2.40%  (p=0.016 n=4+5)
StdParseIPv4-8                      73.7ns ± 1%    88.8ns ± 2%   +20.50%  (p=0.008 n=5+5)
StdParseIPv6-8                       132ns ± 2%     134ns ± 1%      ~     (p=0.079 n=5+5)
IPPrefixMasking/IPv4_/32-8          36.3ns ± 3%     4.8ns ± 4%   -86.72%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/17-8          39.0ns ± 0%     4.8ns ± 3%   -87.78%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/0-8           36.9ns ± 2%     4.8ns ± 4%   -87.07%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/128-8         32.7ns ± 1%     4.7ns ± 2%   -85.47%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/65-8          39.8ns ± 1%     4.7ns ± 1%   -88.13%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/0-8           40.7ns ± 1%     4.7ns ± 2%   -88.41%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/128-8     136ns ± 3%       5ns ± 2%   -96.53%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8      142ns ± 2%       5ns ± 1%   -96.65%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8       143ns ± 2%       5ns ± 3%   -96.67%  (p=0.008 n=5+5)
IPSetFuzz-8                         22.7µs ± 2%    16.4µs ± 2%   -27.84%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
StdIPv4-8                            16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv4-8                               0.00B          0.00B           ~     (all equal)
IPv4_inline-8                        0.00B          0.00B           ~     (all equal)
StdIPv6-8                            16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv6-8                               16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv4Contains-8                       0.00B          0.00B           ~     (all equal)
ParseIPv4-8                          0.00B          0.00B           ~     (all equal)
ParseIPv6-8                          48.0B ± 0%     48.0B ± 0%      ~     (all equal)
StdParseIPv4-8                       16.0B ± 0%     16.0B ± 0%      ~     (all equal)
StdParseIPv6-8                       16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPPrefixMasking/IPv4_/32-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv4_/17-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv4_/0-8            0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/128-8          0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/65-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/0-8            0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_zone_/128-8     16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8      16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8       16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPSetFuzz-8                         2.60kB ± 0%    2.60kB ± 0%      ~     (p=1.000 n=5+5)

name                              old allocs/op  new allocs/op  delta
StdIPv4-8                             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv4-8                                0.00           0.00           ~     (all equal)
IPv4_inline-8                         0.00           0.00           ~     (all equal)
StdIPv6-8                             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv6-8                                1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv4Contains-8                        0.00           0.00           ~     (all equal)
ParseIPv4-8                           0.00           0.00           ~     (all equal)
ParseIPv6-8                           3.00 ± 0%      3.00 ± 0%      ~     (all equal)
StdParseIPv4-8                        1.00 ± 0%      1.00 ± 0%      ~     (all equal)
StdParseIPv6-8                        1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPPrefixMasking/IPv4_/32-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv4_/17-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv4_/0-8             0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/128-8           0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/65-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/0-8             0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_zone_/128-8      1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8       1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8        1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPSetFuzz-8                           33.0 ± 0%      33.0 ± 0%      ~     (all equal)

Signed-off-by: David Anderson <dave@natulte.net>
agarciamontoro (Contributor):

Expect something from my side this weekend, @josharian; I'm on vacation now and have a bit more time. If I'm not able to solve it by then, I'll let it go :)

josharian (Contributor, Author):

We have a few months yet before 1.17, so no rush. :)

danderson added a commit to inetaf/netaddr that referenced this issue Dec 26, 2020
majst01 pushed a commit to majst01/netaddr that referenced this issue Dec 26, 2020
agarciamontoro (Contributor) commented Dec 28, 2020

It looks like I have a working rule, both for amd64:

(MOVBstore [7] p1 (SHRQconst [56] w)
  x1:(MOVWstore [5] p1 (SHRQconst [40] w)
  x2:(MOVLstore [1] p1 (SHRQconst [8] w)
  x3:(MOVBstore p1 w mem))))
  && x1.Uses == 1
  && x2.Uses == 1
  && x3.Uses == 1
  && clobber(x1, x2, x3)
  => (MOVQstore p1 w mem)

and for s390x, using its corresponding ops:

(MOVBstore [7] p1 (SRDconst w)
  x1:(MOVHBRstore [5] p1 (SRDconst w)
  x2:(MOVWBRstore [1] p1 (SRDconst w)
  x3:(MOVBstore p1 w mem))))
  && x1.Uses == 1
  && x2.Uses == 1
  && x3.Uses == 1
  && clobber(x1, x2, x3)
  => (MOVDBRstore p1 w mem)

The current rules seem to already optimize the generated code both for arm64 and ppc64le, so I did not need to do anything else there.

This change makes the new codegen test pass, but I do have a couple of questions before submitting a patch:

  1. I'm not sure if these rules are too specific (both the pattern itself and the constants in the move and shift ops). Maybe specificity is not an issue at this step (you already said that you were ok matching the 1-4-2-1 pattern), but I'm curious about how you evaluate/measure that (if there is a need to evaluate that at all). I'm mostly concerned about avoidable performance issues that could arise from adding overly-specific rules that match weird patterns.
  2. I tested this for other architectures using the --all_codegen flag (as in go run test/run.go --all_codegen -- test/codegen/memcombine.go), but I'm not sure whether this is the right approach. Should I do anything else to test this for architectures other than amd64?

agnivade (Contributor):

Nice work @agarciamontoro.

  1. I'll let @josharian answer that. But we usually look at the impact of the change and how widespread it is. If it has a significant impact, and the change is minimal, it should be okay to add a specific pattern. But the question does come up as to why just use 1 and not the rest of 4!-1 patterns.
  2. That is the right approach. As long as you have architecture-specific annotations, you don't need to do anything else.

agarciamontoro (Contributor):

Thank you, @agnivade!

But the question does come up as to why just use 1 and not the rest of 4!-1 patterns.

Yup, that's a good question. The specific pattern I used is the one generated when compiling the function in the description of the issue. My guess is that this happens because of the order of the previously applied rules, but I was not able to identify the ones that caused the problem: the output from -gcflags=-d=ssa/lower/debug=2 is indeed voluminous, and from what I saw, it logs every value changed instead of every rule applied (which can affect multiple values), so it's hard to keep track of what's happening. Is there some other debug mode I don't know of that we could use here?

I can probably come up with a rule that covers all the cases, but do we have proof of the other cases actually happening? I'm not sure how we could check that.

josharian (Contributor, Author):

Yay!

I'm not sure if these rules are too specific (both the pattern itself and the constants in the move and shift ops). Maybe specificity is not an issue at this step (you already said that you were ok matching the 1-4-2-1 pattern), but I'm curious about how you evaluate/measure that (if there is a need to evaluate that at all).

The specificity is fine. This is a particular pattern that arises, so this is the pattern we match. It would be nice eventually to do something more principled around how we order and apply rewrite rules, but it gets very weedy very quickly, and this is Good Enough. And we have codegen tests to prevent regressions.

I'm mostly concerned about avoidable performance issues that could arise from adding overly-specific rules that match weird patterns.

The incremental cost of adding additional rewrite rules is very low.

If you're curious, you could add an && noteRule("1421") to the LHS of the rule and re-run make.bash to see how often it triggers. Or use compilecmp -fn=changed to see what functions get optimized. But in this case, because we know it shows up in the wild, that supporting evidence is not too critical.
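
For example, on the amd64 rule above that might look something like this (noteRule just prints its argument and returns true each time the rule matches, so adding it to the conditions doesn't change what the rule does):

(MOVBstore [7] p1 (SHRQconst [56] w)
  x1:(MOVWstore [5] p1 (SHRQconst [40] w)
  x2:(MOVLstore [1] p1 (SHRQconst [8] w)
  x3:(MOVBstore p1 w mem))))
  && x1.Uses == 1
  && x2.Uses == 1
  && x3.Uses == 1
  && noteRule("1421")
  && clobber(x1, x2, x3)
  => (MOVQstore p1 w mem)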

gopherbot (Contributor):

Change https://golang.org/cl/280456 mentions this issue: cmd/compile: add rule to coalesce writes

agarciamontoro (Contributor):

Just sent the change matching the specific pattern. Thank you for the tip on noteRule, though; it's super useful!

golang locked and limited conversation to collaborators Feb 24, 2022