
Improving asm generation #140

Closed
arczi84 opened this issue Feb 7, 2021 · 84 comments
Labels
close pending: the ticket will be closed soon, whether fixed or not

Comments

arczi84 commented Feb 7, 2021

Hi, bebbo.
We are working on a Sonic game port to Amiga and we noticed that the generated asm code is really bad...
By simply tweaking the asm we got 10 fps more!
Here is a comparison between the original asm generated with -O2 -m68080 -fbbb and our little tweak:

https://www.textcompare.org/index.html?id=601fb8ead316300017a2c11f

As you can see, and.l is responsible for a major CPU slowdown here.
Do you think this could be tweaked?

bebbo (Owner) commented Feb 7, 2021

can you provide the C code snippet? Maybe even on https://franke.ms/cex/ ?

bebbo (Owner) commented Feb 7, 2021

arczi84 (Author) commented Feb 7, 2021

Looks good, did you do this or does -m68040 do that?

bebbo (Owner) commented Feb 7, 2021

I changed nothing yet

arczi84 (Author) commented Feb 7, 2021

I have gcc6 version 6.5.0b 200815191133 (GCC).
Is there a newer one prebuilt?

arczi84 (Author) commented Feb 7, 2021

So g++ is generating really bad code compared to gcc.

bebbo (Owner) commented Feb 7, 2021

I don't see a difference between C and C++:
http://franke.ms/cex/z/Yf8Tcq
http://franke.ms/cex/z/58xoWo

bebbo (Owner) commented Feb 7, 2021

> So g++ is generating really bad code compared to gcc.

build was stalled... running a new Windows build atm.

arczi84 (Author) commented Feb 7, 2021

Nice.

arczi84 (Author) commented Feb 7, 2021

This version is even slower: 54 fps vs 57 fps -_-
The generated asm file is 943 now; it was 901.

https://www.textcompare.org/index.html?id=60203858c7a2ab00178f6c3c

Maybe the difference comes from the RetroEngine.hpp header?

LTO doesn't work:
collect2: fatal error: COLLECT_LTO_WRAPPER must be set

Maybe it just needs to be set somewhere?
export COLLECT_LTO_WRAPPER=/d/amiga-gcc2/libexec/gcc/m68k-amigaos/6.5.0b/lto-wrapper
That didn't help though.

bebbo (Owner) commented Feb 7, 2021

do you have an "easy to compile" version of this project where I could measure stuff myself?

bebbo (Owner) commented Feb 7, 2021

> https://www.textcompare.org/index.html?id=601fb8ead316300017a2c11f

Looking at the difference I'd guess you are using some -fbbb=... flags and 'e' is not included.

arczi84 (Author) commented Feb 7, 2021

Do you have a Vampire4? The game is optimised for it;
direct SAGA audio and SAGA RTG are used, so it won't work on WinUAE.
But it only needs libSDL, so it compiles easily and you could generate asm files and compare.
Here is the official GitHub: https://github.com/Rubberduckycooly/Sonic-1-2-2013-Decompilation.
I have sent the sources to your e-mail.

arczi84 (Author) commented Feb 7, 2021

I'm not using -fbbb=-

bebbo (Owner) commented Feb 7, 2021

> This version is even slower: 54 fps vs 57 fps -_-
> The generated asm file is 943 now; it was 901.
>
> https://www.textcompare.org/index.html?id=60203858c7a2ab00178f6c3c
>
> Maybe the difference comes from the RetroEngine.hpp header?
>
> LTO doesn't work:
> collect2: fatal error: COLLECT_LTO_WRAPPER must be set
>
> Maybe it just needs to be set somewhere?
> export COLLECT_LTO_WRAPPER=/d/amiga-gcc2/libexec/gcc/m68k-amigaos/6.5.0b/lto-wrapper
> That didn't help though.

uhm - have to check that for windows...
... why don't you switch to WSL (e.g. ubuntu)? It's so much faster!!!

arczi84 (Author) commented Feb 7, 2021

I used WSL but I lost all files there...
I compile within Eclipse, it's so convenient!!
(with -j12 it's fast enough).

bebbo (Owner) commented Feb 7, 2021

> I used WSL but I lost all files there...
> I compile within Eclipse, it's so convenient!!
> (with -j12 it's fast enough).

You can use the Linux Eclipse in Windows, simply use e.g. VcXSrv as an X Window server - https://sourceforge.net/projects/vcxsrv/files/vcxsrv/

... but that's not related to this issue.

Back to the topic:

The code shown in Compiler Explorer looks better because (luckily) the registers' high bytes are zero and the optimizer omits the and. So it's tough to force this somehow...

bebbo (Owner) commented Feb 8, 2021

I managed to reproduce the issue on cex: http://franke.ms/cex/z/rW3n66
even smaller: http://franke.ms/cex/z/6x8e5W

bebbo transferred this issue from bebbo/amiga-gcc Feb 9, 2021
bebbo (Owner) commented Feb 9, 2021

progress: https://franke.ms/cex/z/8xecqW

arczi84 (Author) commented Feb 9, 2021

Very nice, that was it!
Just, if you look at the code, do you see that the compiler uses D4 as the INDEX
but register A0 as the loop counter?
This is not smart; D4 as the loop counter would be much better
and would allow faster and smaller code.

bebbo added a commit that referenced this issue Feb 9, 2021
before each assignment of a byte or word from mem into a previously
unused register, clear that register.

This helps the existing optimizations to eliminate 'and' for unsigned
extends. The 'clr' insns without effect are also removed.

This leads to better code: http://franke.ms/cex/z/8xecqW

Also fixed a bug in the usage analyzer.
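
For illustration only, a minimal sketch of the pattern this commit targets (hypothetical function, not from the Sonic source): an unsigned byte loaded into a wider value must be zero-extended, and the naive zero-extension is exactly the and.l that showed up above, while a clr before the move.b makes the and redundant. The asm in the comments is illustrative, not actual compiler output.

/* Hypothetical example of the unsigned-extend pattern.
 * Naive codegen (sketch):
 *     move.b (a0)+,d0
 *     and.l  #255,d0      ; explicit zero-extension, the slow 'and.l'
 * With the register cleared first, the 'and' can be dropped:
 *     clr.l  d0
 *     move.b (a0)+,d0     ; upper 24 bits are already zero
 */
unsigned long sum_bytes(const unsigned char *p, unsigned long n)
{
    unsigned long total = 0;
    while (n--)
        total += *p++;      /* unsigned char widened to unsigned long */
    return total;
}
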
arczi84 (Author) commented Feb 9, 2021

There is nothing wrong with using a DATA register as an INDEX;
you can use either a DATA or an ADDRESS register as the INDEX,
both work just the same way.

But for LOOP counts it is generally always best to use a DATA register.
Only on a DATA register can you use DBRA.
Only on a DATA register can you count down with "free" flags creation;
if you use an ADDRESS register for the loop count then you need 1 extra CMP or TST instruction in the work loop,
and this is not good.

In this example:

subq.l #1,a0
move.l a0,d4
jne .L51

3 instructions are used for doing the LOOP operation:
subq for the count down,
move a0,d4 to create flags (using a tmp register),
jne for the loop.
That is 3 instructions, while the same could have been done with 1 DBRA.
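
For readers, a minimal sketch of why DBRA wants a data register and why the counter starts at N-1 (hypothetical function name, asm in the comment is a sketch): dbra decrements the low word of a data register and branches until it reaches -1, so the whole loop control collapses into a single instruction.

/* Hypothetical loop shape. With the counter in a data register the
 * whole loop control is one DBRA:
 *
 *     move.w #100-1,d4        ; dbra runs until the counter hits -1
 * .L6:
 *     move.l _xScrollOffset,(a3)+
 *     dbra   d4,.L6           ; decrement low word of d4 and branch
 *
 * instead of the subq.l / move.l / jne sequence needed when the
 * counter lives in an address register.
 */
void fill100(long *dst, long value)
{
    int i;
    for (i = 0; i < 100; i++)
        *dst++ = value;
}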

arczi84 (Author) commented Feb 9, 2021

add.l d4,d4
add.l d4,d4

This code makes no sense for higher CPUs; a shift would be better.

bebbo (Owner) commented Feb 9, 2021

I'd guess: all data registers are used inside the loop...

bebbo (Owner) commented Feb 9, 2021

(timing screenshots)

2 x ADD on 68040 is faster than a single shift

bebbo (Owner) commented Feb 9, 2021

err: it's the shift that's faster:
(timing screenshot)

arczi84 (Author) commented Feb 9, 2021

Also, let's look at this:

clr.l d2
lea (a4,d2.l),a5
lea (9,a3,a5.l),a5

What do these 3 lines do?
D2 is cleared, which means
lea (a4,d2.l),a5
is just a roundabout form of MOVEA.L A4,A5,
and A5 gets overwritten in the next line, so it's useless.
lea (9,a3,a4.l),a5
is all that was wanted.

This ASM code, as it is, is silly.
Was the related C code also silly?

About the LOOP, look here:

move.w #100,a1
.L6:
move.l _xScrollOffset,(a3)+
subq.l #1,a1
move.l a1,d4
jne .L6

The compiler has D4 free and uses it as a TMP register in the loop;
this code is very, very bad.

The compiler should have done:
move.w #99,D4
LOOP
move.l whatever,(A3)+
dbra d4,LOOP

bebbo (Owner) commented Feb 9, 2021

> The compiler has D4 free and uses it as a TMP register in the loop;
> this code is very, very bad.
>
> The compiler should have done:
> move.w #99,D4
> LOOP
> move.l whatever,(A3)+
> dbra d4,LOOP

d4 is used as scratch inside the loop:

        jeq .L23
        move.w (a2,d4.l*2),(4,a1)

or what code are you referring to?

The other issue is presumably created during jump optimization, where the expression is duplicated from above with d2 holding some value; at the copied location d2 is set to zero.
Yes, that code looks silly - but it's not easy to fix.

GunnarVB commented Feb 12, 2021 via email

arczi84 (Author) commented Feb 12, 2021

@GunnarVB

I noticed something else which doesn't look right.
Let's look at this code please.
Used flags: "-m68040 -O2 -fomit-frame-pointer"

C code:
pixel=*gfxDataPtr++;
if (pixel > 0)
*frameBufferPtr = activePalette[*gfxDataPtr];
++frameBufferPtr;

Created ASM:
tst.b (3,a1) 4 byte
jeq .L24
move.b (4,a1),d5 4 byte
move.w (a3,d5.l*2),(6,a0)
.L24:

I think this ASM is not as good as the "literal" translation of the C code would be.
I think the literal translation should have been:
move.b (a1)+,D5 -- 2 2
jeq .L24
move.w (a3,d5.l*2),(6,a0)
.L24:

The code that GCC created is 4 bytes longer, 1 instruction longer, and several clocks slower.
Can this be fixed?
Can you help us understand why GCC preferred the worse code?

@GunnarVB

Hi Stefan,

please ignore the above post.
I made a copy-paste error.
I need to get used to the online tool. :)
The bug report above is not correct.

The correct report is this:

The created code looks like this:
.L21:
move.b (1,a2),d4
jeq .L22
move.w (a1,d4.l*2),(2,a0)
.L22:
move.b (2,a2),d4
jeq .L23
move.w (a1,d4.l*2),(4,a0)
.L23:
move.b (3,a2),d4
jeq .L24
move.w (a1,d4.l*2),(6,a0)
.L24:
move.b (4,a2),d4
jeq .L25
move.w (a1,d4.l*2),(8,a0)

So the byte reads used are:
move.b (1,a2),d4
move.b (2,a2),d4
move.b (3,a2),d4
move.b (4,a2),d4
and then an ADD to advance the pointer.

GCC prefers to use (16bit offset,Ptr) instead of using (Ptr)+.
This leads to slower and much longer code.
In our example it increases the code by 2 bytes per usage (4*2 bytes, plus the size of the pointer ADD).
I think this means 12 bytes more code and 1 extra instruction in our case.

The (An)+ address mode does not increase code size.
Can you help GCC to use it?
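
For reference, a hedged reconstruction of the loop shape under discussion (names are made up): written with pointer post-increments, the ideal codegen reads each source byte with (An)+ instead of (d16,An) plus one big pointer add at the end, which is what the improved output further down shows.

/* Hypothetical sketch of the blit inner loop: read consecutive source
 * bytes, and for every non-zero pixel write a palette entry to the
 * destination. Ideally each read is a single  move.b (a3)+,d3
 * rather than  move.b (n,a2),d4  with a pointer add afterwards.
 */
void blit_row(const unsigned char *src, unsigned short *dst,
              const unsigned short *activePalette, int width)
{
    while (width--) {
        unsigned char pixel = *src++;     /* wants (An)+ addressing */
        if (pixel > 0)
            *dst = activePalette[pixel];
        ++dst;                            /* dst always advances */
    }
}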

bebbo (Owner) commented Feb 12, 2021

http://franke.ms/cex/z/qqPrzE

typedef char byte; creates way worse code than typedef unsigned char byte;
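
A minimal sketch of why the typedef matters (hypothetical helper names; assuming plain char is signed on this target, which is the m68k gcc default): a signed byte index has to be sign-extended before the table lookup and can even go negative, while an unsigned byte only needs a zero-extension that the optimizer can often drop.

/* Hypothetical illustration of the signed vs. unsigned byte index. */
typedef unsigned char ubyte;   /* zero-extends when widened             */
typedef char          sbyte;   /* plain char: signed on m68k by default */

unsigned short lookup_u(const unsigned short *palette, ubyte i)
{
    return palette[i];         /* index 0..255, cheap zero-extend */
}

unsigned short lookup_s(const unsigned short *palette, sbyte i)
{
    return palette[i];         /* index -128..127, needs sign-extension
                                  (ext.w/ext.l) before the scaled access */
}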

@GunnarVB

Hi Bebbo,

I have two more questions:

1st) Can we help you in some way?
You mentioned that the GCC cost "database" is not correct.
That GCC does
MOVE.b (A0),D0
MOVE.b 1(A0),D1
MOVE.b 2(A0),D2
instead of
MOVE.b (A0)+,D0
MOVE.b (A0)+,D1
MOVE.b (A0)+,D2
seems to prove you right, as the 2nd version is shorter and better.

Is there a way to help you review this cost database and support you?

2nd) I wonder if there is a way to enable some tuned instructions for the 68080 CPU.
For example, when adding a CONSTANT to a LONG the 68K has the option to use:
ADDQ.L #8,Dn -- for values from 1 to 8
ADDW.L #16bit,Dn -- for values up to 16 bit
ADDI.L #32bit,Dn -- for LONG values

And for doing a COMPARE:
CMPW.L #16bit,Dn -- for values up to 16 bit
CMPI.L #32bit,Dn -- for LONG values

The "W.L" instructions above are new with the 68080.
They allow shorter code, and faster code when an instruction is saved.

The opcode for ADDIW is
ADDIW.L 0000 0110 11 EA | 16bit signed imm

The opcode for CMPIW is
CMPIW.L 0100 1110 00 EA | 16bit signed imm

Could you enable them please?

bebbo (Owner) commented Feb 12, 2021

The costs for postincrement are actually the same as without offset. The challenge here is to analyze all branches to fix the insns and ensure the original value isn't used elsewhere.

For new opcodes the binutils assembler needs an enhancement first. Then it's possible to add what the assembler supports.

@GunnarVB

> The costs for postincrement are actually the same as without offset.
> The challenge here is to analyze all branches to fix the insns and
> ensure the original value isn't used elsewhere.

Not sure I fully understand.
For the sake of argument let's say that
(a0) = cost 2
(a0)+ = cost 2
d16(a0) = cost 3

With such cost database values, would GCC not then automatically pick the "lower" cost code?

@GunnarVB

I fully understand that a compiler cannot always create the "perfect" code.

But maybe we can help the compiler at least to create a reasonable compromise?

I think an instruction has several costs:
a) cost in cycles
b) cost in instruction length, which affects the cache hit rate and instruction fetch, as a CPU cannot load an infinite number of bytes from the Icache per cycle.
c) cost of memory access. While a CPU might be able to do a Dcache access at low cost or even for free, there is a risk of a cache miss and then a high cost. So if there are two instructions producing the same result, the option without a memory access is preferable.

Would a cost database like the following make some sense?

COST1 = cycles * 10
COST2 = instruction_length * 2
COST3 = memory_access * 5

TOTAL COST = COST1 + COST2 + COST3

I believe that using an updated cost database for all instructions, similar to the above, would help GCC to create a "reasonable" selection of instructions.

Would it be possible to do this?
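
Purely to illustrate the proposed weighting (a toy sketch, not GCC's actual cost hooks; the weights 10, 2 and 5 and the field names are just the ones suggested above):

/* Toy model of the proposed blend: not GCC code.
 * total = cycles*10 + instruction_length*2 + memory_access*5
 */
struct insn_estimate {
    int cycles;         /* estimated execution cycles          */
    int words;          /* instruction length in 16-bit words  */
    int mem_accesses;   /* number of data memory accesses      */
};

static int insn_cost(const struct insn_estimate *e)
{
    return e->cycles * 10 + e->words * 2 + e->mem_accesses * 5;
}

/* e.g. a 1-cycle, 2-word, 1-access MOVE.L -8(SP),D0 would score
 * 10 + 4 + 5 = 19 under these made-up weights. */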

bebbo (Owner) commented Feb 12, 2021

gcc does not know full insns. It needs to know the cost for half insns but can distinguish between src and dst.

If the value 4 corresponds to 1 cycle you could e.g. define
src register -> 2
dst register -> 2
and a move dx,dy would then yield the cost 4.

Or you can define the cost of an operator with given parameters,
e.g. if the cost for AND is 0 on the 68080, an and yields the same cost as a move.

This is for optimizing for speed. Optimizing for size currently uses the same costs for all insns, but the operand size would be a better criterion - someone could do that^^

To model memory access you'd have to define a state machine with pipelines, as I did for the 68080 floating point ops. Even more work :-)

bebbo (Owner) commented Feb 12, 2021

m68k_68000_10_costs.zip
have fun

bebbo added a commit that referenced this issue Feb 12, 2021
bebbo (Owner) commented Feb 12, 2021

if you use short for counters you get this now:
http://franke.ms/cex/z/9je3qj

        move.b (a3)+,d3
        jeq .L21
        move.w (a2,d3.l*2),(a0)
.L21:
        move.b (a3)+,d3
        jeq .L22
        move.w (a2,d3.l*2),(2,a0)
.L22:
        move.b (a3)+,d3
        jeq .L23
        move.w (a2,d3.l*2),(4,a0)
.L23:
        move.b (a3)+,d3
        jeq .L24
        move.w (a2,d3.l*2),(6,a0)
.L24:
        move.b (a3),d3
        jeq .L25
        move.w (a2,d3.l*2),(8,a0)
.L25:
        lea (10,a0),a0
        dbra d1,.L26
        jra .L56

bebbo added a commit that referenced this issue Feb 12, 2021
bebbo added a commit that referenced this issue Feb 12, 2021
labels in the autoinc path must either be visited before or must not
report the autoinc register as used.
bebbo (Owner) commented Feb 13, 2021

close pending :-)

@GunnarVB

Very nice improvements, thank you and well done!

Regarding the CostTable you sent me:
we want to help you improve it.
What is the best way forward?

bebbo added a commit that referenced this issue Feb 13, 2021
keeps these out of loops, if the clr is not used inside of loop.
bebbo added a commit that referenced this issue Feb 13, 2021
- opt_strcpy is working again
- clr's are removed top to down, to have it less likely in loops
bebbo added a commit that referenced this issue Feb 14, 2021
bebbo (Owner) commented Feb 14, 2021

> Very nice improvements, thank you and well done!
>
> Regarding the CostTable you sent me:
> we want to help you improve it.
> What is the best way forward?

the best would be a working cost function for each cpu ^^

also good: a table which lists the source and dest costs for each operation and size. An example:

and.l d0,d1:
source: AND(d0, d1)
destination: d1

move.w (4,a1),d2
source: MEM(PLUS(a1, 4))
dest: d2

if destination costs are the same, they can be grouped.

costs for shift/rot/mul/div need a formula, but these should be quite ok already.

The cost for 1 cpu cycle is 4.
E.g. if the move.l d0,d1 takes one cycle, it should result in a cost of 4, and maybe the costs could be split up as src cost = 2 and dst cost = 2...
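
As a sketch only (invented numbers, not a real cost table), the kind of src/dst half-cost listing described above could be written down like this, with 4 standing for one cycle split across the two halves:

/* Hypothetical src/dst half-cost entries in the spirit of the format
 * described above. All numbers are invented; 4 = one CPU cycle. */
struct half_costs {
    const char *shape;   /* operation and operand shape           */
    int src_cost;        /* cost charged to the source half       */
    int dst_cost;        /* cost charged to the destination half  */
};

static const struct half_costs example_table[] = {
    { "MOVE.L reg,reg",      2, 2 },  /* one cycle, split 2 + 2       */
    { "AND.L  reg,reg",      2, 2 },  /* AND itself free -> like move */
    { "MOVE.W (d16,An),reg", 6, 2 },  /* MEM(PLUS(An,d16)) source     */
};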

@GunnarVB

Hi Bebbo,

> The cost for 1 cpu cycle is 4

Can we make this more fine-grained?
Can we, for example, have 3 parameters:

  1. Cost of cycle
  2. Cost of instruction word
  3. Cost of memory access

Maybe defined like this:
Ccycle = 9
Cword = 2
Cmemory = 7

With such a formula we get results like:
MOVEQ #1,D0 = 11
MOVE.L #23456,D0 = 15
MOVE.L -8(SP),D0 = 18

Could we do this? Or is it required that Cycle is always = 4?

bebbo (Owner) commented Feb 14, 2021

> The cost for 1 cpu cycle is 4
>
> Can we make this more fine-grained?

you can use as many multiples of 4 as you like. you could also use a base cost of 4, but some locations in gcc still assume that the cost per cycle is 4... I recommend to scale all values by 4.

plus you never provide the cost for a full instruction, only the cost for the source or the destination - and it's hard to split 1 with integers properly...

> Can we, for example, have 3 parameters:
>
> 1. Cost of cycle

the cost of a cycle is fixed and it is 4.

> 2. Cost of instruction word

for -Os it makes sense to provide a size-based cost.

> 3. Cost of memory access

this is done for each source or destination. If it's a MEM(...) then the expression inside is taken and the addressing costs are returned based on the used addressing mode.

> Maybe defined like this:
> Ccycle = 9
> Cword = 2
> Cmemory = 7

memory costs are maybe constant for the 68080, but not for other cpus, right?

> With such a formula we get results like:
> MOVEQ #1,D0 = 11
> MOVE.L #23456,D0 = 15
> MOVE.L -8(SP),D0 = 18
>
> Could we do this? Or is it required that Cycle is always = 4?

as written above - if I rewrite the cost stuff, I'll use the value 4 as the cost for 1 cycle. If an instruction needs 8 cycles on a 68000 it should yield 32.

@GunnarVB

> Cost of instruction word: for -Os it makes sense to provide a size-based cost.

But instruction length always has a cycle cost too.
For the 68000 each extra word costs 4 cycles.
For the 68020/030 each extra word costs 2 cycles.

For the 68040/68080 instruction length has less directly measurable cost.
But the longer the instructions used, the more Icache they consume and the higher the chance of cache misses, which then cost a lot in memory fetches.
Also, each CPU has an Icache fetch/decode limitation:
it is 2 bytes per 2 cycles for the 68020/68030 and 4 bytes per cycle for the 68060.
So the longer the instructions get, the more cost is created, directly and indirectly.

I think it is very important for the compiler to try to use the shortest possible instruction for the job.

bebbo (Owner) commented Feb 14, 2021

I'm talking about the total cost of an instruction. All in. There is no extra cost for fetching or addressing; these costs are included. That's why for MEM some weird calculation is used to determine the total cost:
: (ax), (ax)+
: -(ax)
: (n,ax)
: (n,ax,dy), ||n|| < 8
: (dx)
: (n,ax,dy.w/l*1/2/4/8)
: [(ax)]
: [(dx)]
...

and that's the challenge: to model all-in costs for half instructions (src, dest) so that the result for a full instruction is as close as possible to the real costs.

plus higher CPUs have worst, best and cache scenarios... I'd use values which are close to the expected values - whatever that means...

and for -Os only the instruction size matters.

bebbo closed this as completed Feb 15, 2021
bebbo added a commit that referenced this issue Feb 21, 2021