
Improving asm generation #140

Closed
arczi84 opened this issue Feb 7, 2021 · 84 comments
Labels
close pending: the ticket will be closed soon, whether fixed or not

Comments

arczi84 commented Feb 7, 2021

Hi, bebbo.
We are working on a Sonic game port to Amiga and we noticed that the generated asm code is really bad...
By simply tweaking the asm we got 10 fps more!
Here is a comparison between the original asm generated with -O2 -m68080 -fbbb and our little tweak:

https://www.textcompare.org/index.html?id=601fb8ead316300017a2c11f

As you can see, and.l is responsible for a major CPU slowdown here.
Do you think this could be tweaked?

bebbo (Owner) commented Feb 7, 2021

can you provide the C code snippet? Maybe even on https://franke.ms/cex/ ?

bebbo (Owner) commented Feb 7, 2021

arczi84 (Author) commented Feb 7, 2021

Looks good, did you do this or does -m68040 do that?

bebbo (Owner) commented Feb 7, 2021

I changed nothing yet

arczi84 (Author) commented Feb 7, 2021

I have gcc6 version 6.5.0b 200815191133 (GCC).
Is there a newer one prebuilt?

arczi84 (Author) commented Feb 7, 2021

So g++ is generating really bad code compared to gcc.

bebbo (Owner) commented Feb 7, 2021

I don't see a difference between C and C++:
http://franke.ms/cex/z/Yf8Tcq
http://franke.ms/cex/z/58xoWo

bebbo (Owner) commented Feb 7, 2021

> So g++ is generating really bad code compared to gcc.

build was stalled... running a new Windows build atm.

arczi84 (Author) commented Feb 7, 2021

Nice.

arczi84 (Author) commented Feb 7, 2021

This version is even slower: 54 fps vs 57 fps -_-
The generated asm file is 943 now; it was 901.

https://www.textcompare.org/index.html?id=60203858c7a2ab00178f6c3c

Maybe the difference comes from the RetroEngine.hpp header?

LTO doesn't work:
collect2: fatal error: COLLECT_LTO_WRAPPER must be set

Maybe it just needs to be set somewhere?
export COLLECT_LTO_WRAPPER=/d/amiga-gcc2/libexec/gcc/m68k-amigaos/6.5.0b/lto-wrapper
That didn't help though.

bebbo (Owner) commented Feb 7, 2021

do you have an "easy to compile" version of this project where I could measure stuff myself?

bebbo (Owner) commented Feb 7, 2021

> https://www.textcompare.org/index.html?id=601fb8ead316300017a2c11f

Looking at the difference I'd guess you are using some -fbbb=... flags and 'e' is not included.

arczi84 (Author) commented Feb 7, 2021

Do you have a Vampire4? The game is optimised for it;
direct SAGA audio and SAGA RTG are used, so it won't work on WinUAE.
But it only needs libSDL, so it compiles easily and you could generate asm files and compare.
Here is the official GitHub: https://github.com/Rubberduckycooly/Sonic-1-2-2013-Decompilation.
I have sent the sources to your e-mail.

arczi84 (Author) commented Feb 7, 2021

I'm not using -fbbb=-

bebbo (Owner) commented Feb 7, 2021

> This version is even slower: 54 fps vs 57 fps -_-
> The generated asm file is 943 now; it was 901.
>
> https://www.textcompare.org/index.html?id=60203858c7a2ab00178f6c3c
>
> Maybe the difference comes from the RetroEngine.hpp header?
>
> LTO doesn't work:
> collect2: fatal error: COLLECT_LTO_WRAPPER must be set
>
> Maybe it just needs to be set somewhere?
> export COLLECT_LTO_WRAPPER=/d/amiga-gcc2/libexec/gcc/m68k-amigaos/6.5.0b/lto-wrapper
> That didn't help though.

uhm - have to check that for windows...
... why don't you switch to WSL (e.g. ubuntu)? It's so much faster!!!

arczi84 (Author) commented Feb 7, 2021

I used WSL but I lost all files there...
I compile within Eclipse, it's so convenient!!
(with -j12 it's fast enough).

bebbo (Owner) commented Feb 7, 2021

> I used WSL but I lost all files there...
> I compile within Eclipse, it's so convenient!!
> (with -j12 it's fast enough).

You can use the Linux Eclipse in Windows, simply use e.g. VcXSrv as an X Window server - https://sourceforge.net/projects/vcxsrv/files/vcxsrv/

... but that's not related to this issue.

Back to the topic:

The code shown in Compiler Explorer looks better because (luckily) the registers' high bytes are zero and the optimizer omits the and. So it's tough to force this somehow...

bebbo (Owner) commented Feb 8, 2021

I managed to reproduce the issue on cex: http://franke.ms/cex/z/rW3n66
even smaller: http://franke.ms/cex/z/6x8e5W

bebbo transferred this issue from bebbo/amiga-gcc Feb 9, 2021
bebbo (Owner) commented Feb 9, 2021

progress: https://franke.ms/cex/z/8xecqW

arczi84 (Author) commented Feb 9, 2021

Very nice, that was it!
Just, if you look at the code, do you see that the compiler uses D4 as the INDEX
but register A0 as the loop counter?
This is not smart; D4 as the loop counter would be much better
and would allow faster and smaller code.

bebbo added a commit that referenced this issue Feb 9, 2021
before each assignment of a byte or word from mem into a previously
unused register, clear that register.

This helps the existing optimizations to eliminate 'and' for unsigned
extends. The 'clr' insns without effect are also removed.

This leads to better code: http://franke.ms/cex/z/8xecqW

Also fixed a bug in the usage analyzer.
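
For illustration only, a minimal sketch of the pattern this commit targets (hypothetical function, not from the Sonic source): an unsigned byte loaded into a wider value must be zero-extended, and the naive zero-extension is exactly the and.l that showed up above, while a clr before the move.b makes the and redundant. The asm in the comments is illustrative, not actual compiler output.

/* Hypothetical example of the unsigned-extend pattern.
 * Naive codegen (sketch):
 *     move.b (a0)+,d0
 *     and.l  #255,d0      ; explicit zero-extension, the slow 'and.l'
 * With the register cleared first, the 'and' can be dropped:
 *     clr.l  d0
 *     move.b (a0)+,d0     ; upper 24 bits are already zero
 */
unsigned long sum_bytes(const unsigned char *p, unsigned long n)
{
    unsigned long total = 0;
    while (n--)
        total += *p++;      /* unsigned char widened to unsigned long */
    return total;
}
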
arczi84 (Author) commented Feb 9, 2021

There is nothing wrong with using a DATA register as an INDEX;
you can use either a DATA or an ADDRESS register as the INDEX,
both work just the same way.

But for LOOP counts it is generally always best to use a DATA register.
Only on a DATA register can you use DBRA.
Only on a DATA register can you count down with "free" flags creation;
if you use an ADDRESS register for the loop count then you need 1 extra CMP or TST instruction in the work loop,
and this is not good.

In this example:

subq.l #1,a0
move.l a0,d4
jne .L51

3 instructions are used for doing the LOOP operation:
subq for the count down,
move a0,d4 to create flags (using a tmp register),
jne for the loop.
That is 3 instructions, while the same could have been done with 1 DBRA.
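
For readers, a minimal sketch of why DBRA wants a data register and why the counter starts at N-1 (hypothetical function name, asm in the comment is a sketch): dbra decrements the low word of a data register and branches until it reaches -1, so the whole loop control collapses into a single instruction.

/* Hypothetical loop shape. With the counter in a data register the
 * whole loop control is one DBRA:
 *
 *     move.w #100-1,d4        ; dbra runs until the counter hits -1
 * .L6:
 *     move.l _xScrollOffset,(a3)+
 *     dbra   d4,.L6           ; decrement low word of d4 and branch
 *
 * instead of the subq.l / move.l / jne sequence needed when the
 * counter lives in an address register.
 */
void fill100(long *dst, long value)
{
    int i;
    for (i = 0; i < 100; i++)
        *dst++ = value;
}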

arczi84 (Author) commented Feb 9, 2021

add.l d4,d4
add.l d4,d4

This code makes no sense for higher CPUs; a shift would be better.

bebbo (Owner) commented Feb 9, 2021

I'd guess: all data registers are used inside the loop...

bebbo (Owner) commented Feb 9, 2021

(timing screenshots)

2 x ADD on 68040 is faster than a single shift

bebbo (Owner) commented Feb 9, 2021

err: it's the shift that's faster:
(timing screenshot)

arczi84 (Author) commented Feb 9, 2021

Also, let's look at this:

clr.l d2
lea (a4,d2.l),a5
lea (9,a3,a5.l),a5

What do these 3 lines do?
D2 is cleared, which means
lea (a4,d2.l),a5
is just a roundabout form of MOVEA.L A4,A5,
and A5 gets overwritten in the next line, so it's useless.
lea (9,a3,a4.l),a5
is all that was wanted.

This ASM code, as it is, is silly.
Was the related C code also silly?

About the LOOP, look here:

move.w #100,a1
.L6:
move.l _xScrollOffset,(a3)+
subq.l #1,a1
move.l a1,d4
jne .L6

The compiler has D4 free and uses it as a TMP register in the loop;
this code is very, very bad.

The compiler should have done:
move.w #99,D4
LOOP
move.l whatever,(A3)+
dbra d4,LOOP

bebbo (Owner) commented Feb 9, 2021

> The compiler has D4 free and uses it as a TMP register in the loop;
> this code is very, very bad.
>
> The compiler should have done:
> move.w #99,D4
> LOOP
> move.l whatever,(A3)+
> dbra d4,LOOP

d4 is used as scratch inside the loop:

        jeq .L23
        move.w (a2,d4.l*2),(4,a1)

or what code are you referring to?

The other issue is presumably created during jump optimization, where the expression is duplicated from above with d2 holding some value; at the copied location d2 is set to zero.
Yes, that code looks silly - but it's not easy to fix.

GunnarVB commented Feb 12, 2021 via email

arczi84 (Author) commented Feb 12, 2021

@GunnarVB

I noticed something else which doesn't look right.
Let's look at this code please.
Used flags: "-m68040 -O2 -fomit-frame-pointer"

C code:
pixel=*gfxDataPtr++;
if (pixel > 0)
*frameBufferPtr = activePalette[*gfxDataPtr];
++frameBufferPtr;

Created ASM:
tst.b (3,a1) 4 byte
jeq .L24
move.b (4,a1),d5 4 byte
move.w (a3,d5.l*2),(6,a0)
.L24:

I think this ASM is not as good as the "literal" translation of the C code would be.
I think the literal translation should have been:
move.b (a1)+,D5 -- 2 2
jeq .L24
move.w (a3,d5.l*2),(6,a0)
.L24:

The code that GCC created is 4 bytes longer, 1 instruction longer, and several clocks slower.
Can this be fixed?
Can you help us understand why GCC preferred the worse code?

@GunnarVB

Hi Stefan,

please ignore the above post.
I made a copy-paste error.
I need to get used to the online tool. :)
The bug report above is not correct.

The correct report is this:

The created code looks like this:
.L21:
move.b (1,a2),d4
jeq .L22
move.w (a1,d4.l*2),(2,a0)
.L22:
move.b (2,a2),d4
jeq .L23
move.w (a1,d4.l*2),(4,a0)
.L23:
move.b (3,a2),d4
jeq .L24
move.w (a1,d4.l*2),(6,a0)
.L24:
move.b (4,a2),d4
jeq .L25
move.w (a1,d4.l*2),(8,a0)

So the byte reads used are:
move.b (1,a2),d4
move.b (2,a2),d4
move.b (3,a2),d4
move.b (4,a2),d4
and then an ADD to advance the pointer.

GCC prefers to use (16bit offset,Ptr) instead of using (Ptr)+.
This leads to slower and much longer code.
In our example it increases the code by 2 bytes per usage (4*2 bytes, plus the size of the pointer ADD).
I think this means 12 bytes more code and 1 extra instruction in our case.

The (An)+ address mode does not increase code size.
Can you help GCC to use it?
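
For reference, a hedged reconstruction of the loop shape under discussion (names are made up): written with pointer post-increments, the ideal codegen reads each source byte with (An)+ instead of (d16,An) plus one big pointer add at the end, which is what the improved output further down shows.

/* Hypothetical sketch of the blit inner loop: read consecutive source
 * bytes, and for every non-zero pixel write a palette entry to the
 * destination. Ideally each read is a single  move.b (a3)+,d3
 * rather than  move.b (n,a2),d4  with a pointer add afterwards.
 */
void blit_row(const unsigned char *src, unsigned short *dst,
              const unsigned short *activePalette, int width)
{
    while (width--) {
        unsigned char pixel = *src++;     /* wants (An)+ addressing */
        if (pixel > 0)
            *dst = activePalette[pixel];
        ++dst;                            /* dst always advances */
    }
}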

bebbo (Owner) commented Feb 12, 2021

http://franke.ms/cex/z/qqPrzE

typedef char byte; creates way worse code than typedef unsigned char byte;
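
A minimal sketch of why the typedef matters (hypothetical helper names; assuming plain char is signed on this target, which is the m68k gcc default): a signed byte index has to be sign-extended before the table lookup and can even go negative, while an unsigned byte only needs a zero-extension that the optimizer can often drop.

/* Hypothetical illustration of the signed vs. unsigned byte index. */
typedef unsigned char ubyte;   /* zero-extends when widened             */
typedef char          sbyte;   /* plain char: signed on m68k by default */

unsigned short lookup_u(const unsigned short *palette, ubyte i)
{
    return palette[i];         /* index 0..255, cheap zero-extend */
}

unsigned short lookup_s(const unsigned short *palette, sbyte i)
{
    return palette[i];         /* index -128..127, needs sign-extension
                                  (ext.w/ext.l) before the scaled access */
}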

@GunnarVB

Hi Bebbo,

I have two more questions:

1st) Can we help you in some way?
You mentioned that the GCC cost "database" is not correct.
That GCC does
MOVE.b (A0),D0
MOVE.b 1(A0),D1
MOVE.b 2(A0),D2
instead of
MOVE.b (A0)+,D0
MOVE.b (A0)+,D1
MOVE.b (A0)+,D2
seems to prove you right, as the 2nd version is shorter and better.

Is there a way to help you review this cost database and support you?

2nd) I wonder if there is a way to enable some tuned instructions for the 68080 CPU.
For example, when adding a CONSTANT to a LONG the 68K has the option to use:
ADDQ.L #8,Dn -- for values from 1 to 8
ADDW.L #16bit,Dn -- for values up to 16 bit
ADDI.L #32bit,Dn -- for LONG values

And for doing a COMPARE:
CMPW.L #16bit,Dn -- for values up to 16 bit
CMPI.L #32bit,Dn -- for LONG values

The "W.L" instructions above are new with the 68080.
They allow shorter code, and faster code when an instruction is saved.

The opcode for ADDIW is
ADDIW.L 0000 0110 11 EA | 16bit signed imm

The opcode for CMPIW is
CMPIW.L 0100 1110 00 EA | 16bit signed imm

Could you enable them please?

bebbo (Owner) commented Feb 12, 2021

The costs for postincrement are actually the same as without offset. The challenge here is to analyze all branches to fix the insns and ensure the original value isn't used elsewhere.

For new opcodes the binutils assembler needs an enhancement first. Then it's possible to add what the assembler supports.

@GunnarVB

> The costs for postincrement are actually the same as without offset.
> The challenge here is to analyze all branches to fix the insns and
> ensure the original value isn't used elsewhere.

Not sure I fully understand.
For the sake of argument let's say that
(a0) = cost 2
(a0)+ = cost 2
d16(a0) = cost 3

With such cost database values, would GCC not then automatically pick the "lower" cost code?

@GunnarVB

I fully understand that a compiler cannot always create the "perfect" code.

But maybe we can help the compiler at least to create a reasonable compromise?

I think an instruction has several costs:
a) cost in cycles
b) cost in instruction length, which affects the cache hit rate and instruction fetch, as a CPU cannot load an infinite number of bytes from the Icache per cycle.
c) cost of memory access. While a CPU might be able to do a Dcache access at low cost or even for free, there is a risk of a cache miss and then a high cost. So if there are two instructions producing the same result, the option without a memory access is preferable.

Would a cost database like the following make some sense?

COST1 = cycles * 10
COST2 = instruction_length * 2
COST3 = memory_access * 5

TOTAL COST = COST1 + COST2 + COST3

I believe that using an updated cost database for all instructions, similar to the above, would help GCC to create a "reasonable" selection of instructions.

Would it be possible to do this?
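
Purely to illustrate the proposed weighting (a toy sketch, not GCC's actual cost hooks; the weights 10, 2 and 5 and the field names are just the ones suggested above):

/* Toy model of the proposed blend: not GCC code.
 * total = cycles*10 + instruction_length*2 + memory_access*5
 */
struct insn_estimate {
    int cycles;         /* estimated execution cycles          */
    int words;          /* instruction length in 16-bit words  */
    int mem_accesses;   /* number of data memory accesses      */
};

static int insn_cost(const struct insn_estimate *e)
{
    return e->cycles * 10 + e->words * 2 + e->mem_accesses * 5;
}

/* e.g. a 1-cycle, 2-word, 1-access MOVE.L -8(SP),D0 would score
 * 10 + 4 + 5 = 19 under these made-up weights. */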

bebbo (Owner) commented Feb 12, 2021

gcc does not know full insns. It needs to know the cost for half insns but can distinguish between src and dst.

If the value 4 corresponds to 1 cycle you could e.g. define
src register -> 2
dst register -> 2
and a move dx,dy would then yield the cost 4.

Or you can define the cost of an operator with given parameters,
e.g. if the cost for AND is 0 on the 68080, an and yields the same cost as a move.

This is for optimizing for speed. Optimizing for size currently uses the same costs for all insns, but the operand size would be a better criterion - someone could do that^^

To model memory access you'd have to define a state machine with pipelines, as I did for the 68080 floating point ops. Even more work :-)

bebbo (Owner) commented Feb 12, 2021

m68k_68000_10_costs.zip
have fun

bebbo added a commit that referenced this issue Feb 12, 2021
bebbo (Owner) commented Feb 12, 2021

if you use short for counters you get this now:
http://franke.ms/cex/z/9je3qj

        move.b (a3)+,d3
        jeq .L21
        move.w (a2,d3.l*2),(a0)
.L21:
        move.b (a3)+,d3
        jeq .L22
        move.w (a2,d3.l*2),(2,a0)
.L22:
        move.b (a3)+,d3
        jeq .L23
        move.w (a2,d3.l*2),(4,a0)
.L23:
        move.b (a3)+,d3
        jeq .L24
        move.w (a2,d3.l*2),(6,a0)
.L24:
        move.b (a3),d3
        jeq .L25
        move.w (a2,d3.l*2),(8,a0)
.L25:
        lea (10,a0),a0
        dbra d1,.L26
        jra .L56

bebbo added a commit that referenced this issue Feb 12, 2021
bebbo added a commit that referenced this issue Feb 12, 2021
labels in the autoinc path must either be visited before or must not
report the autoinc register as used.
bebbo (Owner) commented Feb 13, 2021

close pending :-)

@GunnarVB

Very nice improvements, thank you and well done!

Regarding the CostTable you sent me:
we want to help you improve it.
What is the best way forward?

bebbo added a commit that referenced this issue Feb 13, 2021
keeps these out of loops, if the clr is not used inside of loop.
bebbo added a commit that referenced this issue Feb 13, 2021
- opt_strcpy is working again
- clr's are removed top to down, to have it less likely in loops
bebbo added a commit that referenced this issue Feb 14, 2021
bebbo (Owner) commented Feb 14, 2021

> Very nice improvements, thank you and well done!
>
> Regarding the CostTable you sent me:
> we want to help you improve it.
> What is the best way forward?

the best would be a working cost function for each cpu ^^

also good: a table which lists the source and dest costs for each operation and size. An example:

and.l d0,d1:
source: AND(d0, d1)
destination: d1

move.w (4,a1),d2
source: MEM(PLUS(a1, 4))
dest: d2

if destination costs are the same, they can be grouped.

costs for shift/rot/mul/div need a formula, but these should be quite ok already.

The cost for 1 cpu cycle is 4.
E.g. if the move.l d0,d1 takes one cycle, it should result in a cost of 4, and maybe the costs could be split up as src cost = 2 and dst cost = 2...
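
As a sketch only (invented numbers, not a real cost table), the kind of src/dst half-cost listing described above could be written down like this, with 4 standing for one cycle split across the two halves:

/* Hypothetical src/dst half-cost entries in the spirit of the format
 * described above. All numbers are invented; 4 = one CPU cycle. */
struct half_costs {
    const char *shape;   /* operation and operand shape           */
    int src_cost;        /* cost charged to the source half       */
    int dst_cost;        /* cost charged to the destination half  */
};

static const struct half_costs example_table[] = {
    { "MOVE.L reg,reg",      2, 2 },  /* one cycle, split 2 + 2       */
    { "AND.L  reg,reg",      2, 2 },  /* AND itself free -> like move */
    { "MOVE.W (d16,An),reg", 6, 2 },  /* MEM(PLUS(An,d16)) source     */
};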

@GunnarVB

Hi Bebbo,

> The cost for 1 cpu cycle is 4

Can we make this more fine-grained?
Can we, for example, have 3 parameters:

  1. Cost of cycle
  2. Cost of instruction word
  3. Cost of memory access

Maybe defined like this:
Ccycle = 9
Cword = 2
Cmemory = 7

With such a formula we get results like:
MOVEQ #1,D0 = 11
MOVE.L #23456,D0 = 15
MOVE.L -8(SP),D0 = 18

Could we do this? Or is it required that Cycle is always = 4?

bebbo (Owner) commented Feb 14, 2021

> The cost for 1 cpu cycle is 4
>
> Can we make this more fine-grained?

you can use as many multiples of 4 as you like. you could also use a base cost of 4, but some locations in gcc still assume that the cost per cycle is 4... I recommend to scale all values by 4.

plus you never provide the cost for a full instruction, only the cost for the source or the destination - and it's hard to split 1 with integers properly...

> Can we, for example, have 3 parameters:
>
> 1. Cost of cycle

the cost of a cycle is fixed and it is 4.

> 2. Cost of instruction word

for -Os it makes sense to provide a size-based cost.

> 3. Cost of memory access

this is done for each source or destination. If it's a MEM(...) then the expression inside is taken and the addressing costs are returned based on the used addressing mode.

> Maybe defined like this:
> Ccycle = 9
> Cword = 2
> Cmemory = 7

memory costs are maybe constant for the 68080, but not for other cpus, right?

> With such a formula we get results like:
> MOVEQ #1,D0 = 11
> MOVE.L #23456,D0 = 15
> MOVE.L -8(SP),D0 = 18
>
> Could we do this? Or is it required that Cycle is always = 4?

as written above - if I rewrite the cost stuff, I'll use the value 4 as the cost for 1 cycle. If an instruction needs 8 cycles on a 68000 it should yield 32.

@GunnarVB

> Cost of instruction word: for -Os it makes sense to provide a size-based cost.

But instruction length always has a cycle cost too.
For the 68000 each extra word costs 4 cycles.
For the 68020/030 each extra word costs 2 cycles.

For the 68040/68080 instruction length has less directly measurable cost.
But the longer the instructions used, the more Icache they consume and the higher the chance of cache misses, which then cost a lot in memory fetches.
Also, each CPU has an Icache fetch/decode limitation:
it is 2 bytes per 2 cycles for the 68020/68030 and 4 bytes per cycle for the 68060.
So the longer the instructions get, the more cost is created, directly and indirectly.

I think it is very important for the compiler to try to use the shortest possible instruction for the job.

bebbo (Owner) commented Feb 14, 2021

I'm talking about the total cost of an instruction. All in. There is no extra cost for fetching or addressing; these costs are included. That's why for MEM some weird calculation is used to determine the total cost:
: (ax), (ax)+
: -(ax)
: (n,ax)
: (n,ax,dy), ||n|| < 8
: (dx)
: (n,ax,dy.w/l*1/2/4/8)
: [(ax)]
: [(dx)]
...

and that's the challenge: to model all-in costs for half instructions (src, dest) so that the result for a full instruction is as close as possible to the real costs.

plus higher CPUs have worst, best and cache scenarios... I'd use values which are close to the expected values - whatever that means...

and for -Os only the instruction size matters.

bebbo closed this as completed Feb 15, 2021
bebbo added a commit that referenced this issue Feb 21, 2021