Improving asm generation #140
Comments
can you provide the C code snippet? Maybe even on https://franke.ms/cex/ ? |
Looks good. Did you do this, or does -m68040 do that? |
I changed nothing yet |
I have gcc6 version 6.5.0b 200815191133 (GCC) |
So g++ is generating really bad code compared to gcc. |
I don't see a difference between C and C++: |
build was stalled... running a new Windows build atm. |
Nice. |
This version is even slower: 54 fps vs 57 fps -_- https://www.textcompare.org/index.html?id=60203858c7a2ab00178f6c3c Maybe the difference comes from the RetroEngine.hpp header? LTO doesn't work: Maybe it just needs to be set somewhere? |
do you have an "easy" compilable version of this project where I could measure stuff myself? |
Looking at the difference I'd guess you are using some |
Do you have a Vampire4 ? |
I'm not using -fbbb=- |
uhm - have to check that for windows... |
I used WSL but I lost all files there... |
You can use the Linux Eclipse in Windows, simply use e.g. VcXSrv as the X Window server - https://sourceforge.net/projects/vcxsrv/files/vcxsrv/ ... but that's not related to this issue. Back to the topic: The code shown in CompilerExplorer looks better because (luckily) the registers' high bytes are zero and the optimizer omits the |
I managed to reproduce the issue on cex: http://franke.ms/cex/z/rW3n66 |
progress: https://franke.ms/cex/z/8xecqW |
very nice, that was it! |
before each assignment of a byte or word from mem into a previously unused register, clear that register. This helps the existing optimizations to eliminate 'and' for unsigned extends. The 'clr' insns without effect are also removed. This leads to better code: http://franke.ms/cex/z/8xecqW Also fixed a bug in the usage analyzer.
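
The effect described in this commit can be modelled in plain C. This is only an illustration, not the compiler's implementation: `move_b` is a hypothetical helper that mimics a 68k `move.b` into a data register (only the low byte changes), and the two functions compare "load, then mask with `and`" against "clear first, then load".

```c
#include <stdint.h>

/* Hypothetical model of a 68k move.b into a data register:
   only the low 8 bits are replaced, the upper 24 bits are kept. */
static uint32_t move_b(uint32_t reg, uint8_t mem)
{
    return (reg & 0xFFFFFF00u) | mem;
}

/* Unsigned extend with a stale register: load the byte, then mask
   (corresponds to move.b followed by and.l #255). */
uint32_t extend_with_and(uint32_t stale_reg, uint8_t mem)
{
    uint32_t d0 = move_b(stale_reg, mem);
    return d0 & 0xFFu;
}

/* Register cleared before the load: the upper bits are known to be
   zero afterwards, so the mask is redundant and can be dropped. */
uint32_t extend_with_clr(uint32_t stale_reg, uint8_t mem)
{
    (void)stale_reg;            /* old contents are discarded by the clear */
    uint32_t d0 = 0;            /* models moveq #0,d0 */
    return move_b(d0, mem);
}
```

Both functions return the same value for any input, which is why an optimizer that knows the register was cleared can eliminate the `and`.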
There is nothing wrong with using a DATA register as INDEX, but for LOOP counts it's generally best to use a DATA REGISTER |
this code makes no sense for higher CPUs |
I'd guess: all data registers are used inside the loop... |
also lets look at this:
What do these 3 lines do? This ASM code, as it is, is silly about the LOOP. Look here:
The compiler has D4 free and uses it as a TMP register in the loop. The compiler should have done |
d4 is used as scratch inside the loop: jeq .L23
move.w (a2,d4.l*2),(4,a1) or what code are you referring to? The other issue was presumably created during jump optimization, where the expression is duplicated from above and d2 with some value inside. At the copied location d2 is set to zero. |
Hi Stefan
Can you make us a link to the online tool again so that we can see the program?
So that we can all share the same code and discuss?
Cheers
Gunnar
Quoting bebbo <notifications@github.com>:
>
> Hallo Bebbo,
>
> ```
>         moveq #99,d3
> .L6:
>         move.l _xScrollOffset,(a0)+
>         dbra d3,.L6
>         clr.w d3
>         subq.l #1,d3
>         jcc .L6
> ```
>
> This looks nice, but would GCC be able to "SEE" if a constant fits
> in a WORD (below 65536)
> and then just use the DBRA without the outer loop?
you could use a `short` variable instead of an `int` --> you'd get
the pure `dbra`
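
A minimal C sketch of this suggestion follows. The function names are made up for illustration, and whether the compiler really emits a bare `dbra` for the `short` version depends on the compiler version and flags, so check the generated assembly rather than relying on it.

```c
#include <stdint.h>

/* 32-bit counter (int is 32 bits on this target by default): the
   compiler may wrap the dbra in an outer clr.w / subq.l / jcc
   sequence, as in the listing quoted above. */
void fill_int(int32_t *dst, int32_t value, int count)
{
    for (int i = count - 1; i >= 0; --i)
        *dst++ = value;
}

/* 16-bit counter: the whole count fits in the word that dbra
   decrements, so no outer loop is needed. */
void fill_short(int32_t *dst, int32_t value, short count)
{
    for (short i = (short)(count - 1); i >= 0; --i)
        *dst++ = value;
}
```

Both functions write `count` elements; only the type of the loop counter differs.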
|
I noticed something else which does not look right: C-Code: Created ASM: I think this ASM is not as good as the "literal" translation of the C-Code would be. The code that GCC created is 4 bytes longer, 1 instruction longer, and several clocks slower. |
Hi Stefan, please ignore the above post. The correct report is this: The created code looks like this: .L24: GCC prefers to use (16bitoffset,Ptr) instead of using (PTR)++. The (AN)++ address mode does not increase code size. |
|
Hi Bebbo, I have two more questions: 1st) Can we help you in some way?
|
The costs for postincrement are actually the same as without offset. The challenge here is to analyze all branches to fix the insns and ensure the original value isn't used elsewhere. For new opcodes the binutils assembler needs an enhancement first. Then it's possible to add what the assembler supports. |
Not sure I fully understand. With such cost database values, would then GCC not automatically pick "lower" cost code? |
I fully understand that a compiler can not always create the "perfect" code. But maybe we can help the compiler at least to create a reasonable compromise code? I think an instruction has several costs. Would a cost database like the following make some sense? COST1 = cycles10 TOTAL COST = COST1+COST2+COST3 I believe that using an updated cost database for all instructions similar to the above would help GCC to create a "reasonable" selection of instruction. Would it be possible to do this? |
gcc does not know full insns. It needs to know the cost for half insns but can distinguish between src and dst. If the value 4 corresponds to 1 cycle you could e.g. define or you can define the cost of an operator with given parameters. This is for optimizing for speed. Optimizing for size currently uses the same costs for all insns, but the operand size would be a better criterion - someone could do that^^ To model memory access you'd have to define a state machine with pipelines, as I did for the 68080 floating point ops. Even more work :-) |
m68k_68000_10_costs.zip |
if you use
|
labels in the autoinc path must either be visited before or must not report the autoinc register as used.
close pending :-) |
Very nice improvements, thank you and well done! Regarding the CostTable you sent me. |
keeps these out of loops, if the clr is not used inside of loop.
- opt_strcpy is working again - clr's are removed top to down, to have it less likely in loops
the best would be a working cost function for each cpu ^^ also good: a table which lists the source and dest costs for each operation and size. An example:
if destination costs are the same, they can be grouped. costs for shift/rot/mul/div need a formula, but these should be quite ok already. The cost for 1 cpu cycle is 4. |
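
Such a table could be sketched in C roughly as follows. Everything here is hypothetical: the enums, the struct and the function name are invented for illustration, and the numbers are placeholders rather than measured timings. Only two points come from the discussion: one cycle costs 4, and postincrement costs the same as plain indirect.

```c
enum addr_mode { MODE_REG, MODE_IND, MODE_POSTINC, MODE_DISP16, MODE_COUNT };
enum op_size   { SIZE_BYTE, SIZE_WORD, SIZE_LONG, SIZE_COUNT };

/* Cost of the operand when used as source and when used as destination. */
struct half_cost { int src; int dst; };

/* Placeholder values, already scaled so that 1 cycle == 4. */
static const struct half_cost move_costs[MODE_COUNT][SIZE_COUNT] = {
    /* byte      word      long  */
    { { 4,  4}, { 4,  4}, { 4,  4} },   /* Dn       */
    { { 8,  8}, { 8,  8}, {12, 12} },   /* (An)     */
    { { 8,  8}, { 8,  8}, {12, 12} },   /* (An)+    */
    { {12, 12}, {12, 12}, {16, 16} },   /* (d16,An) */
};

/* The cost of a full move is assembled from its two halves. */
int move_cost(enum addr_mode src, enum addr_mode dst, enum op_size size)
{
    return move_costs[src][size].src + move_costs[dst][size].dst;
}
```

With these placeholder numbers, `move_cost(MODE_POSTINC, MODE_REG, SIZE_WORD)` and `move_cost(MODE_IND, MODE_REG, SIZE_WORD)` are equal, while the `(d16,An)` form is more expensive, so a cost-driven choice would not prefer the offset form.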
Hi Bebbo
Can we make this more fine-grained?
Maybe define it like this; with such a formula we get such a result. Could we do this? Or is it required that a cycle is always = 4? |
you can use as many multiples of 4 as you like. you could also use a base cost other than 4, but some locations in gcc still assume that the cost per cycle is 4... I recommend scaling all values by 4. plus you never provide the cost for a full instruction, only the cost for the source or the destination - and it's hard to split 1 with integers properly...
the cost of cycle is fixed and it is 4.
for -Os it makes sense to provide a size base cost.
this is done for each source or destination. If it's a MEM(...) then the expression inside is taken and the addressing costs are returned based on the used addressing mode.
memory costs are maybe constant for the 68080, but not for other cpus, right?
as written above - if I rewrite the cost stuff, I'll use the value 4 as cost for 1 cycle. If the instruction needs 8 cycles on a 68000 it should yield 32. |
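
The scaling rule can be written down as a small C sketch. The macro and function names are hypothetical; the factor 4 is the one stated in this thread, and it matches GCC's `COSTS_N_INSNS(N)` macro, which expands to `(N) * 4`. How the cycles of a real instruction are divided between source and destination is an assumption here.

```c
/* Cost units per CPU cycle, as stated in the discussion. */
#define COST_PER_CYCLE 4

/* Convert a cycle count into gcc cost units. */
int cost_from_cycles(int cycles)
{
    return cycles * COST_PER_CYCLE;
}

/* gcc only ever asks for half instructions, so the total is the
   sum of the source half and the destination half. */
int full_insn_cost(int src_cycles, int dst_cycles)
{
    return cost_from_cycles(src_cycles) + cost_from_cycles(dst_cycles);
}
```

An instruction that needs 8 cycles gives `cost_from_cycles(8) == 32`. If those 8 cycles are split as 4 for the source and 4 for the destination, `full_insn_cost(4, 4)` gives the same 32. Scaling by 4 also leaves room to split a single cycle between the two halves without fractions.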
But instruction length always has a cycle cost as well. For 68040/68080, instruction length has a less directly measurable cost. I think it's very important for the compiler to try to use the shortest possible instruction for the job. |
I'm talking about the total cost of an instruction. all in. whatever. there is no extra cost for fetching or addressing, these costs are included. That's why for MEM some weird calculation is used to determine the total cost: and that's the challenge: to model all-in costs for half instructions (src, dest) so that the result for a full instruction is as close as possible to the real costs. plus higher CPUs have worst, best and cache scenarios... I'd use values which are close to the expected values - whatever this means... and for -Os only the instruction size matters. |
Hi, bebbo.
We are working on a port of the Sonic game to the Amiga and we noticed that the generated asm code is really bad...
By simple tweaking asm we got 10fps more!
Here is comparison between original asm generated with O2, m68080 and fbbb with our little tweak:
https://www.textcompare.org/index.html?id=601fb8ead316300017a2c11f
As you can see and.l is responsible here for major CPU slowdown.
Do you think this could be tweaked?