Question from notes on Apollo core forum #20
In a discussion thread here: http://www.apollo-core.com/knowledge.php?b=2&note=12425
Gunnar makes a comment which I have pasted below. Does this still apply to the current gcc?
Thanks
======Comment============================================
GCC and other C compilers sometimes create relatively bad results. Let's give an example: below is part of the main render function in a NEO-GEO emulator. This code uses 5 instructions per pixel.
The BFEXTU does set the flags correctly. GCC is not aware of this - and puts an extra TST instruction there to create the flags. This instruction is NOT needed and useless.
Then GCC puts an AND.L #$FF,D0 there to limit the result to BYTE range. The value extracted by BFEXTU is already limited to byte range by its width parameter. This means this AND is totally unneeded.
In short: of the 5 instructions, 2 are useless.
That GCC throws a huge number of unneeded TST instructions on the 68K is a known bug in GCC, caused by the laziness of the GCC developers - they simply did not care to implement tracking of the flags created by instructions.
==========================================================
Comments
There are 2 issues.
Maybe the bfextu insn is not modelled correctly to set the flags.
Changing the variable type to int solves the and issue (in gcc6) - and, surprisingly, the tst issue too.
Now I also understand why there is the tst insn.
BUT: it helps if programmers understand how a compiler works.
m68k-amigaos-gcc -S -Os -m68040 -fomit-frame-pointer -mregparm=2 bfext.c
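The bfext.c being compiled is not shown in the thread; a minimal stand-in exercising the same pattern might look like this (all names are invented for illustration):

/* Hypothetical reconstruction of the pattern under discussion: extract
   a byte-wide field, then branch on it. gcc can turn the shift-and-mask
   into a bfextu on the 68020+; per the comments above, with an 8-bit
   variable gcc6 then emits a redundant and.l #255 plus a tst, while
   declaring pix as int avoids both. */
extern void put_pixel(unsigned int c);   /* invented sink function */

void render(const unsigned long *src, unsigned long n)
{
    for (unsigned long i = 0; i < n; i++) {
        unsigned char pix = (src[i] >> 8) & 0xff;
        if (pix)
            put_pixel(pix);
    }
}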
The gcc-6 compiler can yield very good results!
The 68k is not RISC, where the register width is the only datatype width. Even RISC designers have realized this is inefficient, as can be seen in the evolution from Alpha to AArch64, for example. It is not always bad coding practice to use datatype sizes smaller than the register size, unless the compiler has poor support for them. Could GCC support smaller datatypes, and instructions like bfextu, better?
I understand if other 68k compiler improvements are more important, but if GCC can't do a better job then perhaps the problem is as much a bad compiler as bad code. This is not a criticism of you, your work or your opinion. Certainly any improvements are better than none and much appreciated.
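A trivial illustration of the kind of sub-register-width code in question (invented, not from the thread):

/* A saturating pixel add on byte data. Using unsigned char for pixel
   data is natural and memory-efficient; C's promotion rules widen the
   arithmetic to int either way, so the question is only how cheaply
   the compiler gets the values in and out of byte width. */
unsigned char add_sat(unsigned char a, unsigned char b)
{
    unsigned int s = a + b;              /* promoted to int by C rules */
    return s > 255 ? 255 : (unsigned char)s;
}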
I wish this would include your compiler: https://gcc.godbolt.org/#
It would certainly help a lot with optimizations.
On Thu, Feb 8, 2018 at 12:21 PM, bebbo wrote:
The current implementation tracks the CCR as it is. And since the CCR has no support for b/w/l flags, there is nothing to improve in the gcc model. gcc tracks the values plus the CCR. It does not track the unused area of registers.
For sure one could write an optimizer pass which tracks zero flags for the unused register parts and eliminates superfluous instructions based on that information.
And in this case - plus many other cases - it helps a lot if the developer reviews the asm code. The developer makes the difference, not the compiler!
Here is an example comparing two versions of strncmp:
#include <stddef.h>   /* size_t */

int strncmp(const char *s1, const char *s2, size_t n)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    unsigned long r, c;
    if ((r = n))
        do ; while (r = *p1++, c = *p2++, !(r -= c) && (char)c && --n);
    return r;
}
While this looks quite short, it's not efficient:
_strncmp:
        movem.l a2/d3/d2,-(sp)
        move.l 16(sp),a1
        move.l 24(sp),a0
        move.l a0,d0
        jeq .L4
        moveq #0,d1
.L3:
        moveq #0,d0               | zero-extend *p1 into d0
        move.b (a1,d1.l),d0
        move.l 20(sp),a2
        move.b (a2,d1.l),d2
        moveq #0,d3               | zero-extend *p2 into d3 - the
        move.b d2,d3              | char-to-int overhead discussed above
        sub.l d3,d0
        jne .L2
        tst.b d2
        jeq .L2
        addq.l #1,d1
        cmp.l a0,d1
        jne .L3
.L2:
        movem.l (sp)+,d2/d3/a2
        rts
.L4:
        moveq #0,d0
        jra .L2
The code is not optimal, but the issue is: again there is a char-to-int extension. The programmer did not care enough, and since the result is an int, ints are used too early.
#include <stddef.h>   /* size_t */

int strncmp(const char *p1, const char *p2, size_t n)
{
    for (;;) {
        if (n--) {
            char c = *p1++;
            if (c) {
                if (c == *p2++)
                    continue;            /* characters match: next one */
            } else
                ++p2;                    /* end of s1 */
            unsigned char a = *--p1;     /* back up, re-read as unsigned */
            unsigned char b = *--p2;
            return (short)a - b;
        }
        return 0;
    }
}
Here the int conversion is performed only at the return. Also note the nested if statements, which help the compiler to create an optimal loop:
_strncmp:
        move.l 4(sp),a1
        move.l 8(sp),a0
        move.l a1,d1
        add.l 12(sp),d1           | d1 = s1 + n, the end pointer
.L4:
        cmp.l a1,d1
        jeq .L6
        move.b (a1)+,d0
        jeq .L3
        cmp.b (a0)+,d0
        jeq .L4
.L5:
        and.l #255,d0             | the widening happens once, after the loop
        moveq #0,d1
        move.b -1(a0),d1
        sub.l d1,d0
        rts
.L3:
        addq.l #1,a0
        jra .L5
.L6:
        moveq #0,d0
        rts
It would certainly be great if a compiler could also generate optimal assembler from the above code. And it's easy to blame the bad compilers. There are just a lot of bad whiny developers.
PS: expect different results per platform - there is no best C code for all platforms.
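Not part of the original email: a minimal self-contained harness (both versions pasted in as strncmp_v1 and strncmp_v2 so they don't clash with each other or with libc) to sanity-check that the two implementations agree on the sign of the result:

#include <stdio.h>
#include <stddef.h>

static int strncmp_v1(const char *s1, const char *s2, size_t n)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    unsigned long r, c;
    if ((r = n))
        do ; while (r = *p1++, c = *p2++, !(r -= c) && (char)c && --n);
    return r;
}

static int strncmp_v2(const char *p1, const char *p2, size_t n)
{
    for (;;) {
        if (n--) {
            char c = *p1++;
            if (c) {
                if (c == *p2++)
                    continue;
            } else
                ++p2;
            unsigned char a = *--p1;
            unsigned char b = *--p2;
            return (short)a - b;
        }
        return 0;
    }
}

static int sign(int v) { return (v > 0) - (v < 0); }

int main(void)
{
    static const char *tests[][2] = {
        { "apple", "apple" }, { "apple", "apples" },
        { "apples", "apple" }, { "apple", "applf" },
        { "", "" }, { "", "x" },
    };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        int a = strncmp_v1(tests[i][0], tests[i][1], 5);
        int b = strncmp_v2(tests[i][0], tests[i][1], 5);
        printf("%-7s %-7s -> %2d %2d %s\n", tests[i][0], tests[i][1],
               sign(a), sign(b), sign(a) == sign(b) ? "ok" : "MISMATCH");
    }
    return 0;
}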
Surely GCC tracks the condition codes for the current integer datatype size (as opposed to the whole register). The bfextu instruction sets the correct CCR[Z] for the datatype if the bitfield it grabs fits in the datatype. The fact that only the CCR[Z] flag is valid is not much different from the BTST/BCHG/BCLR/BSET instructions, which also only affect the CCR[Z] flag.
I can usually make a compiler generate good code by reviewing the generated code (a good idea) but I shouldn't have to. Wasn't C supposed to optimize for the target CPU, and isn't it good programming practice to avoid hardware-specific optimizations? Should I start reviewing the NetSurf 68k disassembly for places where the source could be rewritten to produce better 68k code?
The mismatched datatype sizes practically call for zero extensions, which is poor coding, but GCC also does a poor job of handling the integer promotion and partial register accesses. Is it because the 68k backend is shared with the ColdFire, which wants to use mvz/mvs instructions?
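To make the promotion point concrete, a contrived example (invented, not from NetSurf):

/* C's integer promotion widens both bytes to int before the subtract,
   so a 68k without mvz needs a moveq #0/move.b pair (or an and.l mask)
   per operand to clear the upper register bits; ColdFire does each in
   one mvz.b. */
int diff_first(const unsigned char *a, const unsigned char *b)
{
    return a[0] - b[0];
}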
If the GCC 68k backend generated reasonably good code for the "poor source", it would give something like this...
The good source and its resulting code are also less efficient because of integer promotion issues. Better would be the following...
I did not extend the return value in d0 at the end of the function. This allows for more optimal code, but the question is whether the AT&T System V ABI for the 68k allows it. Most ABIs I have looked at do not specify what happens to the unused portion of the return value register when a datatype smaller than a register is used (one exception is the PPC ABI, which defines the unused bits as trash). I could not find a copy of the 68k System V ABI anywhere, so I do not know for certain. Some other compilers are also less efficient because they sign/zero extend the return value before returning from the function and then sign/zero extend it again after the call returns. I believe it would be more efficient to sign/zero extend after the function returns (on a read from the stack), especially on a 68k without the MVS/MVZ instructions of the ColdFire.
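To make the ABI question concrete, a sketch (invented functions; which side performs the extension is exactly what the ABI would have to pin down):

/* last_byte() produces a char-sized result in d0. If the callee must
   extend, it pays an and.l #255 on every return - even for callers
   that only test the low byte. If the upper bits are trash (as in the
   PPC ABI mentioned above), each caller that widens the value must do
   the extension itself, as below. */
unsigned char last_byte(unsigned long x)
{
    return (unsigned char)x;
}

unsigned long widened(unsigned long x)
{
    return (unsigned long)last_byte(x) + 1;   /* caller-side widening */
}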
PPC was supposed to replace CISC with its great compiler support, moving all the CPU complexity into the compiler. At the same time Intel was optimizing their CPUs for bad code. Maybe compilers can't handle bad source code. I'm sure there is no end of whiny developers writing bad code on other platforms. The few Amiga programmers left spend most of their time porting that bad code from other platforms. Oh well, there was little hope of a 68k revival without new mass-produced 68k CPUs anyway. The only people trying to design 68k CPUs do not care about making compiler design easier.
Right, the compiler could do a bit better. And you can't omit the extension from byte to long, since the result type of this function is int.
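In other words (a sketch):

/* The byte must be widened to a full int at the return - required by
   the declared return type, not a compiler whim. */
int first_byte(const unsigned char *p)
{
    unsigned char c = *p;
    return c;
}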
I understand the difficulties of improving the 68k backend. It could take years for a good programmer and many testers to bring it up to the quality of the x86/x86_64 backend.
Doh! You are correct, and what strncmp() returns is fixed by the standard and can't be changed. Your better source code was strange with the (short) typecast, which called for an early integer promotion. Wouldn't it be better to have written...
An 8-bit subtraction would be better for the 68000-68040 (the 68060 is a wash without mvs/mvz instructions, even though it prefers 32-bit int operations, and ColdFire requires integer promotion since it lacks .b and .w operations). The resulting 68k code would still need a big fat and.l #$ff,d0 before the rts, lacking a ColdFire mvz.b d0,d0. How much easier do you think it would be to write a good 68k backend if a modern 68k CPU had the mvs/mvz instructions?
Well, the cast can be omitted - it doesn't matter whether int or short is used, it's superfluous. Zero-extend move insns should be feasible, since ColdFire has those. Guess you simply have to define the cpu and set this config bit for the new cpu...
Guess it will work:
I have mentioned before how easy it should be to turn on existing ColdFire instructions in backends with shared 68k/CF support (common). The retro 68k FPGA CPU designers don't want anything new, and the Apollo Core designer would rather add shiny hypothetical go-faster bling. The AC68080 supported CF mvs/mvz instructions (as encoded in CF) at one time, but they were removed. If you want to support the AC68080, you supposedly get integer unit Bn (global base relative) registers and En (extended shared SIMD unit) registers, which are non-orthogonal with An and Dn registers and may require more instructions and give worse code density if used. A 64-bit 68k CPU really should have movs.l and movz.l instructions as well, and they do take considerable encoding space even though most modern CPUs have them (the x86_64 versions are fat because of lack of encoding space, but the 68k has more free encoding space). Anyway, I do not represent, support or endorse the Apollo Core, although I was at one time part of the "team"; our philosophies differed.
The 68060 would likely have benefited significantly from the CF mvs/mvz instructions, as it only has a 32-bit instruction fetch and prefers 32-bit operations for result forwarding. My analysis of existing 68k code only showed a 0%-4% code size reduction from using these instructions, but that was only from finding code sequences which could be replaced by the CF instructions, and often the code analyzed was compiled for older 68k processors. I would be interested in hearing how much of a code reduction (compiled benchmarks and/or popular programs) comes from compiling for the 68060 but with mvs/mvz support added.
I have been researching code density, of which the 68k is one of the best. I have also been playing with more sane, dense and conservative 68k ISAs called 68k_32 and 68k_64. One of my focuses is providing easier and better compiler support, which I view as necessary to leverage the 68k code density advantage. If it is that easy to turn on mvs/mvz support, then some statistics of the results would be interesting.
Well, I'll wait and see if there are new insns in the 68080.
Hello Stefan - Had a quick discussion on the Apollo IRC channel. The Apollo 68080 core supports mvs/mvz, but in a backwards-compatible way. It recognizes combinations of instructions and "fuses" them to the internal equivalents of mvs/mvz. These execute in 0.5 clocks. For MVS:
        MOVE.B (d16,An),Dn
        EXTB.L Dn

        MOVE.W (d16,An),Dn
        EXT.L Dn
Similar idea for MVZ: MOVEQ + MOVE. Examples of some (but not all) of the instruction fusing logic are here: http://apollo-accelerators.com/wiki/doku.php/cpu_fuse?s[]=fusing
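For illustration, C source that should produce the fusible sequences on a plain 68040 (my guess, not from the thread):

/* gcc typically loads and sign-extends a byte with a move.b/extb.l
   pair - the combination described above as fusing into an internal
   mvs. */
long get_signed(const signed char *p)
{
    return *p;
}

/* The unsigned case gives the moveq #0/move.b pair that fuses into an
   internal mvz. */
unsigned long get_unsigned(const unsigned char *p)
{
    return *p;
}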
The Apollo Core does not support mvs/mvz compatible instructions at either the binary encoding or the mnemonic level. It appears bebbo can easily have these CF instructions generated in Amiga executables, but the Apollo Core would no longer recognize them (my understanding is that it did, not long ago). The Apollo Core claims to recognize and fold/fuse the combinations listed above.
The MVZ register-to-register form is also useful.
It would have been interesting to see how much of a size difference the MVS/MVZ instructions would have made. The performance gains from code density improvements are usually underestimated, and the gains from adding registers (especially non-orthogonal registers like the Apollo Core's) are usually overestimated. Not only do the non-orthogonal registers give no gains, because they go unused (they would not be easy to support in compilers), but the percentage performance gain from more than 16 general-purpose integer registers is likely in the low single digits - possibly less if code density deteriorates from using the new registers. Hardware engineers know best how to improve hypothetical performance!