LOOP construct good with Os but not optimal with -O2 -O3 -OFast #217

Closed
GunnarVB opened this issue Jan 8, 2024 · 3 comments


GunnarVB commented Jan 8, 2024

Hello Bebbo,
I hope you are OK.

Maybe this was reported before.

I found that the basic loop construct compiles to very good code with -Os
but to suboptimal code with any other -O mode.

C Example:

void memclr (short length, char * ptr)
{
 for(;length--;){
   *ptr++= 0;
 }
}

Compiled with -mregparm=2 -Os:

_memclr:
        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3
        rts

Good result!
4 instructions total.
A BRA into the DBRA: this is both short and fast.
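
Why jumping straight to the DBRA gives the right trip count can be sketched in C. This is only a hypothetical model of the -Os listing above, not compiler output: the names `dbra_memclr_model` and `check_dbra_model` are invented for illustration, and the model assumes the usual DBRA behaviour (decrement the low word of the counter, branch back unless it became -1).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the -Os code: execution enters at the DBRA,
   which decrements d0.w and loops back unless the counter became -1.
   Returns the number of bytes cleared. */
unsigned dbra_memclr_model(uint16_t d0, char *a0)
{
    unsigned iterations = 0;
    for (;;) {                      /* jra .L2: start at the DBRA */
        d0 = (uint16_t)(d0 - 1);    /* dbra: d0.w -= 1 */
        if (d0 == 0xFFFF)           /* counter is -1: fall through to rts */
            break;
        *a0++ = 0;                  /* .L3: clr.b (a0)+ */
        iterations++;
    }
    return iterations;
}

/* Self-check: the model clears exactly `length` bytes, including length 0. */
void check_dbra_model(void)
{
    char buf[8];
    memset(buf, 'x', sizeof buf);
    assert(dbra_memclr_model(0, buf) == 0 && buf[0] == 'x');
    assert(dbra_memclr_model(5, buf) == 5);
    assert(buf[4] == 0 && buf[5] == 'x');
}
```

Because the first thing executed is the decrement-and-test, a length of 0 falls straight through and no separate guard is needed in front of the loop.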

The result is not optimal when compiled with -O2, -O3 or -Ofast:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5
.L1:
        rts

7 instructions total.
A 4-instruction header instead of 1 BRA.
This result is not good.
The BRA to the DBRA was much better.

Hello Bebbo,
do you know a way to enable the BRA to the DBRA in all -O options?
This would be very good for code size and for performance on all 68k family members!

Many thanks in advance

regards Gunnar

bebbo transferred this issue from bebbo/amiga-gcc Jan 8, 2024

bebbo (Owner) commented Jan 8, 2024

You might consider using -fno-tree-ch to avoid the duplication of loop conditions.

See http://franke.ms/cex/z/W9s3rb
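
The "duplication of loop conditions" can be illustrated at source level. The sketch below is a hand-written approximation of what loop header copying (the pass that -fno-tree-ch disables) does to the loop from the issue; it is not GCC's actual intermediate code, and the names `memclr_plain`, `memclr_rotated` and `check_loops_agree` are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* The loop as written in the issue: the condition is tested in one place. */
void memclr_plain(short length, char *ptr)
{
    for (; length--; ) {
        *ptr++ = 0;
    }
}

/* Approximate shape after loop header copying: the condition appears
   twice, once as a guard in front and once at the bottom of the loop. */
void memclr_rotated(short length, char *ptr)
{
    if (length != 0) {              /* copied header: tst.w / jeq */
        do {
            *ptr++ = 0;
        } while (--length != 0);    /* bottom test: dbra */
    }
}

/* Self-check: both forms clear the same bytes for small non-negative lengths. */
void check_loops_agree(void)
{
    for (short n = 0; n <= 16; n++) {
        char a[17], b[17];
        memset(a, 'x', sizeof a);
        memset(b, 'x', sizeof b);
        memclr_plain(n, a);
        memclr_rotated(n, b);
        assert(memcmp(a, b, sizeof a) == 0);
    }
}
```

The guard in `memclr_rotated` corresponds to the extra header instructions in the -O2 listing; without header copying the compiler keeps the single test and can enter the loop at it.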

bebbo added the won't fix label Jan 8, 2024

GunnarVB (Author) commented Jan 8, 2024

-Os creates:

        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3

This is good.
The loop is only 2 instructions.

The BRA is unconditional, so it has nothing to predict.
Branch prediction is not needed here, and this is optimally fast and small.

The -O2 version:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5

This needs 5 instructions instead of 2 for the loop!
The BEQ is conditional and needs to be predicted.
This can cause a misprediction.
This code is really not optimal.

Using your proposed extra flag:

_memclr:
        dbra d0,.L3
.L6:
        rts
.L3:
        clr.b (a0)+
        dbra d0,.L3
        jra .L6

We have 3 instructions instead of 2.
This makes the code 10 bytes instead of 6 bytes.
By default, a 68k would predict the first DBRA as a taken backward branch.
But here it jumps forward, so this is not optimal.
Yes, this option is less bad than the current -O2.
But it is really not as good as -Os.

Could the -Os approach always be used?

GunnarVB (Author) commented Jan 8, 2024

move.w d0,d1
subq.w #1,d1
tst.w d0
jeq .L1


Next question: why is there a TST instruction in this code?
The TST is not needed. Could it be done like this?

move.w d0,d1
jeq .L1
subq.w #1,d1
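
The reasoning behind this reordering can be checked with a tiny model. The sketch below is a hypothetical C model that tracks only two data registers and the Z flag; it assumes MOVE and SUBQ both set Z from their result, which is why the original order needs the TST to restore the flag for d0. All names (`struct cpu`, `header_o2`, `header_proposed`) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal CPU model: two data registers and the Z flag only. */
struct cpu { uint16_t d0, d1; int z; };

void move_w_d0_d1(struct cpu *c) { c->d1 = c->d0; c->z = (c->d1 == 0); }
void subq_w_1_d1(struct cpu *c)  { c->d1 = (uint16_t)(c->d1 - 1); c->z = (c->d1 == 0); }
void tst_w_d0(struct cpu *c)     { c->z = (c->d0 == 0); }

/* -O2 header: move, subq, tst, jeq. Returns 1 if jeq .L1 is taken. */
int header_o2(struct cpu *c)
{
    move_w_d0_d1(c);
    subq_w_1_d1(c);     /* overwrites Z with the state of d1 */
    tst_w_d0(c);        /* so Z has to be recomputed from d0 */
    return c->z;
}

/* Proposed header: move, jeq, subq. Returns 1 if jeq .L1 is taken. */
int header_proposed(struct cpu *c)
{
    move_w_d0_d1(c);    /* Z already reflects d0 here */
    if (c->z)
        return 1;       /* jeq .L1: loop skipped, d1 is not used afterwards */
    subq_w_1_d1(c);
    return 0;
}

/* Self-check: both headers branch identically and agree on d1 when the loop runs. */
void check_headers_agree(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        struct cpu a = { (uint16_t)v, 0, 0 };
        struct cpu b = { (uint16_t)v, 0, 0 };
        int ta = header_o2(&a);
        int tb = header_proposed(&b);
        assert(ta == tb);
        if (!ta)
            assert(a.d1 == b.d1);
    }
}
```

In this model the two headers differ only in the value left in d1 when the loop is skipped, and d1 is not read on that path.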

bebbo added a commit that referenced this issue Jan 8, 2024
    also set -freorder-blocks-algorithm=simple as default.
bebbo closed this as completed Jan 12, 2024