LOOP construct good with Os but not optimal with -O2 -O3 -OFast #217

GunnarVB · 2024-01-08T12:31:14Z

Hello Bebbo,
I hope you are OK.

Maybe this was reported before.

I found that the basic loop construct looks very good when compiled with -Os
but is not optimal when compiled with any other -O mode.

C Example:

void memclr (short length, char * ptr)
{
 for(;length--;){
   *ptr++= 0;
 }
}

Compiled with -mregparm=2 -Os:

_memclr:
        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3
        rts

Good result!
4 instructions total.
A BRA to the DBRA: this is both short and fast.

The result is not optimal when compiled with -O2, -O3 or -Ofast:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5
.L1:
        rts

7 instructions total (counting the rts).
A 4-instruction header instead of 1 BRA.
This result is not good.
The BRA to the DBRA was much better.

Hello Bebbo,
do you know a way to enable the BRA to the DBRA in all -O options?
This would be very good for code size and for performance on all 68K members!

Many thanks in advance

regards Gunnar


bebbo · 2024-01-08T13:23:18Z

You might consider using -fno-tree-ch to avoid the duplication of loop conditions.

see http://franke.ms/cex/z/W9s3rb
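
For context: GCC documents -ftree-ch as loop header copying, which is where the duplicated loop condition comes from. The C sketch below does the same thing by hand. The function names memclr_plain and memclr_header_copied are hypothetical, and the second function is only a hand-written approximation of what the pass does, not actual compiler output.

```c
/* Original loop shape: a single test at the top of every iteration. */
void memclr_plain(short length, char *ptr)
{
    while (length--) {
        *ptr++ = 0;
    }
}

/* Hand-written approximation of loop header copying (assumption: this is
 * roughly the shape -ftree-ch aims for). The test now exists twice: once
 * as a guard in front of the loop and once at the bottom of the loop. */
void memclr_header_copied(short length, char *ptr)
{
    if (length--) {             /* copied header: guards loop entry */
        do {
            *ptr++ = 0;
        } while (length--);     /* original test, moved to the bottom */
    }
}
```

Both functions clear the same bytes. The guard in the second one corresponds to the extra header instructions in front of the loop in the -O2 listing.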

GunnarVB · 2024-01-08T13:43:48Z

-Os creates:

        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3

This is good.
The loop is only 2 instructions.

The BRA has nothing to predict.
Branch prediction is not needed here, so this is optimally fast and small.

The -O2 version:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5

This needs 5 instructions instead of 2 for the loop!
The beq is "unsure" and needs to be predicted.
This can cause a misprediction.
This code is really not optimal.

Using your proposed extra flag (-fno-tree-ch):

_memclr:
        dbra d0,.L3
.L6:
        rts
.L3:
        clr.b (a0)+
        dbra d0,.L3
        jra .L6

We have 3 instructions instead of 2.
This makes 10 bytes instead of 6 bytes of code size.
The first DBRA would be predicted by default on 68k as a taken backward branch.
But here the taken path is forward ... this is not optimal.
Yes, this option is less bad than the current -O2.
But it is really not as good as -Os.

Could the -Os approach always be used?
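
To compare the three listings above, here is a small C model of them. It is only a sketch: the helper dbra and the three count_* functions are hypothetical names, and the model counts loop-body executions (one per clr.b) without modelling timing or branch prediction.

```c
#include <stdint.h>

/* Model of "dbra Dn,label": decrement the low word of Dn and
 * branch unless the result is -1 (0xFFFF). */
static int dbra(uint16_t *dn)
{
    *dn = (uint16_t)(*dn - 1);
    return *dn != 0xFFFF;
}

/* -Os shape: jump straight to the DBRA at the bottom of the loop. */
unsigned count_os(uint16_t d0)
{
    unsigned n = 0;
    goto test;                  /* jra .L2 */
body:
    n++;                        /* clr.b (a0)+ */
test:
    if (dbra(&d0)) goto body;   /* dbra d0,.L3 */
    return n;
}

/* -O2 shape: copy the counter, pre-decrement it, test the original. */
unsigned count_o2(uint16_t d0)
{
    unsigned n = 0;
    uint16_t d1 = d0;           /* move.w d0,d1 */
    d1 = (uint16_t)(d1 - 1);    /* subq.w #1,d1 */
    if (d0 == 0) return n;      /* tst.w d0 ; jeq .L1 */
    do {
        n++;                    /* clr.b (a0)+ */
    } while (dbra(&d1));        /* dbra d1,.L5 */
    return n;
}

/* -fno-tree-ch shape: a leading DBRA acts as the guard. */
unsigned count_no_tree_ch(uint16_t d0)
{
    unsigned n = 0;
    if (!dbra(&d0)) return n;   /* first dbra falls through to rts */
    do {
        n++;                    /* clr.b (a0)+ */
    } while (dbra(&d0));        /* dbra d0,.L3 */
    return n;
}
```

In this model all three shapes run the body the same number of times for every 16-bit count, so the difference is only in code size and in how the entry branch behaves.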

GunnarVB · 2024-01-08T13:49:01Z

move.w d0,d1
subq.w #1,d1
tst.w d0
jeq .L1

Next question: why is there a TST instruction in this code?
The TST is not needed. Could it be done like this?

move.w d0,d1
jeq .L1
subq.w #1,d1
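
The reasoning is that MOVE already sets the Z flag from the value it moves, so the branch can follow it directly, as long as it comes before the SUBQ, which sets Z again from its own result. A minimal C model of the two headers, as a sketch only: the Regs struct and all function names are hypothetical, and only the Z flag is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tiny register model: two data registers and the Z flag only. */
typedef struct { uint16_t d0, d1; bool z; } Regs;

static void move_w_d0_d1(Regs *r) { r->d1 = r->d0; r->z = (r->d1 == 0); }
static void tst_w_d0(Regs *r)     { r->z = (r->d0 == 0); }
static void subq_w_1_d1(Regs *r)  { r->d1 = (uint16_t)(r->d1 - 1); r->z = (r->d1 == 0); }

/* Current -O2 header. Returns true when "jeq .L1" would be taken. */
bool header_o2(uint16_t len, uint16_t *d1_out)
{
    Regs r = { len, 0, false };
    move_w_d0_d1(&r);
    subq_w_1_d1(&r);
    tst_w_d0(&r);               /* re-creates Z after subq overwrote it */
    *d1_out = r.d1;
    return r.z;
}

/* Proposed header: branch on the Z flag left by the move. */
bool header_proposed(uint16_t len, uint16_t *d1_out)
{
    Regs r = { len, 0, false };
    move_w_d0_d1(&r);
    if (r.z) {                  /* jeq .L1 taken before the subq runs */
        *d1_out = r.d1;
        return true;
    }
    subq_w_1_d1(&r);
    *d1_out = r.d1;
    return false;
}
```

In this model both headers skip the loop for exactly the same inputs and enter it with the same d1. They differ only in the value left in d1 when the loop is skipped, and d1 is not used after .L1.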


bebbo transferred this issue from bebbo/amiga-gcc Jan 8, 2024

bebbo added the won't fix label Jan 8, 2024

bebbo added a commit that referenced this issue Jan 8, 2024

refs #217: do not enable -ftree-ch by default

9505524

also set -freorder-blocks-algorithm=simple as default.

bebbo closed this as completed Jan 12, 2024
