xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

rojer · 2024-04-13T01:08:49Z

We are migrating from IDF 4.4.4 to 5.2.1 and among the many changes is the toolchgain update from GCC 8.4.0 (xtensa-esp32-elf-gcc8_4_0-esp-2021r2-patch5) to 13.2.0 (xtensa-esp-elf-13.2.0_20230928).

Unfortunately, it seems that it comes with an across the board regression in the output binary size - many functions gain size, resulting in overall binary size increase, most critically we are bumping into IRAM size limits on some of our apps.

By comparing code generated with the different toolchains, i identified the following problems:

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.
the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.
sometimes this code is duplicated where previous result could be reused.

On the other hand, there are positives too:

inlined loop-based memset: size-neutral and eliminates a call. neat!
better register allocation: using a10 avoids an extra move when it's used again as an argument to a function call.

This was after just a quick look at one particular function: spi_flash_chip_winbond_page_program, it gained 10 bytes. 4 bytes of those are accounted for by initialization of .flags, but even without that the function comes out 2 bytes bigger. Here are the notes from my analysis.

The text was updated successfully, but these errors were encountered:

rojer · 2024-04-13T19:57:38Z

ok, some of the movi /movi.n difference can be attributed to jump address alignment and the denisty can be improved with -mno-target-align. this still leaves one movi where movi.ncould be used - in the struct initialzation block.
however, size reduction is significant: 8K in our case.

espressif/crosstool-NG#52

Update test_tasks.cpp Workaround strange std::chrono::steady_clock behaviours going back to millis() updating tests Compiler flags tweaks for size opt espressif/crosstool-NG#52 small changes more constexpr Update settings.json Removed -mno-target-align

Lapshin · 2024-06-10T18:34:11Z

@rojer , sorry for the long delay.

Recently binary size was fixed with linker script changes:

Could you please check your project with these changes?

rojer · 2024-06-10T19:29:42Z

well, those seem unrelated to the issues i reported here that are about assembly code generation.
that said - thank you for letting me know, i will see if i can test with these changes soon.

Lapshin · 2024-06-11T02:44:22Z

@rojer , sure, these changes are unrelated but that also reduces binary size significantly after the toolchain upgrade. I didn't have time to take a look at your points yet, will do this a bit later.

@jcmvbkbc , maybe you can answer them quickly?

jcmvbkbc · 2024-06-11T08:12:07Z

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

jjsuwa-sys3175 · 2024-06-11T08:28:43Z

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

I agree so.

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

and 4 bytes of litpool entry :)

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

Maybe it would be better to have an option to control whether or not constant synthesis is performed individually.

rojer · 2024-06-11T08:33:16Z

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

jjsuwa-sys3175 · 2024-06-11T08:51:36Z

This may not be an excuse, but for the ESP8266 Arduino core, these patches including the constant synthesis work well in terms of reducing size and clock cycles.

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

As it happens, I have a WIP patch that does exactly what you describe :)

The reason why that patch has been left in WIP for so long is that gcc's built-in litpool mechanism provides duplicate elimination but does not provide recording the number of duplicates and referencing them later. So, I have to create them myself, which is so ugly :-(

rojer · 2024-06-11T21:20:59Z

@Lapshin

Could you please check your project with these changes?

we are using 5.2.1, and these were kind of difficult to apply after the big refactoring in espressif/esp-idf@40be44f but i tried to follow the spirit if not the letter

espressif/esp-idf@9375348

discarding eh_frame if exceptions are not enabled - nice, significant reduction of ~10K

espressif/esp-idf@2b36636

couldn't apply this to app, no visible effect on bootloader

espressif/esp-idf@afffbd9

~500 byte saved

Lapshin · 2024-06-12T08:24:41Z

we are using 5.2.1

@rojer , you may consider switching to bugfix version 5.2.2 which has these two commits espressif/esp-idf@1f3f65b , espressif/esp-idf@dcf6b54

Regarding the third change - it's already backported to release/v5.2 branch (but changes are not synced with github yet). They seem to appear soon

no visible effect on bootloader

This would be a breaking change, will be added in v6.0

Lapshin · 2024-06-13T04:59:31Z

@rojer , you could check espressif/esp-idf@5320ec2 in release/v5.2 branch

rojer · 2024-06-14T15:02:11Z

@Lapshin I did a quick test with current release/v5.2 (espressif/esp-idf@e5ffb3c), and results are interesting:

Xtensa (ESP32) target binary size went down by almost 50K: 1632032 -> 1581616
Similar RISC-V (ESP32-C3) target went up by almost 12K: 1623536 -> 1635792

needless to say, i'm quite happy with (1) and not so much with (2) :)
do you have an explanation for why this may be the case?

Lapshin · 2024-06-15T07:34:27Z

@rojer , I also did not expect this T_T

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE or with mapuche that has gui

jjsuwa-sys3175 · 2024-06-19T09:21:51Z

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

Please see gcc-mirror/gcc@2314108, I hope this is what you wanted :)

rojer · 2024-06-19T22:29:53Z

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

jjsuwa-sys3175 · 2024-06-20T04:16:34Z

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

All I can say is: I am not a party to that :)

Lapshin · 2024-06-20T08:59:20Z

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

@rojer , from the IDF side we are waiting for GCC 14.2 release with bugfixes to make the new release. It looks like this will be available from v5.4

@jjsuwa-sys3175 , thank you for giving the solution quickly!

jjsuwa-sys3175 · 2024-06-20T11:07:42Z

Sorry for this being completely off-topic, but I would like to mention the use of "-mextra-l32r-costs=" in situations where speed is more important than size.

Reading data from the instruction memory, such as L32R instruction, may require implementation-dependent additional clocks in addition to 1 clock execution for the instruction itself and 1 clock delay due to the pipeline until the read result is available.

For example, on the ESP8266, a delay of 4 or 5 clocks is observed for each L32R instructions that cannot be hidden by overlapping with other instructions.

Therefore, by compiling with "-O2 -mextra-l32r-costs=5", constant synthesis (avoiding the use of time-consuming L32Rs) can be performed more aggressively (the default value of "-mextra-l32r-costs=" is 0, so not specifying it means that there are no additional clocks for the execution of L32R).

rojer · 2024-06-20T11:54:57Z

@jjsuwa-sys3175 thanks for the additional context, it's good to know. however, at least in our environment, we almost exclusively operate under the memory pressure, both RAM and flash, while cycles are essentially free or near free.

rojer · 2024-06-25T21:02:36Z

@Lapshin

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE

see attached diffs for ESP32 and ESP32-C3. what do you see?
on C3 i see 10K+ of wireless bloat... sorry, innovation :) at the same time, ESP32 lost a bunch of fat.

diff_esp32.txt
diff_esp32c3.txt

rojer · 2024-06-25T21:27:02Z

also, looks like idf_size.py has a bug computing .rodata size - no way pthread.c uses 150K+ of rodata. it seems that the first entry gets bogus size.

Lapshin · 2024-10-03T13:21:17Z

@rojer , sorry for the long delay. Latest news:

Toolchain with GCC 14.2.0 was released
- It has a change gcc-mirror/gcc@2314108

Regarding .rodata size you can learn here (link):

The size of the .rodata section in the Flash Data memory type may appear very large for a single archive. This occurs due to linker relaxations. The linker may attempt to combine object file sections with MERGE and STRINGS flags from all archives into one to perform tail string optimization. Consequently, one archive may end up with a very large .rodata section, containing string literals from other archives. This is evident in the .rodata section of the libesp_app_format.a archive. The specific compiler behavior here can be turned off by enabling CONFIG_COMPILER_NO_MERGE_CONSTANTS option (only for GCC toolchain), please read help for more details.

I will take a look at your diff files soon

Lapshin · 2024-10-07T12:28:42Z

@rojer , as I can see you already created issue espressif/esp-idf#14076 for the part that has grown the most.

Sorry, but I can't help you here because I'm not a member of bluetooth/wifi team. But from the IDF perspective you still have some options to free some space. For example:

Disable CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG if you don't need this (save about 3200 bytes).
Disable CONFIG_VFS_SUPPORT_TERMIOS (save ~2000 bytes)
The full list you can learn on this page.
this page in case there is lack of free IRAM

rojer · 2024-11-10T20:15:10Z

sorry for the delay, finally had some spare time to test the new toolchain.

GCC 13.2.0 -> 14.2.0: nice 14.6K size reduction on Xtensa, no change on RISC-V
CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG - n/a (we're using 5.2.2, will keep in mind for the future)
CONFIG_VFS_SUPPORT_TERMIOS=n - additional 5.5K saved on xtensa, and 6K saved on risc-v - great!

i will also take a look at the docs

ok, i'm going to close this one, i am satisfied with the improvements made to the toolchain and config, we're good.
thanks @Lapshin and @jjsuwa-sys3175 for your help, much appreciated!

rojer · 2024-11-10T23:25:42Z

@Lapshin i also found some time to investigate another size-related issue, for which so far we've been using a workaround: during update from 4.4 to 5.2, our c++ code (and our app is primarily c++) ballooned significantly.
i found a workaround: keeping cxx_std at gnu++17 instead of gnu++2b made it go away, but didn't investigate at the time.
well, today i did, and found the reason: #67

while c++17 is ok, we'd very much like to move on to newer and shinier things but we can't, because of the size bloat.
could you please have a look?

while looking, i found that actually, switching from c++17 to c++2b should be beneficial for the size as there are more size reductions than increases: for the test app i'm using, 869 symbols shrunk in size compared to 474 that grew. unfortunately, the std::string-related bloat seems to have drowned out all the improvements made elsewhere and net gain in size is about 16K for us, which is a lot.

PBrunot added a commit to fablab-bergamo/Fab-O-matic that referenced this issue May 24, 2024

Compiler flags tweaks for size opt

0207038

espressif/crosstool-NG#52

rojer closed this as completed Nov 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

rojer commented Apr 13, 2024

rojer commented Apr 13, 2024

Lapshin commented Jun 10, 2024 •

edited

Loading

rojer commented Jun 10, 2024

Lapshin commented Jun 11, 2024

jcmvbkbc commented Jun 11, 2024

jjsuwa-sys3175 commented Jun 11, 2024

rojer commented Jun 11, 2024

jjsuwa-sys3175 commented Jun 11, 2024 •

edited

Loading

rojer commented Jun 11, 2024

Lapshin commented Jun 12, 2024 •

edited

Loading

Lapshin commented Jun 13, 2024

rojer commented Jun 14, 2024 •

edited

Loading

Lapshin commented Jun 15, 2024 •

edited

Loading

jjsuwa-sys3175 commented Jun 19, 2024

rojer commented Jun 19, 2024

jjsuwa-sys3175 commented Jun 20, 2024

Lapshin commented Jun 20, 2024 •

edited

Loading

jjsuwa-sys3175 commented Jun 20, 2024

rojer commented Jun 20, 2024

rojer commented Jun 25, 2024

rojer commented Jun 25, 2024

Lapshin commented Oct 3, 2024

Lapshin commented Oct 7, 2024 •

edited

Loading

rojer commented Nov 10, 2024 •

edited

Loading

rojer commented Nov 10, 2024

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

Comments

rojer commented Apr 13, 2024

rojer commented Apr 13, 2024

Lapshin commented Jun 10, 2024 • edited Loading

rojer commented Jun 10, 2024

Lapshin commented Jun 11, 2024

jcmvbkbc commented Jun 11, 2024

jjsuwa-sys3175 commented Jun 11, 2024

rojer commented Jun 11, 2024

jjsuwa-sys3175 commented Jun 11, 2024 • edited Loading

rojer commented Jun 11, 2024

Lapshin commented Jun 12, 2024 • edited Loading

Lapshin commented Jun 13, 2024

rojer commented Jun 14, 2024 • edited Loading

Lapshin commented Jun 15, 2024 • edited Loading

jjsuwa-sys3175 commented Jun 19, 2024

rojer commented Jun 19, 2024

jjsuwa-sys3175 commented Jun 20, 2024

Lapshin commented Jun 20, 2024 • edited Loading

jjsuwa-sys3175 commented Jun 20, 2024

rojer commented Jun 20, 2024

rojer commented Jun 25, 2024

rojer commented Jun 25, 2024

Lapshin commented Oct 3, 2024

Lapshin commented Oct 7, 2024 • edited Loading

rojer commented Nov 10, 2024 • edited Loading

rojer commented Nov 10, 2024

Lapshin commented Jun 10, 2024 •

edited

Loading

jjsuwa-sys3175 commented Jun 11, 2024 •

edited

Loading

Lapshin commented Jun 12, 2024 •

edited

Loading

rojer commented Jun 14, 2024 •

edited

Loading

Lapshin commented Jun 15, 2024 •

edited

Loading

Lapshin commented Jun 20, 2024 •

edited

Loading

Lapshin commented Oct 7, 2024 •

edited

Loading

rojer commented Nov 10, 2024 •

edited

Loading