Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

Closed
rojer opened this issue Apr 13, 2024 · 25 comments
Closed

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52

rojer opened this issue Apr 13, 2024 · 25 comments

Comments

@rojer
Copy link

rojer commented Apr 13, 2024

We are migrating from IDF 4.4.4 to 5.2.1 and among the many changes is the toolchgain update from GCC 8.4.0 (xtensa-esp32-elf-gcc8_4_0-esp-2021r2-patch5) to 13.2.0 (xtensa-esp-elf-13.2.0_20230928).

Unfortunately, it seems that it comes with an across the board regression in the output binary size - many functions gain size, resulting in overall binary size increase, most critically we are bumping into IRAM size limits on some of our apps.

By comparing code generated with the different toolchains, i identified the following problems:

  • movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.
  • the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.
  • sometimes this code is duplicated where previous result could be reused.

On the other hand, there are positives too:

  • inlined loop-based memset: size-neutral and eliminates a call. neat!
  • better register allocation: using a10 avoids an extra move when it's used again as an argument to a function call.

This was after just a quick look at one particular function: spi_flash_chip_winbond_page_program, it gained 10 bytes. 4 bytes of those are accounted for by initialization of .flags, but even without that the function comes out 2 bytes bigger. Here are the notes from my analysis.

@rojer
Copy link
Author

rojer commented Apr 13, 2024

ok, some of the movi /movi.n difference can be attributed to jump address alignment and the denisty can be improved with -mno-target-align. this still leaves one movi where movi.ncould be used - in the struct initialzation block.
however, size reduction is significant: 8K in our case.

PBrunot added a commit to fablab-bergamo/Fab-O-matic that referenced this issue May 24, 2024
PBrunot added a commit to fablab-bergamo/Fab-O-matic that referenced this issue May 25, 2024
Update test_tasks.cpp
Workaround strange std::chrono::steady_clock behaviours
going back to millis()
updating tests
Compiler flags tweaks for size opt

espressif/crosstool-NG#52

small changes

more constexpr

Update settings.json

Removed -mno-target-align
PBrunot added a commit to fablab-bergamo/Fab-O-matic that referenced this issue May 25, 2024
Update test_tasks.cpp
Workaround strange std::chrono::steady_clock behaviours
going back to millis()
updating tests
Compiler flags tweaks for size opt

espressif/crosstool-NG#52

small changes

more constexpr

Update settings.json

Removed -mno-target-align
PBrunot added a commit to fablab-bergamo/Fab-O-matic that referenced this issue May 26, 2024
Update test_tasks.cpp
Workaround strange std::chrono::steady_clock behaviours
going back to millis()
updating tests
Compiler flags tweaks for size opt

espressif/crosstool-NG#52

small changes

more constexpr

Update settings.json

Removed -mno-target-align
@Lapshin
Copy link
Collaborator

Lapshin commented Jun 10, 2024

@rojer , sorry for the long delay.

Recently binary size was fixed with linker script changes:

Could you please check your project with these changes?

@rojer
Copy link
Author

rojer commented Jun 10, 2024

well, those seem unrelated to the issues i reported here that are about assembly code generation.
that said - thank you for letting me know, i will see if i can test with these changes soon.

@Lapshin
Copy link
Collaborator

Lapshin commented Jun 11, 2024

@rojer , sure, these changes are unrelated but that also reduces binary size significantly after the toolchain upgrade. I didn't have time to take a look at your points yet, will do this a bit later.

@jcmvbkbc , maybe you can answer them quickly?

@jcmvbkbc
Copy link

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

@jjsuwa-sys3175
Copy link

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

I agree so.

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

and 4 bytes of litpool entry :)

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

Maybe it would be better to have an option to control whether or not constant synthesis is performed individually.

@rojer
Copy link
Author

rojer commented Jun 11, 2024

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

@jjsuwa-sys3175
Copy link

jjsuwa-sys3175 commented Jun 11, 2024

This may not be an excuse, but for the ESP8266 Arduino core, these patches including the constant synthesis work well in terms of reducing size and clock cycles.

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

As it happens, I have a WIP patch that does exactly what you describe :)

The reason why that patch has been left in WIP for so long is that gcc's built-in litpool mechanism provides duplicate elimination but does not provide recording the number of duplicates and referencing them later. So, I have to create them myself, which is so ugly :-(

@rojer
Copy link
Author

rojer commented Jun 11, 2024

@Lapshin

Could you please check your project with these changes?

we are using 5.2.1, and these were kind of difficult to apply after the big refactoring in espressif/esp-idf@40be44f but i tried to follow the spirit if not the letter

discarding eh_frame if exceptions are not enabled - nice, significant reduction of ~10K

couldn't apply this to app, no visible effect on bootloader

~500 byte saved

@Lapshin
Copy link
Collaborator

Lapshin commented Jun 12, 2024

we are using 5.2.1

@rojer , you may consider switching to bugfix version 5.2.2 which has these two commits espressif/esp-idf@1f3f65b , espressif/esp-idf@dcf6b54

Regarding the third change - it's already backported to release/v5.2 branch (but changes are not synced with github yet). They seem to appear soon

no visible effect on bootloader

This would be a breaking change, will be added in v6.0

@Lapshin
Copy link
Collaborator

Lapshin commented Jun 13, 2024

@rojer , you could check espressif/esp-idf@5320ec2 in release/v5.2 branch

@rojer
Copy link
Author

rojer commented Jun 14, 2024

@Lapshin I did a quick test with current release/v5.2 (espressif/esp-idf@e5ffb3c), and results are interesting:

  1. Xtensa (ESP32) target binary size went down by almost 50K: 1632032 -> 1581616
  2. Similar RISC-V (ESP32-C3) target went up by almost 12K: 1623536 -> 1635792

needless to say, i'm quite happy with (1) and not so much with (2) :)
do you have an explanation for why this may be the case?

@Lapshin
Copy link
Collaborator

Lapshin commented Jun 15, 2024

@rojer , I also did not expect this T_T

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE or with mapuche that has gui

@jjsuwa-sys3175
Copy link

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

Please see gcc-mirror/gcc@2314108, I hope this is what you wanted :)

@rojer
Copy link
Author

rojer commented Jun 19, 2024

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

@jjsuwa-sys3175
Copy link

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

All I can say is: I am not a party to that :)

@Lapshin
Copy link
Collaborator

Lapshin commented Jun 20, 2024

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

@rojer , from the IDF side we are waiting for GCC 14.2 release with bugfixes to make the new release. It looks like this will be available from v5.4

@jjsuwa-sys3175 , thank you for giving the solution quickly!

@jjsuwa-sys3175
Copy link

Sorry for this being completely off-topic, but I would like to mention the use of "-mextra-l32r-costs=" in situations where speed is more important than size.

Reading data from the instruction memory, such as L32R instruction, may require implementation-dependent additional clocks in addition to 1 clock execution for the instruction itself and 1 clock delay due to the pipeline until the read result is available.

For example, on the ESP8266, a delay of 4 or 5 clocks is observed for each L32R instructions that cannot be hidden by overlapping with other instructions.

Therefore, by compiling with "-O2 -mextra-l32r-costs=5", constant synthesis (avoiding the use of time-consuming L32Rs) can be performed more aggressively (the default value of "-mextra-l32r-costs=" is 0, so not specifying it means that there are no additional clocks for the execution of L32R).

@rojer
Copy link
Author

rojer commented Jun 20, 2024

@jjsuwa-sys3175 thanks for the additional context, it's good to know. however, at least in our environment, we almost exclusively operate under the memory pressure, both RAM and flash, while cycles are essentially free or near free.

@rojer
Copy link
Author

rojer commented Jun 25, 2024

@Lapshin

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE

see attached diffs for ESP32 and ESP32-C3. what do you see?
on C3 i see 10K+ of wireless bloat... sorry, innovation :) at the same time, ESP32 lost a bunch of fat.

diff_esp32.txt
diff_esp32c3.txt

@rojer
Copy link
Author

rojer commented Jun 25, 2024

also, looks like idf_size.py has a bug computing .rodata size - no way pthread.c uses 150K+ of rodata. it seems that the first entry gets bogus size.

@Lapshin
Copy link
Collaborator

Lapshin commented Oct 3, 2024

@rojer , sorry for the long delay. Latest news:

Regarding .rodata size you can learn here (link):

The size of the .rodata section in the Flash Data memory type may appear very large for a single archive. This occurs due to linker relaxations. The linker may attempt to combine object file sections with MERGE and STRINGS flags from all archives into one to perform tail string optimization. Consequently, one archive may end up with a very large .rodata section, containing string literals from other archives. This is evident in the .rodata section of the libesp_app_format.a archive. The specific compiler behavior here can be turned off by enabling CONFIG_COMPILER_NO_MERGE_CONSTANTS option (only for GCC toolchain), please read help for more details.

I will take a look at your diff files soon

@Lapshin
Copy link
Collaborator

Lapshin commented Oct 7, 2024

@rojer , as I can see you already created issue espressif/esp-idf#14076 for the part that has grown the most.

Sorry, but I can't help you here because I'm not a member of bluetooth/wifi team. But from the IDF perspective you still have some options to free some space. For example:

  • Disable CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG if you don't need this (save about 3200 bytes).
  • Disable CONFIG_VFS_SUPPORT_TERMIOS (save ~2000 bytes)
  • The full list you can learn on this page.
  • this page in case there is lack of free IRAM

@rojer
Copy link
Author

rojer commented Nov 10, 2024

sorry for the delay, finally had some spare time to test the new toolchain.

  • GCC 13.2.0 -> 14.2.0: nice 14.6K size reduction on Xtensa, no change on RISC-V
  • CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG - n/a (we're using 5.2.2, will keep in mind for the future)
  • CONFIG_VFS_SUPPORT_TERMIOS=n - additional 5.5K saved on xtensa, and 6K saved on risc-v - great!

i will also take a look at the docs

ok, i'm going to close this one, i am satisfied with the improvements made to the toolchain and config, we're good.
thanks @Lapshin and @jjsuwa-sys3175 for your help, much appreciated!

@rojer rojer closed this as completed Nov 10, 2024
@rojer
Copy link
Author

rojer commented Nov 10, 2024

@Lapshin i also found some time to investigate another size-related issue, for which so far we've been using a workaround: during update from 4.4 to 5.2, our c++ code (and our app is primarily c++) ballooned significantly.
i found a workaround: keeping cxx_std at gnu++17 instead of gnu++2b made it go away, but didn't investigate at the time.
well, today i did, and found the reason: #67

while c++17 is ok, we'd very much like to move on to newer and shinier things but we can't, because of the size bloat.
could you please have a look?

while looking, i found that actually, switching from c++17 to c++2b should be beneficial for the size as there are more size reductions than increases: for the test app i'm using, 869 symbols shrunk in size compared to 474 that grew. unfortunately, the std::string-related bloat seems to have drowned out all the improvements made elsewhere and net gain in size is about 16K for us, which is a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants