Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XCC RG-2017.8-linux: generated object code is affected by -g2 level (the default level) + longer source paths #7114

Closed
marc-hb opened this issue Feb 17, 2023 · 16 comments
Labels
bug Something isn't working as expected P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 17, 2023

tl;dr: if you're failing to reproduce a build, try a longer or shorter west topdir source directory. That's the workaround. There seems to be a threshold somewhere around 27 characters.

Summary

A very strange and complex toolchain issue where the -g level and debug symbols affect the object code on some Intel CI build system(s) for a totally unknown reason.

Unexpected variations like this break build reproducibility = how this issue was discovered in the first place.

Variations like these could also make it harder to reproduce race conditions and other difficult bugs.

I focused on .text.k_ticks_to_us_floor64 and RG-2017.8-linux below because it just happened to be the first difference I found but I noticed that the -g level also affects .text.z_swap_irqlock in the same conditions and there could be unexpected differences in other files.

=> Don't trust RG-2017.8-linux to produce consistent object code

This is not a problem with the Zephyr SDK toochain: it produces identical object code across Linux and Windows, as routinely verified in CI across all targets: https://github.com/thesofproject/sof/actions/runs/4189511798/jobs/7262044804

I haven't observed any problem with RI-2020.5-linux (and MTL) yet which Zephyr still identifies as GNU 4.2.2. Note there is on-going work to switch to xt-clang.

Longer story

Reproduced on SOF commit 64fe2ea + west update + sof/scripts/xtensa-build-zephyr.py -p tgl

Reproduced with a couple other SOF commits too.

The code generated for the k_ticks_to_us_floor64() normally looks like this:

xt-objdump -s --section=.text.k_ticks_to_us_floor64 ../build-tgl/zephyr/kernel/CMakeFiles/kernel.dir/sched.c.obj

Contents of section .text.k_ticks_to_us_floor64:
 0000 3661000c 05b10000 61000041 00000c07  6a......a..A....
 0010 b08682b0 b6a2a2c8 fe87ba01 1bbb0c0d  ................
 0020 c2afff81 0000e008 00a0d4a2 a0e582a0  ................
 0030 c482b0f4 827cfb0c 0afaddea dd16fd07  .....|..........
 0040 9cea40a2 8250d282 40b2a240 c382dabb  ..@..P..@..@....
 0050 cabbdd07 cd068100 00e00800 3d0b2d0a  ............=.-.
 0060 1df0bd03 ad02dd07 cd068100 00e00800  ................
 0070 b0d482a0 c582a0b4 a2a0a482 dabbcabb  ................
 0080 dd07cd06 810000e0 0800cd06 dd07a911  ................
 0090 b901ad02 bd038100 00e00800 3801a095  ............8...
 00a0 82e811b0 b482a084 a2a02482 ba882a2e  ..........$...*.
 00b0 9a888a33 e7b2151b 331df000 00000000  ...3....3.......   <== padding / alignment
 00c0 ceebb0c0 41f73f0d 0c1a86dc ff1df0    ....A.?........ 

=> Note the five 00 bytes of padding.

@andyross, who answered A LOT of questions during this investigation (thank you!), suspects xt-as adds this padding for cache alignment and performance reasons. This padding is there in most but unfortunately not all cases, which is a bug.

You don't need SOF to observe this code and padding, you can produce it with a pure upstream zephyr workspace:

ZEPHYR_TOOLCHAIN_VARIANT=xcc

XTENSA_TOOLCHAIN_PATH=$XCCLOC/install/tools
TOOLCHAIN_VER=RG-2017.8-linux
XTENSA_SYSTEM=$XCCLOC/install/builds/RG-2017.8-linux/cavs2x_LX6HiFi3_2017_8/config
export ZEPHYR_TOOLCHAIN_VARIANT XTENSA_TOOLCHAIN_PATH TOOLCHAIN_VER XTENSA_SYSTEM

cd zephyr/
git checkout d9c4ec31fc49e7eef3c
west build -p -b intel_adsp_cavs25 samples/hello_world/ -- -DCONFIG_SYS_CLOCK_TICKS_PER_SEC=15000 -DCONFIG_SPEED_OPTIMIZATIONS=y
xt-objdump -s --section=.text.k_ticks_to_us_floor64 zephyr/kernel/CMakeFiles/kernel.dir/sched.c.obj

Make sure you use ZEPHYR_TOOLCHAIN_VARIANT=xcc and other variables below. If .text.k_ticks_to_us_floor64 is missing then you probably forgot to switch to that toolchain.

I could unfortunately not reproduce this issue with hello_world in any configuration; so far reproduction requires SOF.

On at least two of our automated build systems (sofbld07 and 08, Ubuntu 20.04), the padding disappears in the default configuration. This breaks build reproducibility. There are other differences in sched.c.obj caused by -g (apparently not in other .c.obj files, which don't have 150+ ELF sections)

EDIT: this happens because these systems use longer source paths / debug symbols, that's what triggering this compiler bug.

When unspecified with -g, the default debug level is '-g2'. All other things equal, the high debug level set by default by Zephyr makes the usual padding disappear on these particular systems.

When decreasing the debug level in cmake/compiler/gcc/compiler_flags.cmake to -g1 or -g0 or no -g at all and making no other change whatsoever, the usual padding re-appears! In other words, a lower -g debug level makes the object code that runs on the DSP normal again.

--- a/cmake/compiler/gcc/compiler_flags.cmake
+++ b/cmake/compiler/gcc/compiler_flags.cmake
@@ -168,11 +168,11 @@ check_set_compiler_property(APPEND PROPERTY hosted -fno-freestanding)
 check_set_compiler_property(PROPERTY freestanding -ffreestanding)
 
 # Flag to enable debugging
-set_compiler_property(PROPERTY debug -g)
+set_compiler_property(PROPERTY debug -g1 ) 
 
 # GCC 11 by default emits DWARF version 5 which cannot be parsed by

For quick testing use:

cd build-tgl/
touch ../zephyr/kernel/sched.c

ninja -j1 -v

... then copy/paste and edit the line that compiles sched.c

The difference appears during the -S sched.i -> sched.s compilation step.

The compiled sched-g1.s and sched-g2.s have no actual difference in the assembly code, they only have a lot of .byte lines which are different. Somehow these bytes affects the assembly phase and the object code on those systems.

The pre-processed files sched-g1.i and sched-g2.i are strictly identical.

Another proof that the -g level triggers the issue: "hiding" the zephyr source code during the -S sched.i -> sched.s step restores the usual padding. stracing with -e openat the compiler at that step shows that it reads hundreds pf source files when using -g (even at the -g1 level that preserves the usual padding).

I tried to reproduce on a few other systems including another Ubuntu 20.04 but did not: the padding is always there no matter what I tried, only a few systems are affected. I'm of course using the very same toolchain.

I could not reproduce on the "guilty" systems with samples/hello_world/ either: with hello_world the padding is always there too. Too few debug symbols in hello_world to trigger this bug?

I unfortunately couldn't find what's unusual about these CI systems. Whatever makes them special, I don't think that, all other things being equal, the generated object code should EVER differ between -g1 and -g2 in any circumstance. It should even less differ on some systems but not on others when using the very same toolchain. So this qualifies as a compiler bug IMHO and makes that toolchain untrustworthy for reproducible builds.

cc:

@marc-hb marc-hb added the bug Something isn't working as expected label Feb 17, 2023
@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2023

Some relevant code

include/zephyr/sys/time_units.h:#define Z_HZ_ticks CONFIG_SYS_CLOCK_TICKS_PER_SEC

static TIME_CONSTEXPR inline uint64_t k_ticks_to_us_floor64(uint64_t t)
{
        /* Generated.  Do not edit.  See above. */
        return z_tmcvt(t, Z_HZ_ticks, Z_HZ_us, true, false, false, false);
}

sched.s

   ...
.LBB16_k_ticks_to_us_floor64:	# 0xaa
	addi.n	a3,a3,1                	# [0]  
	.loc	19	1190	0
	retw.n                        	# [1]  

.LBB7_k_ticks_to_us_floor64:	# 0xae
#<freq> BB:7 => BB:8 probability = 0.50000
#<freq> BB:7 => BB:9 probability = 0.50000
	.frequency 0.000 0.160
	.loc	19	150	0
 {	# format ae_format2
	bltu.w18	a11,a12,.Lt_3_19458  	# [0]  
	nop                           	#  
 }

.Lt_3_19714:	# 0xb6
	movi.n	a10,1                  	# [0]  
	j	.Lt_3_19458                 	# [1]  

.Lt_3_19970:	# 0xbb
	.frequency 0.000 0.092
	.loc	19	1190	0
	retw.n                        	# [0]  

After assembly and disassembly


  b0:	889a                	add.n	a8, a8, a9
  b2:	338a                	add.n	a3, a3, a8
  b4:	15b2e7               	bgeu	a2, a14, cd <k_ticks_to_us_floor64+0xcd>
  b7:	331b                	addi.n	a3, a3, 1
  b9:	f01d                	retw.n

000000bb <k_ticks_to_us_floor64+0xbb>:
  bb:	00000000 b0ebce00                                .....   <= 5x '00' padding sometimes lost 

000000c0 <k_ticks_to_us_floor64+0xc0>:
  c0:	0d3ff741c0b0ebce 	{ bltu.w18	a11, a12, 40 <k_ticks_to_us_floor64+0x40>; nop }
  c8:	1a0c                	movi.n	a10, 1
  ca:	ffdc86               	j	40 <k_ticks_to_us_floor64+0x40>

000000cd <k_ticks_to_us_floor64+0xcd>:
  cd:	f01d                	retw.n

Dump WITHOUT the padding. Some offsets are adjust accordingly.

 0a62cda8 <.text.k_ticks_to_us_floor64>:
  a62cda8:      0c006136 0000b105 41000061 070c0000     6a......a..A....
  a62cdb8:      b08286b0 c8a2a2b6 01ba87fe 0d0cbb1b     ................
  a62cdc8:      81ffafc2 08e00000 a2d4a000 a082e5a0     ................
- a62cdd8:      f4b082c4 0cfb7c82 eaddfa0a 07ad16dd     .....|..........
  a62cde8:      a240ea9c 82d25082 40a2b240 bbda82c3     ..@..P..@..@....
  a62cdf8:      07ddbbca 008106cd 0008e000 0a2d0b3d     ............=.-.
  a62ce08:      03bdf01d 07dd02ad 008106cd 0008e000     ................
  a62ce18:      a082d4b0 b4a082c5 82a4a0a2 bbcabbda     ................
  a62ce28:      06cd07dd e0000081 06cd0008 11a907dd     ................
  a62ce38:      02ad01b9 008103bd 0008e000 95a00138     ............8...
  a62ce48:      b011e882 84a082b4 8224a0a2 2e2a88ba     ..........$...*.
- a62ce58:      338a889a 1b10b2e7 cef01d33 4110b0eb     ...3....3......A
- a62ce68:      0c0d3ff8 ffddc61a 0000f01d               .?........

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 17, 2023

I asked people in my team to run the xt-objdump command above. Most of them have the padding but (at least) one other person was also missing the padding in the default configuration.

As soon as he lower the debug level to -g1 the padding was restored, exactly like sofbld07 and sofbld08. So these two CI systems are "special" but no so special.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 18, 2023

I compared all the toolchain files between sofbld07 which drops the padding by default and my workstation which does not. A number of text files in the toolchain have absolute installation paths in them. After a mass search/replace of these installation paths, all the files in the toolchain were byte for byte identical across the two machines.

@lgirdwood
Copy link
Member

@marc-hb XCC bugs go to vendor.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 21, 2023

The differences between build machines is actually just a difference in (debug) path lengths. I think I can now reproduce the object code change on any system as long as west topdir | wc is higher than 27 characters. Again lowering the debug level to -g1 or -g0 avoids the object code variation.

@marc-hb marc-hb changed the title XCC RG-2017.8-linux: generated object code is affected by the -gN level on Intel CI build systems XCC RG-2017.8-linux: generated object code is affected by the -g2 level + longer source paths Feb 21, 2023
@marc-hb marc-hb changed the title XCC RG-2017.8-linux: generated object code is affected by the -g2 level + longer source paths XCC RG-2017.8-linux: generated object code is affected by -g2 level (the default level) + longer source paths Feb 21, 2023
@andyross
Copy link
Contributor

FWIW: I have a setup internally where I have xt-xcc/clang version RI-2021.6 built for all the SOF targets (and a handful of other tool versions, e.g. the original targets for all the devices). If you have a simple-ish test case it wouldn't be much of a hardship to try it across the cross product of the version/device space to try to isolate the problem.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 21, 2023

If you have a simple-ish test case

I tried again to reproduce without SOF but still could not. Maybe because the number of ELF sections is much lower? RI-2021.6 does not seem affected but I did not try to "stress" it.

Reproducing with RG-2017.8-linux and SOF is now extremely simple. I understand it's time-consuming to install that toolchain if you don't have it yet.

@andyross
Copy link
Contributor

I guess what I was thinking of as a minimal test was "here's an preprocessed C file and some other files read by the toolchain in a tarball, unpack it into /opt/foo and it works, but /opt/very___long___name/ fails".

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 21, 2023

Never a dull moment! sof commit a942f10 ("west.yml: upgrade Zephyr to 0c0d73721ed") added ~1000 new zephyr commits which seems to change everything; now the padding seems always dropped?!?

Let's ignore that, too many variables already. Let's focus on the original bug description which is unchanged in sof commit 7fdc623.

Also, while I could never caught it "red-handed", I recommended temporarily uninstalling ccache to stay on the safe side. This issue has too many variables already.

The attached sched.* files were created with the following commands ran in two directories. The two directories have strictly identical content, the only difference is one has a longer name.

/tmp/zprj/sof/
/tmp/long_enough_directory_name/sof/

The object code padding is found when compiling SOF in the short path /tmp/zprj/ and it is missing when compiling in the long path /tmp/long_enough_directory_name/.

This cannot be anything but a toolchain bug.

git -C sof checkout 7fdc62375a72
west update
west status
rm -rf build-* && time sof/scripts/xtensa-build-zephyr.py  -p tgl -j 200

I'm attaching sched.i, sched.s and sched.o for both directories. They were generated thanks to cd build-tgl; touch ../zephyr/kernel/sched.c, ninja -j1 -v and modifying and re-running the xt-xcc sched.c command displayed.

missing_padding_sched.zip

Expected debug path differences aside, the two sched.i files are strictly identical.

The two sched.s have only .byte and .file differences, the actual assembly code is the same. They seem enough to reproduce the issue anywhere else, see below.

Yet the two sched.o have object code differences, notably this missing padding (but not just)

This very short xt-xcc command is enough to reproduce on another system the two sched.o files from their respective sched.s sources without playing any path game:

RG-2017.8/xt-xcc -c -mlongcalls sched.s 

md5sum *.o
8ccdc4347fdb76f9e83bd547981c433f  long_sched.o
9647d93e79ab7841a2ce001f53d28345  short_sched.o

valgrind --trace-children=yes reports only small leaks in xt-xcc and xt-as, nothing really worrying.

@andyross
Copy link
Contributor

Nice, that's a smoking gun. I'll run these against all my local toolchain variants tomorrow to see if we can figure out which version(s) and/or core(s) are at fault.

Note that the ".byte" directive is the offending bit for sure, that's an instruction to assemble the list of numbers that follow into the stream, probably that's how the compiler has decided to express padding. The ".file" directive just specifies the name of the file, which obviously shouldn't change the generated code; GNU as documents it as a noop; my guess is xt-as does so too.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 22, 2023

Note that the ".byte" directive is the offending bit for sure, that's an instruction to assemble the list of numbers that follow into the stream, probably that's how the compiler has decided to express padding

The .byte directives are found only at the very end of the assembly, very far away from the k_ticks_to_us_floor64 code and .section, and only inside the .debug_xxx .sections.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 22, 2023

I filed new case # 46682483 at https://support.cadence.com

@marc-hb
Copy link
Collaborator Author

marc-hb commented Mar 2, 2023

... and the reply is unsurprisingly "upgrade your toolchain". Which will likely never happen for TGL, @lgirdwood confirm?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Mar 7, 2023

After internal discussion we're not going to upgrade the toolchain and perform a full round of validation for a reproducibility issue "only". We hope there is no other side effect.

Moreover there is a (funny!) workaround for the reproducibility issue: make your source directory longer or shorter.

Closing as wontfix.

@marc-hb marc-hb closed this as completed Mar 7, 2023
@marc-hb marc-hb added the won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) label Mar 7, 2023
@marc-hb
Copy link
Collaborator Author

marc-hb commented Mar 24, 2023

I think I can now reproduce the object code change on any system as long as west topdir | wc is higher than 27 characters.

I tested again with the same sof commit 64fe2ea and I confirm that there is a west topdir length threshold between 27 and 28 characters. There are other thresholds above 28 characters but none below 27.

Then I tested again with much later sof commit ac91071 and I could not reproduce this bug at all! west topdir lengths below 50 all produce the same output! Go figure...

git checkout ac910714f40a8d3 && west update
time ./scripts/xtensa-build-zephyr.py -j100 -p  tgl

build-sof-staging 
|-- sof 
|   |-- community 
|   +-- sof-tgl.ldc 	sha256=b203c4bf2f25ffe00f3742a75a688fbdcef643eb5fe880b14c12fa64cf1cb64a
|-- sof-info 
|   +-- tgl 
|       |-- boot.mod.gz 	sha256=ccb453b1b673b1704b4ece2e748478756f2cbe57da0945174b42b13385148650
|       |-- config.gz 
|       |-- generated_configs.c.gz 	sha256=ab016934dfaac3cd323aef9c40f5d119303ab83605b7b0c5c9c5579d3fd52975
|       |-- main.mod.gz 
|       |-- sof_versions.h 	sha256=182e4390f3c0f0120cca5e509a1a84aad7e399b05bcc1bb7a74dc819d2e37642
|       |-- stripped-main.elf.gz 	sha256=f22b40d12becc579d849089d29ea2cb3aabce116d0dd8c6cc1fb6ab0d6e04c06
|       |-- stripped-zephyr.elf.gz 	sha256=a74324a19c3e2d2a813ca26928af0ab0d48a19937f65c287c131bb90593d60bc
|       |-- zephyr.elf.gz 
|       |-- zephyr.lst.gz 	sha256=1ab15bff5b421f9a8eca063f4af6fb154b4baa2b8e1f422ca5893f8dc6295f04
|       |-- zephyr.map.gz 
|       +-- zephyr_version.h 	sha256=0332406ca0968edc99eab65a985656483da53a549125bd5adeed428803cb8d24

For v2.5, the west topdir length threshold is between 35 and 36.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Mar 24, 2023

The recent introduction of CONFIG_ASSERT in commit 4d67d2f means that ./scripts/xtensa-build-zephyr.py --debug ... will now always be different across systems and users, even for the stripped output. See

Totally different issue and not really a "bug".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)
Projects
None yet
Development

No branches or pull requests

3 participants